cat_hnswlib - Fork of Hnswlib with support for categorical filtering.
New categorical methods:
- `add_tags(labels, tag)` - assign `tag` to the specified `labels`
- `get_tags(label)` - returns the list of tags assigned to `label`
- `reset_tags()` - drop all tag-related information, including additionally built links
- `index_tagged(tag, m)` - build an additional navigation graph among the points tagged with `tag`. Ensures connectivity of conditional search
- `index_cross_tagged(tags, m)` - build an additional navigation graph among the points tagged with `tags`. Does not create new entry points. Useful for creating geo-indexes and numerical ranges
- `knn_query(data, k = 1, num_threads = -1, conditions = [])` - extended with the parameter `conditions`, which defines what points to include in search results. Performs traversal starting from the first point which fulfills the condition. Example: `(A | !B) & C` is represented as `[[(0, A), (1, B)], [(0, C)]]`, where A, B, C are logical clauses that are true if the respective tag is assigned to a point, and the first element of each pair marks negation (`1` means negated). `[[(0, 55)]]` means: find the closest point with tag 55.
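To make the `conditions` format concrete, here is a minimal pure-Python sketch of how such a condition can be evaluated against the tags of a single point. The helper `matches` and the type aliases are hypothetical names used only for illustration and are not part of cat_hnswlib; treating an empty condition as "no filter" is also an assumption.

```python
from typing import List, Set, Tuple

# (negated, tag): negated == 0 means the tag must be present,
# negated == 1 means the tag must be absent.
Clause = Tuple[int, int]
# Outer list is AND-ed together, each inner list is OR-ed.
Condition = List[List[Clause]]


def matches(point_tags: Set[int], condition: Condition) -> bool:
    """Hypothetical helper: does a point with `point_tags` satisfy `condition`?"""
    # Assumption: an empty condition places no restriction on the point.
    return all(
        any((tag in point_tags) != bool(negated) for negated, tag in group)
        for group in condition
    )


A, B, C = 1, 2, 3
condition = [[(0, A), (1, B)], [(0, C)]]  # (A | !B) & C

print(matches({A, C}, condition))  # True: A and C are present
print(matches({B, C}, condition))  # False: B is present and A is missing
print(matches({C}, condition))     # True: !B holds and C is present
print(matches({A}, condition))     # False: C is missing
```

The nested-list layout is a conjunction of disjunctions, so any boolean filter over tags can be passed to `knn_query` once it is rewritten in that form.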
pip install --no-binary :all: 'git+https://github.com/generall/cat_hnswlib.git#subdirectory=python_bindings'
import hnswlib
import numpy as np
from collections import defaultdict
import tqdm
dim = 50
elements = 10_000
parts_count = 100
hnsw = hnswlib.Index(space='cosine', dim=dim)
hnsw.init_index(max_elements = elements, ef_construction = 10, M = 16, random_seed=45)
points = np.random.rand(elements, dim)
hnsw.add_items(points)
# Assign tags by divisibility, for example
tags = defaultdict(list)
for i in range(elements):
    tags[i % parts_count].append(i)
for tag, ids in tqdm.tqdm(tags.items()):
    hnsw.add_tags(ids, tag)
    hnsw.index_tagged(tag, m=8)
target = np.float32(np.random.random((1, dim)))
# (False, 66) is a non-negated clause: the point must have tag 66
condition = [[(False, 66)]]
# Result will only include points with tag 66, i.e. id % parts_count == 66
found_labels, found_dist = hnsw.knn_query(target, k=10, conditions=condition)
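
One way to sanity-check a filtered query like the one above is to compare it with an exact brute-force search over the same subset of points. The sketch below uses only the standard library and smaller sizes than the example; `cosine_distance` and `brute_force_filtered_knn` are hypothetical helpers, not part of cat_hnswlib, and the distance is assumed to be 1 minus cosine similarity.

```python
import math
import random


def cosine_distance(a, b):
    """Assumed distance for space='cosine': 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_filtered_knn(points, target, k, keep):
    """Exact k nearest neighbours among labels for which keep(label) is True."""
    candidates = [
        (cosine_distance(point, target), label)
        for label, point in enumerate(points)
        if keep(label)
    ]
    candidates.sort()  # closest first
    return [label for _, label in candidates[:k]]


random.seed(45)
dim, elements, parts_count = 8, 1_000, 100
points = [[random.random() for _ in range(dim)] for _ in range(elements)]
target = [random.random() for _ in range(dim)]

# Tag 66 was assigned to every label with label % parts_count == 66.
expected = brute_force_filtered_knn(
    points, target, k=10, keep=lambda label: label % parts_count == 66
)
print(expected)
```

With the index built as in the example, `set(found_labels[0])` could be compared against such a brute-force result; because HNSW is an approximate method, the overlap may be below 100%.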
- Query planner
  - decide on search strategy depending on the number of data points covered by the given condition
    - If the number of points is small - use full scan
    - If there are large categories - search in them with the graph
    - If there is a large number of small OR-conditioned categories - search them independently and in parallel
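
The planner described above is not implemented; the following is only a rough sketch of how such a decision could look. Everything in it is an assumption made for illustration: the function names, the strategy labels, the thresholds `full_scan_limit` and `large_tag`, and the idea of estimating coverage from per-tag point counts.

```python
from typing import Dict, List, Tuple

Condition = List[List[Tuple[int, int]]]  # [[(negated, tag), ...], ...]


def estimate_group(group, tag_sizes: Dict[int, int], total: int) -> int:
    """Upper bound on the number of points matching one OR-group."""
    if any(negated for negated, _ in group):
        return total  # a negated clause may match almost every point
    return min(total, sum(tag_sizes.get(tag, 0) for _, tag in group))


def plan_search(condition: Condition, tag_sizes: Dict[int, int], total: int,
                full_scan_limit: int = 500, large_tag: int = 5000) -> str:
    """Hypothetical planner: pick a strategy name for the given condition."""
    if not condition:
        return "graph"  # no filter: ordinary graph search
    # AND-ed groups cannot match more points than the smallest group.
    covered = min(estimate_group(g, tag_sizes, total) for g in condition)
    if covered <= full_scan_limit:
        return "full_scan"
    only_group = condition[0]
    if (len(condition) == 1 and len(only_group) > 1
            and all(not negated and tag_sizes.get(tag, 0) < large_tag
                    for negated, tag in only_group)):
        return "parallel_per_tag"  # many small OR-ed categories
    return "graph"


tag_sizes = {1: 100, 2: 200, 3: 50_000, 4: 800, 5: 900}
total = 100_000
print(plan_search([[(0, 1)]], tag_sizes, total))          # full_scan
print(plan_search([[(0, 3)]], tag_sizes, total))          # graph
print(plan_search([[(0, 4), (0, 5)]], tag_sizes, total))  # parallel_per_tag
```

Real thresholds would have to be tuned against measured latency, and a real planner would need access to the per-tag counts kept by the index.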