Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integer array type conversion in array_dist to compute hamming distance #191

Merged

Conversation

therealdarkknight
Copy link
Contributor

@therealdarkknight therealdarkknight commented Oct 3, 2023

The hamming_dist SQL function takes in INTEGER[] parameters, but we immediately cast the parameter arrays to a float4 array in the function array_dist. This can cause the hamming distance to be improperly computed.

I first noticed this because I saw the hamming_dist distance values in certain queries would not be ordered non-decreasingly down the column, as they should be if we're ordering vectors by nearest distance. This indicates that the distance computed by this SQL function was not the same as what usearch internally calls on our vectors and uses to sort distances. I believe this PR fixes that.

For an example of the current hamming_dist function not computing distances that are properly ordered in queries, change the last vector of small_world_array.sql in the tests to be {1,4,5} instead, and run the hnsw_dist_func test.

Tests might fail because hamming_distance values displayed in queries will change.

@dqii
Copy link
Contributor

dqii commented Oct 3, 2023

Can you update the tests? I believe there's a test case in hnsw_todo.sql that mentions this. It could be moved to hnsw_dist_func if this is resolved.

Curious, did DatumGetSizedFloatArray not address this?

@therealdarkknight
Copy link
Contributor Author

Curious, did DatumGetSizedFloatArray not address this?

It seems to me that DatumGetSizedFloatArray converts any INTEGER[] arrays in the vector column to a float4* array only for the purpose of inserting the vector into our index. But the SQL distance functions run last in a query. Usearch sorts vectors according to distance, and in amgettuple we return a heap pointer (for index scans). Postgres then takes the vector from the table heap, which is still an original INTEGER[] array and finally feeds it into the distance function. Hence, we need to convert again.

This also means we are computing distances twice (recomputing them for the top candidates returned after usearch), but I think this is the nature of integrating with postgres. Maybe could be a possible optimization later to keep the distances returned by usearch.

Can you update the tests? I believe there's a test case in hnsw_todo.sql that mentions this. It could be moved to hnsw_dist_func if this is resolved.

Will do.

@@ -330,7 +356,7 @@ Datum hamming_dist(PG_FUNCTION_ARGS)
{
ArrayType *a = PG_GETARG_ARRAYTYPE_P(0);
ArrayType *b = PG_GETARG_ARRAYTYPE_P(1);
PG_RETURN_INT32(array_dist(a, b, usearch_metric_hamming_k));
PG_RETURN_INT32((int32)array_dist(a, b, usearch_metric_hamming_k));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is returning the distance between vectors and not the vectors themselves. Why should this distance be an integer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hamming distance is always an integer and we also pass it to PG_RETURN_INT32 which returns an integer in postgres. So I added this explicit cast since array_dist returns a float4

src/hnsw.c Outdated
bx = (float4*)ARR_DATA_PTR(b);
}

float4 result = usearch_dist(ax, bx, metric_kind, a_dim, usearch_scalar_f32_k);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think this really solves the issue. We should not be measuring hamming distance of float arrays. We just get lucky here that when we cast ints to floats, they end up having the same byte arrangement and hamming distance calculation (an inherently, bitwise operation) results in the same number it would result in if we directly passed int arrays.

the right approach is to have usearch_dist support integer or byte typed arrays.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I can modify usearch_dist to take an integer typed array and compute hamming distance appropriately. Let me know

…l use in the index (builds, inserts, scans), and updated tests
@therealdarkknight
Copy link
Contributor Author

therealdarkknight commented Oct 10, 2023

It turns out that usearch's hamming distance function doesn't care about scalar type, and we can pass int32* or float4* arrays into it and it will work regardless. Also, the casting in usearch is such that we can pass an int32* array alongside the usearch_scalar_f32_k scalar type and it will be fine. So I updated the way we retrieve the vector array data from postgres' Datum and the way we pass it to usearch. I also did this with the SQL distance function, so the two are now agreeing (which means column distance values in queries will be appropriately ordered) and, from my testing, they give the correct bitwise hamming distance values.

I updated tests as well.

Note:
It might be nice to include values in small_world_array.sql whose integer and float bit representations differ significantly, to make it easier to catch any future bugs with the hamming distance feature. Right now, the only non-zero vector entry is 1.

Copy link
Contributor

@Ngalstyan4 Ngalstyan4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! just small nit.

@@ -44,8 +44,8 @@ SELECT ROUND(hamming_dist(v, '{0,0}')::numeric, 2) FROM small_world_ham ORDER BY
-------
0.00
2.00
4.00
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there still anything wrong with this query?
If not, it should be removed from hnsw_todo.sql (and moved to hnsw_dist_func.sql unless an equivalent test is already there)
If something is still wrong, please add a relevant comment.

@Ngalstyan4 Ngalstyan4 merged commit 0c0aceb into lanterndata:main Oct 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants