CRC64 schema fingerprint endianness #489

kimgr · 2025-01-16T08:14:36Z

First off, thank you for a great library!

We use Avro's Single Object Encoding [1] to add a little header to binary payloads; basically:

magic := []byte{0xc3, 0x01}
fingerprint, _ := schema.FingerprintUsing(avro.CRC64Avro)
header := append(magic, fingerprint...)

In interop with the canonical Java library, we see failures to identify the schema here [2]. That code is very Java, but it essentially decodes the 8 fingerprint bytes, little endian, into a 64-bit integer. Their fingerprinting algorithm produces a little-endian byte sequence [3], so that's consistent.

Unfortunately there's a byte-order inconsistency with hamba/avro here. Your CRC64 implementation produces a big-endian fingerprint [4].
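To make the mismatch concrete, here is a minimal sketch using only the Go standard library. It assumes the decoder behaves as described above (check the magic, then read 8 bytes as a little-endian integer). The `header` and `decodeFingerprint` helpers and the fingerprint value are hypothetical and are not part of hamba/avro or the Java library.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// soeMagic is the two-byte Single Object Encoding marker quoted in the issue.
var soeMagic = []byte{0xc3, 0x01}

// header builds a 10-byte header: the magic followed by 8 fingerprint bytes.
// It allocates a fresh slice so neither input is modified.
func header(fingerprint []byte) []byte {
	h := make([]byte, 0, len(soeMagic)+len(fingerprint))
	h = append(h, soeMagic...)
	return append(h, fingerprint...)
}

// decodeFingerprint mimics the decoding described in the issue: verify the
// magic, then interpret the next 8 bytes as a little-endian 64-bit integer.
func decodeFingerprint(h []byte) (uint64, error) {
	if len(h) < 10 || !bytes.Equal(h[:2], soeMagic) {
		return 0, fmt.Errorf("not a single-object-encoded header")
	}
	return binary.LittleEndian.Uint64(h[2:10]), nil
}

func main() {
	const fp uint64 = 0x0102030405060708 // hypothetical fingerprint value

	be := make([]byte, 8)
	binary.BigEndian.PutUint64(be, fp)
	le := make([]byte, 8)
	binary.LittleEndian.PutUint64(le, fp)

	gotBE, _ := decodeFingerprint(header(be))
	gotLE, _ := decodeFingerprint(header(le))

	// Only the little-endian bytes round-trip to the original value.
	fmt.Printf("want:               0x%016x\n", fp)
	fmt.Printf("big-endian bytes:    0x%016x\n", gotBE)
	fmt.Printf("little-endian bytes: 0x%016x\n", gotLE)
}
```

A reader that looks up schemas by the decoded integer would fail to find the schema in the big-endian case, which matches the interop failure reported here.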

The specification doesn't say anything about byte order or representation of fingerprints in the abstract (only for Single Object Encoding), but I wonder if it would be a good idea to be consistent to avoid interop surprises?

This is a breaking change, of course, so if that's a concern, would you consider taking patches for a separate little-endian fingerprint type avro.CRC64LE? Or would you prefer to change the default? Or leave it to clients, maybe with a doc update somewhere? I'm happy to do the work if you guide me as to what changes you'd prefer (if any).

[1] https://avro.apache.org/docs/1.11.1/specification/#single-object-encoding
[2] https://github.com/apache/avro/blob/8e51c7e1c14116545c7b08e72f649064cbd9f1bb/lang/java/avro/src/main/java/org/apache/avro/message/BinaryMessageDecoder.java#L156
[3] https://github.com/apache/avro/blob/8e51c7e1c14116545c7b08e72f649064cbd9f1bb/lang/java/avro/src/main/java/org/apache/avro/SchemaNormalization.java#L64
[4] avro/pkg/crc64/crc64.go, line 96 in 68046a4:

    return [Size]byte{byte(s >> 56), byte(s >> 48), byte(s >> 40), byte(s >> 32), byte(s >> 24), byte(s >> 16), byte(s >> 8), byte(s)}


kimgr · 2025-01-16T11:13:26Z

Heh, also when trying to work around this, I was bitten by the schema cache in FingerprintUsing which effectively uses a global variable for each (schema,algo) pair. Reversing byte order on the returned fingerprint would modify the cached state and only work every other time.
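The aliasing problem can be sketched as follows. The `cache` map and `fingerprint` function below are a hypothetical stand-in for a per-(schema, algorithm) cache that hands out the cached slice itself; they are not the library's actual implementation. The point is only that a workaround should reverse a copy rather than the returned slice.

```go
package main

import "fmt"

// cache is a simplified, hypothetical model of a fingerprint cache.
var cache = map[string][]byte{}

// fingerprint returns the cached slice itself, so every caller shares
// the same backing array.
func fingerprint(key string) []byte {
	if fp, ok := cache[key]; ok {
		return fp
	}
	fp := []byte{1, 2, 3, 4, 5, 6, 7, 8} // placeholder bytes, not a real CRC
	cache[key] = fp
	return fp
}

// reversedCopy returns a new slice holding the bytes of b in reverse order.
// b, and therefore the cached state, is left untouched.
func reversedCopy(b []byte) []byte {
	out := make([]byte, len(b))
	for i, v := range b {
		out[len(b)-1-i] = v
	}
	return out
}

func main() {
	first := reversedCopy(fingerprint("schema"))
	second := reversedCopy(fingerprint("schema"))
	// Both calls start from the same unmodified cached bytes.
	fmt.Println(first, second)
}
```

Reversing in place would instead flip the cached bytes on every call, which is why that workaround only appears to work every other time.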

nrwiersma · 2025-01-16T18:01:12Z

Huh, I thought I got that right. I was pretty sure I used the Java tests, but obviously I messed that up.

Thanks for catching this.

I like the idea of avro.CRC64LE to make it non-breaking and possibly documenting this in the README. This can get cleaned up in the next major.

kimgr · 2025-01-16T18:44:28Z

Thanks, I can work up a PR!

The Avro specification details a Single Object Encoding using a header to associate a schema ID with an Avro payload. The ID is defined as the CRC64 fingerprint in little-endian encoding. The pkg/crc64 module only provides big-endian CRC64, and the CRC64-AVRO fingerprint type is implemented as such. The specification does not detail endianness of the CRC64-AVRO fingerprint itself (only when embedded in an SOE header). To avoid breaking existing CRC64-AVRO fingerprints, add a new fingerprint type CRC64-AVRO-LE, identical to CRC64-AVRO except little-endian. Add an additional crc64.SumLittleEndian function to disturb existing code as little as possible. Add tests and benchmarks for the Sum functions. Fixes hamba#489.

The Avro specification details a Single Object Encoding using a header to associate a schema ID with an Avro payload. The ID is defined as the CRC64 fingerprint in little-endian encoding. The pkg/crc64 module only provides big-endian CRC64, and the CRC64-AVRO fingerprint type is implemented as such. The specification does not detail endianness of the CRC64-AVRO fingerprint itself (only when embedded in an SOE header). To avoid breaking existing CRC64-AVRO fingerprints, add a new fingerprint type CRC64-AVRO-LE, identical to CRC64-AVRO except little-endian. Generalize crc64.Sum() with a ByteOrder, so users can choose big/little/native encodings as required. Add tests and benchmarks for the Sum function. Fixes hamba#489.

kimgr · 2025-01-20T12:26:28Z

I added two PRs with slightly different takes:

feat: add CRC64-AVRO-LE fingerprint type #491 is my favorite -- it generalizes crc64.Sum() to let users pick byte order explicitly. Possible cons: noisy dependency on encoding/binary.ByteOrder. Benchmark shows 1 extra alloc for forming the [:] slice for PutUint64, but I believe that will happen in FingerprintUsing anyway, so I think this just shifts the allocation earlier.
feat: add CRC64-AVRO-LE fingerprint type #490 is as non-intrusive as I can make it.

Let me know what you think. I can follow up with some docs for the one you like better.

nrwiersma · 2025-01-20T15:26:30Z

Thanks for the effort here. I am also partial to #491 but it is backwards breaking. I have suggested a middle ground in the PR which would remedy this.

The Avro specification details a Single Object Encoding using a header to associate a schema ID with an Avro payload. The ID is defined as the CRC64 fingerprint in little-endian encoding. The pkg/crc64 module only provides big-endian CRC64, and the CRC64-AVRO fingerprint type is implemented as such. The specification does not detail endianness of the CRC64-AVRO fingerprint itself (only when embedded in an SOE header). To avoid breaking existing CRC64-AVRO fingerprints, add a new fingerprint type CRC64-AVRO-LE, identical to CRC64-AVRO except little-endian. Parameterize the crc64 package on byte order, add NewWithByteOrder and SumWithByteOrder top-level functions so users can configure the hasher to use a specific byte order. Add tests and benchmarks for the SumWithByteOrder function. Fixes hamba#489.

The Avro specification details a Single Object Encoding using a header to associate a schema ID with an Avro payload. The ID is defined as the CRC64 fingerprint in little-endian encoding. The pkg/crc64 module only provides big-endian CRC64, and the CRC64-AVRO fingerprint type is implemented as such. The specification does not detail endianness of the CRC64-AVRO fingerprint itself (only when embedded in an SOE header). To avoid breaking existing CRC64-AVRO fingerprints, add a new fingerprint type CRC64-AVRO-LE, identical to CRC64-AVRO except little-endian. Add NewWithByteOrder and SumWithByteOrder top-level functions to crc64 so users can configure the hasher to use a specific byte order. Add tests and benchmarks for the SumWithByteOrder function. Fixes hamba#489.
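As a rough illustration of the non-breaking shape described above, the sketch below keeps a big-endian `Sum` and adds a byte-order-aware variant next to it. The CRC-64-AVRO table construction follows the sample code in the Avro specification. The name `SumWithByteOrder` comes from the PR description, but the signatures, the `[8]byte` return type, and the `sum64` helper are assumptions and may differ from what was merged.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// empty is the CRC-64-AVRO seed value given in the Avro specification.
const empty uint64 = 0xc15d213aa4d7a795

// table is the 256-entry lookup table, built as in the specification's
// sample code.
var table = func() [256]uint64 {
	var t [256]uint64
	for i := range t {
		fp := uint64(i)
		for j := 0; j < 8; j++ {
			// -(fp & 1) is all ones when the low bit is set, else zero.
			fp = (fp >> 1) ^ (empty & -(fp & 1))
		}
		t[i] = fp
	}
	return t
}()

// sum64 computes the fingerprint as a plain 64-bit integer (hypothetical helper).
func sum64(data []byte) uint64 {
	fp := empty
	for _, b := range data {
		fp = (fp >> 8) ^ table[byte(fp)^b]
	}
	return fp
}

// Sum keeps the existing big-endian output, so current callers are unaffected.
func Sum(data []byte) [8]byte {
	return SumWithByteOrder(data, binary.BigEndian)
}

// SumWithByteOrder lets the caller choose the byte order of the result.
// This signature is an assumption made for illustration.
func SumWithByteOrder(data []byte, order binary.ByteOrder) [8]byte {
	var out [8]byte
	order.PutUint64(out[:], sum64(data))
	return out
}

func main() {
	data := []byte(`"null"`)
	fmt.Printf("big-endian:    %x\n", Sum(data))
	fmt.Printf("little-endian: %x\n", SumWithByteOrder(data, binary.LittleEndian))
}
```

Both functions share one integer computation and differ only in how the eight bytes are laid out, so a little-endian fingerprint type could be added without touching existing big-endian fingerprints.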

This was referenced Jan 20, 2025

feat: add CRC64-AVRO-LE fingerprint type #490

Closed

feat: add CRC64-AVRO-LE fingerprint type #491

Merged

nrwiersma closed this as completed in #491 Jan 30, 2025

nrwiersma closed this as completed in e9a408d Jan 30, 2025
