CRC64 schema fingerprint endianness #489
Comments
Heh, also when trying to work around this, I was bitten by the schema cache in …
Huh, I thought I got that right; I was pretty sure I used the Java tests, but obviously I messed that up. Thanks for catching this. I like the idea of …
Thanks, I can work up a PR!
The Avro specification details a Single Object Encoding using a header to associate a schema ID with an Avro payload. The ID is defined as the CRC64 fingerprint in little-endian encoding. The pkg/crc64 module only provides big-endian CRC64, and the CRC64-AVRO fingerprint type is implemented as such. The specification does not detail endianness of the CRC64-AVRO fingerprint itself (only when embedded in an SOE header). To avoid breaking existing CRC64-AVRO fingerprints, add a new fingerprint type CRC64-AVRO-LE, identical to CRC64-AVRO except little-endian. Add an additional crc64.SumLittleEndian function to disturb existing code as little as possible. Add tests and benchmarks for the Sum functions. Fixes hamba#489.
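A rough sketch of how that could look (illustration only: the crc64sketch package name is made up, the CRC-64-AVRO computation follows the reference algorithm from the Avro specification rather than hamba/avro's internal code, and the [8]byte return type is an assumption):

```go
// Sketch only: a big-endian Sum alongside the proposed SumLittleEndian.
// The fingerprint computation follows the Avro spec's CRC-64-AVRO (Rabin)
// reference algorithm, not hamba/avro's internals.
package crc64sketch

import "encoding/binary"

const empty = 0xc15d213aa4d7a795 // "empty" CRC-64-AVRO value from the Avro spec

var table [256]uint64

func init() {
	// Build the lookup table exactly as in the spec's reference code.
	for i := range table {
		fp := uint64(i)
		for j := 0; j < 8; j++ {
			fp = (fp >> 1) ^ (empty & -(fp & 1))
		}
		table[i] = fp
	}
}

// fingerprint64 computes the CRC-64-AVRO fingerprint of data as a uint64.
func fingerprint64(data []byte) uint64 {
	fp := uint64(empty)
	for _, b := range data {
		fp = (fp >> 8) ^ table[byte(fp)^b]
	}
	return fp
}

// Sum illustrates the existing behaviour: fingerprint bytes in big-endian order.
func Sum(data []byte) [8]byte {
	var out [8]byte
	binary.BigEndian.PutUint64(out[:], fingerprint64(data))
	return out
}

// SumLittleEndian is the proposed addition: the same CRC value, serialized
// little-endian, matching what Single Object Encoding expects in its header.
func SumLittleEndian(data []byte) [8]byte {
	var out [8]byte
	binary.LittleEndian.PutUint64(out[:], fingerprint64(data))
	return out
}
```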
The Avro specification details a Single Object Encoding using a header to associate a schema ID with an Avro payload. The ID is defined as the CRC64 fingerprint in little-endian encoding. The pkg/crc64 module only provides big-endian CRC64, and the CRC64-AVRO fingerprint type is implemented as such. The specification does not detail endianness of the CRC64-AVRO fingerprint itself (only when embedded in an SOE header). To avoid breaking existing CRC64-AVRO fingerprints, add a new fingerprint type CRC64-AVRO-LE, identical to CRC64-AVRO except little-endian. Generalize crc64.Sum() with a ByteOrder, so users can choose big/little/native encodings as required. Add tests and benchmarks for the Sum function. Fixes hamba#489.
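And a sketch of this take, continuing the same illustrative package above: a single Sum that accepts a byte order instead of separate big-/little-endian functions. The exact signature is an assumption based on this description, not the merged API:

```go
// Alternative take (replaces the Sum/SumLittleEndian pair in the sketch above):
// one Sum parameterized on the caller's byte order.

// Sum serializes the CRC-64-AVRO fingerprint of data using the given order:
// binary.BigEndian, binary.LittleEndian, or binary.NativeEndian (Go 1.21+).
func Sum(data []byte, order binary.ByteOrder) [8]byte {
	var out [8]byte
	order.PutUint64(out[:], fingerprint64(data))
	return out
}
```

Existing callers would pass binary.BigEndian to reproduce today's fingerprints; Single Object Encoding callers would pass binary.LittleEndian.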
I added two PRs with slightly different takes:
Let me know what you think. I can follow up with some docs for the one you like better.
Thanks for the effort here. I am also partial to #491, but it is a backwards-breaking change. I have suggested a middle ground in the PR which would remedy this.
The Avro specification details a Single Object Encoding using a header to associate a schema ID with an Avro payload. The ID is defined as the CRC64 fingerprint in little-endian encoding. The pkg/crc64 module only provides big-endian CRC64, and the CRC64-AVRO fingerprint type is implemented as such. The specification does not detail endianness of the CRC64-AVRO fingerprint itself (only when embedded in an SOE header). To avoid breaking existing CRC64-AVRO fingerprints, add a new fingerprint type CRC64-AVRO-LE, identical to CRC64-AVRO except little-endian. Parameterize the crc64 package on byte order, add NewWithByteOrder and SumWithByteOrder top-level functions so users can configure the hasher to use a specific byte order. Add tests and benchmarks for the SumWithByteOrder function. Fixes hamba#489.
The Avro specification details a Single Object Encoding using a header to associate a schema ID with an Avro payload. The ID is defined as the CRC64 fingerprint in little-endian encoding. The pkg/crc64 module only provides big-endian CRC64, and the CRC64-AVRO fingerprint type is implemented as such. The specification does not detail endianness of the CRC64-AVRO fingerprint itself (only when embedded in an SOE header). To avoid breaking existing CRC64-AVRO fingerprints, add a new fingerprint type CRC64-AVRO-LE, identical to CRC64-AVRO except little-endian. Add NewWithByteOrder and SumWithByteOrder top-level functions to crc64 so users can configure the hasher to use a specific byte order. Add tests and benchmarks for the SumWithByteOrder function. Fixes hamba#489.
First off, thank you for a great library!
We use Avro's Single Object Encoding [1] to add a little header to binary payloads; basically:
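(A minimal sketch of that header construction, not our exact code: per the spec [1], the header is the two-byte marker 0xC3 0x01 followed by the eight-byte CRC-64-AVRO fingerprint in little-endian order, then the Avro-encoded payload. The fingerprint value below is a placeholder.)

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// soeHeader builds the 10-byte Single Object Encoding header: the two-byte
// marker 0xC3 0x01 followed by the CRC-64-AVRO schema fingerprint serialized
// as 8 little-endian bytes.
func soeHeader(fingerprint uint64) []byte {
	header := make([]byte, 10)
	header[0], header[1] = 0xC3, 0x01
	binary.LittleEndian.PutUint64(header[2:], fingerprint)
	return header
}

func main() {
	// Placeholder fingerprint value, only to show the byte layout.
	fp := uint64(0x1122334455667788)
	msg := append(soeHeader(fp), []byte("<avro binary payload>")...)
	fmt.Printf("% x\n", msg[:10]) // c3 01 88 77 66 55 44 33 22 11
}
```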
In interop with the canonical Java library, we see failures to identify the schema here [2]. That code is very Java, but it essentially decodes the 8 fingerprint bytes, little endian, into a 64-bit integer. Their fingerprinting algorithm produces a little-endian byte sequence [3], so that's consistent.
Unfortunately there's a byte-order inconsistency with hamba/avro here. Your CRC64 implementation produces a big-endian fingerprint [4].
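To make the mismatch concrete, here is a tiny standard-library sketch (no hamba/avro calls, made-up fingerprint value) showing that the same 64-bit fingerprint serializes to two different byte sequences, and only the little-endian one survives the Java decoder's little-endian read:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	// Made-up fingerprint value; only the byte layout matters here.
	fp := uint64(0x0123456789ABCDEF)

	be := make([]byte, 8)
	le := make([]byte, 8)
	binary.BigEndian.PutUint64(be, fp)
	binary.LittleEndian.PutUint64(le, fp)

	fmt.Printf("big-endian:    % x\n", be) // 01 23 45 67 89 ab cd ef
	fmt.Printf("little-endian: % x\n", le) // ef cd ab 89 67 45 23 01

	// Java's BinaryMessageDecoder reassembles the schema ID with little-endian
	// reads, so only the little-endian bytes round-trip to the same value.
	fmt.Println(binary.LittleEndian.Uint64(le) == fp) // true
	fmt.Println(binary.LittleEndian.Uint64(be) == fp) // false
}
```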
The specification doesn't say anything about byte order or representation of fingerprints in the abstract (only for Single Object Encoding), but I wonder if it would be a good idea to be consistent with the Java implementation's byte order, to avoid interop surprises?
This is a breaking change, of course, so if that's a concern, would you consider taking patches for a separate little-endian fingerprint type avro.CRC64LE? Or would you prefer to change the default? Or leave it to clients, maybe with a doc update somewhere? I'm happy to do the work if you guide me as to what changes you'd prefer (if any).

[1] https://avro.apache.org/docs/1.11.1/specification/#single-object-encoding
[2] https://github.com/apache/avro/blob/8e51c7e1c14116545c7b08e72f649064cbd9f1bb/lang/java/avro/src/main/java/org/apache/avro/message/BinaryMessageDecoder.java#L156
[3] https://github.com/apache/avro/blob/8e51c7e1c14116545c7b08e72f649064cbd9f1bb/lang/java/avro/src/main/java/org/apache/avro/SchemaNormalization.java#L64
[4] avro/pkg/crc64/crc64.go, line 96 (commit 68046a4)