Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CRC64 schema fingerprint endianness #489

Closed
kimgr opened this issue Jan 16, 2025 · 5 comments · Fixed by #491
Closed

CRC64 schema fingerprint endianness #489

kimgr opened this issue Jan 16, 2025 · 5 comments · Fixed by #491

Comments

@kimgr
Copy link
Contributor

kimgr commented Jan 16, 2025

First off, thank you for a great library!

We use Avro's Single Object Encoding [1] to add a little header to binary payloads; basically:

magic := []byte{0xc3, 0x01}
fingerprint, _ := schema.FingerprintUsing(avro.CRC64Avro)
header := append(magic, fingerprint...)

In interop with the canonical Java library, we see failures to identify the schema here [2]. That code is very Java, but it essentially decodes the 8 fingerprint bytes, little endian, into a 64-bit integer. Their fingerprinting algorithm produces a little-endian byte sequence [3], so that's consistent.

Unfortunately there's a byte-order inconsistency with hamba/avro here. Your CRC64 implementation produces a big-endian fingerprint [4].

The specification doesn't say anything about byte order or representation of fingerprints in the abstract (only for Single Object Encoding), but I wonder if it would be a good idea to be consistent to avoid interop surprises?

This is a breaking change, of course, so if that's a concern, would you consider taking patches for a separate little-endian fingerprint type avro.CRC64LE? Or would you prefer to turn the default? Or leave it to clients, maybe with a doc update somewhere? I'm happy to do the work if you guide me as to what changes you'd prefer (if any).

[1] https://avro.apache.org/docs/1.11.1/specification/#single-object-encoding
[2] https://github.com/apache/avro/blob/8e51c7e1c14116545c7b08e72f649064cbd9f1bb/lang/java/avro/src/main/java/org/apache/avro/message/BinaryMessageDecoder.java#L156
[3] https://github.com/apache/avro/blob/8e51c7e1c14116545c7b08e72f649064cbd9f1bb/lang/java/avro/src/main/java/org/apache/avro/SchemaNormalization.java#L64
[4]

return [Size]byte{byte(s >> 56), byte(s >> 48), byte(s >> 40), byte(s >> 32), byte(s >> 24), byte(s >> 16), byte(s >> 8), byte(s)}

@kimgr
Copy link
Contributor Author

kimgr commented Jan 16, 2025

Heh, also when trying to work around this, I was bitten by the schema cache in FingerprintUsing which effectively uses a global variable for each (schema,algo) pair. Reversing byte order on the returned fingerprint would modify the cached state and only work every other time.

@nrwiersma
Copy link
Member

Huh, i thought I got that right, was pretty sure I used the Java tests, but obviously I messed that up.

Thanks for catching this.

I like the idea of avro.CRC64LE to make it non-breaking and possibly documenting this in the README. This can get cleaned up in the next major.

@kimgr
Copy link
Contributor Author

kimgr commented Jan 16, 2025

Thanks, I can work up a PR!

kimgr added a commit to kimgr/avro that referenced this issue Jan 20, 2025
The Avro specification details a Single Object Encoding using a header
to associate a schema ID with an Avro payload. The ID is defined as the
CRC64 fingerprint in little-endian encoding.

The pkg/crc64 module only provides big-endian CRC64, and the CRC64-AVRO
fingerprint type is implemented as such. The specification does not
detail endianness of the CRC64-AVRO fingerprint itself (only when
embedded in an SOE header).

To avoid breaking existing CRC64-AVRO fingerprints, add a new
fingerprint type CRC64-AVRO-LE, identical to CRC64-AVRO except
little-endian.

Add an additional crc64.SumLittleEndian function to distrurb existing
code as little as possible.

Add tests and benchmarks for the Sum functions.

Fixes hamba#489.
kimgr added a commit to kimgr/avro that referenced this issue Jan 20, 2025
The Avro specification details a Single Object Encoding using a header
to associate a schema ID with an Avro payload. The ID is defined as the
CRC64 fingerprint in little-endian encoding.

The pkg/crc64 module only provides big-endian CRC64, and the CRC64-AVRO
fingerprint type is implemented as such. The specification does not
detail endianness of the CRC64-AVRO fingerprint itself (only when
embedded in an SOE header).

To avoid breaking existing CRC64-AVRO fingerprints, add a new
fingerprint type CRC64-AVRO-LE, identical to CRC64-AVRO except
little-endian.

Generalize crc64.Sum() with a ByteOrder, so users can choose
big/little/native encodings as required.

Add tests and benchmarks for the Sum function.

Fixes hamba#489.
@kimgr
Copy link
Contributor Author

kimgr commented Jan 20, 2025

I added two PRs with slightly different takes:

  • feat: add CRC64-AVRO-LE fingerprint type #491 is my favorite -- it generalizes crc64.Sum() to let users pick byte order explicitly. Possible cons: noisy dependency on encoding/binary.ByteOrder. Benchmark shows 1 extra alloc for forming the [:] slice for PutUint64, but I believe that will happen in FingerprintUsing anyway, so I think this just shifts the allocation earlier.
  • feat: add CRC64-AVRO-LE fingerprint type #490 is as non-intrusive as I can make it.

Let me know what you think. I can follow up with some docs for the one you like better.

@nrwiersma
Copy link
Member

Thanks for the effort here. I am also partial to #491 but it is backwards breaking. I have suggested a middle ground in the PR which would remedy this.

kimgr added a commit to kimgr/avro that referenced this issue Jan 23, 2025
The Avro specification details a Single Object Encoding using a header
to associate a schema ID with an Avro payload. The ID is defined as the
CRC64 fingerprint in little-endian encoding.

The pkg/crc64 module only provides big-endian CRC64, and the CRC64-AVRO
fingerprint type is implemented as such. The specification does not
detail endianness of the CRC64-AVRO fingerprint itself (only when
embedded in an SOE header).

To avoid breaking existing CRC64-AVRO fingerprints, add a new
fingerprint type CRC64-AVRO-LE, identical to CRC64-AVRO except
little-endian.

Parameterize the crc64 package on byte order, add NewWithByteOrder and
SumWithByteOrder top-level functions so users can configure the hasher
to use a specific byte order.

Add tests and benchmarks for the SumWithByteOrder function.

Fixes hamba#489.
kimgr added a commit to kimgr/avro that referenced this issue Jan 23, 2025
The Avro specification details a Single Object Encoding using a header
to associate a schema ID with an Avro payload. The ID is defined as the
CRC64 fingerprint in little-endian encoding.

The pkg/crc64 module only provides big-endian CRC64, and the CRC64-AVRO
fingerprint type is implemented as such. The specification does not
detail endianness of the CRC64-AVRO fingerprint itself (only when
embedded in an SOE header).

To avoid breaking existing CRC64-AVRO fingerprints, add a new
fingerprint type CRC64-AVRO-LE, identical to CRC64-AVRO except
little-endian.

Parameterize the crc64 package on byte order, add NewWithByteOrder and
SumWithByteOrder top-level functions so users can configure the hasher
to use a specific byte order.

Add tests and benchmarks for the SumWithByteOrder function.

Fixes hamba#489.
kimgr added a commit to kimgr/avro that referenced this issue Jan 23, 2025
The Avro specification details a Single Object Encoding using a header
to associate a schema ID with an Avro payload. The ID is defined as the
CRC64 fingerprint in little-endian encoding.

The pkg/crc64 module only provides big-endian CRC64, and the CRC64-AVRO
fingerprint type is implemented as such. The specification does not
detail endianness of the CRC64-AVRO fingerprint itself (only when
embedded in an SOE header).

To avoid breaking existing CRC64-AVRO fingerprints, add a new
fingerprint type CRC64-AVRO-LE, identical to CRC64-AVRO except
little-endian.

Parameterize the crc64 package on byte order, add NewWithByteOrder and
SumWithByteOrder top-level functions so users can configure the hasher
to use a specific byte order.

Add tests and benchmarks for the SumWithByteOrder function.

Fixes hamba#489.
kimgr added a commit to kimgr/avro that referenced this issue Jan 24, 2025
The Avro specification details a Single Object Encoding using a header
to associate a schema ID with an Avro payload. The ID is defined as the
CRC64 fingerprint in little-endian encoding.

The pkg/crc64 module only provides big-endian CRC64, and the CRC64-AVRO
fingerprint type is implemented as such. The specification does not
detail endianness of the CRC64-AVRO fingerprint itself (only when
embedded in an SOE header).

To avoid breaking existing CRC64-AVRO fingerprints, add a new
fingerprint type CRC64-AVRO-LE, identical to CRC64-AVRO except
little-endian.

Add NewWithByteOrder and SumWithByteOrder top-level functions to crc64
so users can configure the hasher to use a specific byte order.

Add tests and benchmarks for the SumWithByteOrder function.

Fixes hamba#489.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
2 participants