Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove Avro-related code from Java SDK "core" module #27851

Merged
merged 1 commit into from
Oct 18, 2023

Conversation

aromanenko-dev
Copy link
Contributor

@aromanenko-dev aromanenko-dev commented Aug 4, 2023

This is the final step to remove Avro dependency from Java SDK "core" after it was deprecated in 2.46.0 release.

Closes #25252


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

@aromanenko-dev aromanenko-dev marked this pull request as draft August 4, 2023 15:13
@github-actions github-actions bot added the java label Aug 4, 2023
@@ -485,8 +486,8 @@ public long getSplitBacklogBytes() {
* The checkpoint for an unbounded {@link CountingSource} is simply the last value produced. The
* associated source object encapsulates the information needed to produce the next value.
*/
@DefaultCoder(AvroCoder.class)
public static class CounterMark implements UnboundedSource.CheckpointMark {
@DefaultCoder(SerializableCoder.class)
Copy link
Contributor Author

@aromanenko-dev aromanenko-dev Aug 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should not be a breaking change, should it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is certainly a breaking change. Any checkpoint that was serialized and persisted with a previous version of Beam will fail on deserialization once upgrading Beam :/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right but I don't think Beam supports a hot upgrade.
I don't see a way how to avoid such change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make a note about this (as well as the larger change) in CHANGES.md.

Serialization isn't guaranteed to be stable either--if we're switching coders should we switch to a custom (or Beam-schema-based) coder here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It worries me a bit that there is no workaround. I suppose we could advise copying the old CountingSource into your own project and using that directly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The serialization might not be stable when upgrading a pipeline between Java versions (or, technically, if this class ever gets changed or moved).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, right. I guess a use case for that - it should be a streaming pipeline that was restarted with different (upgraded) version of Beam. Do we commit (Beam) to support this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if it's a solid commitment (e.g. this is itself a breaking change), but we really try hard where feasible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend creating a custom coder which is binary compatible to the Avro based one. This is actually rather trivial, Avro does zigzag varint encoding and the bytes of both longs in CounterMark are just written after another without anything added.

Zigzag encoding is also trivial to implement, or using org.apache.beam.vendor.grpc.v1p54p0.com.google.protobuf.CodedOutputStream.encodeZigZag64 (and Protobuf will always remain a core dependency):

  public static long encodeZigZag64(final long n) {
    return (n << 1) ^ (n >> 63);
  }
  public static long decodeZigZag64(final long n) {
    return (n >>> 1) ^ -(n & 1);
  }

And varint encoding / decoding is already implemented in org.apache.beam.sdk.util.VarInt.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mosche I think you have a point but it's overkill in this case. I'd stick with SchemaCoder as Robert suggested.

@github-actions
Copy link
Contributor

github-actions bot commented Sep 15, 2023

Test Results

  1 257 files  +1 111    1 257 suites  +1 111   2h 52m 43s ⏱️ + 2h 13m 31s
10 432 tests +9 098  10 362 ✔️ +9 061  68 💤 +35  2 +2 
10 456 runs  +9 120  10 386 ✔️ +9 083  68 💤 +35  2 +2 

For more details on these failures, see this check.

Results for commit f5ee525a. ± Comparison against base commit cf0cf3b.

This pull request removes 9 and adds 9105 tests. Note that renamed tests count towards both.
id]
id`]
org.apache.beam.sdk.extensions.sql.CalciteCannotParseSimpleIdentifiersTest ‑ testFailsToParseAlias[`field
org.apache.beam.sdk.extensions.sql.CalciteCannotParseSimpleIdentifiersTest ‑ testFailsToParseAlias[field
org.apache.beam.sdk.extensions.sql.impl.LazyAggregateCombineFnTest$UdafImplTest ‑ subclassGetUdafImpl[aggregateFn: org.apache.beam.sdk.extensions.sql.impl.LazyAggregateCombineFnTest$Sum@ac9fa95]
org.apache.beam.sdk.extensions.sql.impl.LazyAggregateCombineFnTest$UdafImplTest ‑ subclassGetUdafImpl[aggregateFn: org.apache.beam.sdk.extensions.sql.impl.LazyAggregateCombineFnTest$SumChild@67172cd]
org.apache.beam.sdk.testutils.jvmverification.JvmVerification ‑ verifyCodeIsCompiledWithJava8
org.apache.beam.sdk.testutils.jvmverification.JvmVerification ‑ verifyRunningJVMVersionIs17
org.apache.beam.sdk.testutils.jvmverification.JvmVerification ‑ verifyTestCodeIsCompiledWithJava17
DefaultPackageTest ‑ defaultPackageInvoker
org.apache.beam.examples.DebuggingWordCountTest ‑ testDebuggingWordCount
org.apache.beam.examples.MinimalWordCountTest ‑ testMinimalWordCount
org.apache.beam.examples.WindowedWordCountIT ‑ testWindowedWordCountInBatchDynamicSharding
org.apache.beam.examples.WindowedWordCountIT ‑ testWindowedWordCountInBatchStaticSharding
org.apache.beam.examples.WindowedWordCountIT ‑ testWindowedWordCountInStreamingStaticSharding
org.apache.beam.examples.WordCountIT ‑ testE2EWordCount
org.apache.beam.examples.WordCountTest ‑ testCountWords
org.apache.beam.examples.WordCountTest ‑ testExtractWordsFn
org.apache.beam.examples.complete.AutoCompleteTest ‑ testAutoComplete[0]
…

♻️ This comment has been updated with latest results.

@aromanenko-dev
Copy link
Contributor Author

retest this please

@aromanenko-dev aromanenko-dev marked this pull request as ready for review September 18, 2023 09:04
@aromanenko-dev aromanenko-dev changed the title [WIP] Remove Avro-related code from Java SDK "core" module Remove Avro-related code from Java SDK "core" module Sep 18, 2023
@aromanenko-dev
Copy link
Contributor Author

aromanenko-dev commented Sep 18, 2023

@mosche Could you take a look, please? I think it's ready for review.

@github-actions
Copy link
Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@aromanenko-dev
Copy link
Contributor Author

Run Java_Pulsar_IO_Direct PreCommit

Copy link
Member

@mosche mosche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of comments @aromanenko-dev
Are you sure enough time has already passed to remove this from core, it hasn't been deprecated for too long?

@@ -485,8 +486,8 @@ public long getSplitBacklogBytes() {
* The checkpoint for an unbounded {@link CountingSource} is simply the last value produced. The
* associated source object encapsulates the information needed to produce the next value.
*/
@DefaultCoder(AvroCoder.class)
public static class CounterMark implements UnboundedSource.CheckpointMark {
@DefaultCoder(SerializableCoder.class)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is certainly a breaking change. Any checkpoint that was serialized and persisted with a previous version of Beam will fail on deserialization once upgrading Beam :/

@@ -472,7 +472,7 @@ public void testCoderPrecedence() throws Exception {
assertEquals(SerializableCoder.of(MyValueC.class), registry.getCoder(MyValueC.class));
}

@DefaultCoder(AvroCoder.class)
@DefaultCoder(MockAvroCoder.class)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just use SerializableCoder here if the intention is to just test precedence of coders, any mentioning of Avro will just confuse people.

Copy link
Contributor Author

@aromanenko-dev aromanenko-dev Sep 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need this class for tests where SerializableCoder is used implicitly and we need to have another default coder to test a correct coder precedence. So I use MockDefaultCoder for that.

@@ -39,7 +39,7 @@ public class DefaultCoderTest {

@Rule public ExpectedException thrown = ExpectedException.none();

@DefaultCoder(AvroCoder.class)
@DefaultCoder(MockAvroCoder.class)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rename the record class and coder to not contain Avro. Also, I'd recommend to inline the coder impl into the test class to limit its scope.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed it to MockDefaultCoder and it's used in other test as well, so let's keep it as a separate file

@kennknowles
Copy link
Member

Regarding whether enough time has passed: I am OK with it. Users have an easy migration path. Thanks for following through with this very difficult change!

@aromanenko-dev aromanenko-dev force-pushed the 25252_no_avro_in_core branch 2 times, most recently from 451a504 to f22a161 Compare October 13, 2023 15:44
Copy link
Contributor

@robertwb robertwb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for finishing up this huge effort!

@aromanenko-dev
Copy link
Contributor Author

Run Java_Examples_Dataflow_Java11 PreCommit

@aromanenko-dev
Copy link
Contributor Author

Run Java_Examples_Dataflow_Java17 PreCommit

@aromanenko-dev
Copy link
Contributor Author

Run Java_GCP_IO_Direct PreCommit

@aromanenko-dev
Copy link
Contributor Author

Run Java_Examples_Dataflow_Java11 PreCommit

@aromanenko-dev
Copy link
Contributor Author

aromanenko-dev commented Oct 14, 2023

Seems like all tests passed, failed jobs looks not related.
Java PreCommit passed but status was not updated on this page.

@mosche @robertwb @kennknowles
Thanks for review. Please, let me know if you think about some other checks that we have to run before merge.

@kennknowles
Copy link
Member

The precommit timed out and I can't quite tell but maybe locally try: :runners:google-cloud-dataflow-java:worker:test

I don't think this requires cloud.

@kennknowles
Copy link
Member

run java precommit

@aromanenko-dev
Copy link
Contributor Author

retest this please

@aromanenko-dev
Copy link
Contributor Author

Run Java_GCP_IO_Direct PreCommit

@aromanenko-dev
Copy link
Contributor Author

Run Java PostCommit

@aromanenko-dev
Copy link
Contributor Author

Run Java PreCommit

@aromanenko-dev
Copy link
Contributor Author

Run Java_Amazon-Web-Services2_IO_Direct PreCommit

@aromanenko-dev
Copy link
Contributor Author

Run Java PostCommit

@aromanenko-dev
Copy link
Contributor Author

@kennknowles

The precommit timed out and I can't quite tell but maybe locally try: :runners:google-cloud-dataflow-java:worker:test

Yes, it passed locally

@aromanenko-dev
Copy link
Contributor Author

There are two failing PreCommit actions but I don't think they are related.

So, if there are no objections or any other comments, I'd merge this PR.

@robertwb
Copy link
Contributor

robertwb commented Oct 18, 2023 via email

@aromanenko-dev aromanenko-dev merged commit 5fd1245 into apache:master Oct 18, 2023
@aromanenko-dev aromanenko-dev added this to the 2.52.0 Release milestone Oct 18, 2023
@kennknowles
Copy link
Member

🎉 🎉 🎉 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Task]: [Avro] Stop using Avro in "core"
4 participants