[improve][client] Add schema cache to improve performance #23808
@yunmaoQu Please add the following content to your PR description and select a checkbox:
There are several inconsistencies in this PR. For example, the class names and the class file names don't match. Please test this PR in your own fork first to ensure that it passes tests.
It seems that this PR contains a lot of features related to the schema caching. Instead of adding a lot of features, it would be better to keep the implementation to the minimum.
I'm surprised by competing implementations for implementing the schema cache. There's currently already an open PR #23777.
OK, I tested everything. Could you review it and give me some suggestions?
Instead of adding more code to test everything, please reduce this to a minimal implementation. That means removing the features that track cache metrics; they aren't needed. For the cache implementation, I'd suggest something along the lines of this minimal example:

```java
// Weak keys let an entry be collected once its class is no longer strongly referenced.
private static final ConcurrentMap<Class<?>, Schema<?>> PROTOBUF_CACHE = new MapMaker().weakKeys().makeMap();

public <T extends com.google.protobuf.GeneratedMessageV3> Schema<T> newProtobufSchema(Class<T> clazz) {
    return (Schema<T>) PROTOBUF_CACHE.computeIfAbsent(clazz,
            k -> ProtobufSchema.of(SchemaDefinition.builder().withPojo(clazz).build())).clone();
}
```

There shouldn't ever be a need to clear the cache, since it's bounded by the number of classes with strong references. It won't consume a significant amount of memory in the first place.
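The weak-keyed, per-class cache idea from the comment above can be tried out without Pulsar or Guava. What follows is only a sketch under stated assumptions: `WeakClassCacheSketch`, `FakeSchema`, `schemaFor`, and `BUILD_COUNT` are hypothetical names, and a synchronized `WeakHashMap` from the JDK stands in for Guava's `MapMaker().weakKeys()` map. It shows that the expensive build runs once per class while every caller still receives its own clone.

```java
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class WeakClassCacheSketch {
    // Hypothetical stand-in for a schema object that is expensive to build.
    // It stores only the class name, so the cached value does not keep its key alive.
    public static final class FakeSchema {
        public final String typeName;

        public FakeSchema(String typeName) {
            this.typeName = typeName;
        }

        public FakeSchema copy() {
            return new FakeSchema(typeName);
        }
    }

    // Counts how often the "expensive" build actually runs.
    public static final AtomicInteger BUILD_COUNT = new AtomicInteger();

    // Assumption: a synchronized WeakHashMap replaces the Guava weak-keys map.
    private static final Map<Class<?>, FakeSchema> CACHE =
            Collections.synchronizedMap(new WeakHashMap<>());

    public static FakeSchema schemaFor(Class<?> clazz) {
        FakeSchema cached = CACHE.computeIfAbsent(clazz, k -> {
            BUILD_COUNT.incrementAndGet();
            return new FakeSchema(k.getName());
        });
        // Hand out a copy so callers never share the cached instance.
        return cached.copy();
    }

    public static void main(String[] args) {
        FakeSchema a = schemaFor(String.class);
        FakeSchema b = schemaFor(String.class);
        System.out.println(a != b);            // true: each call returns a fresh copy
        System.out.println(BUILD_COUNT.get()); // 1: the build ran only once
    }
}
```

The cache never needs explicit eviction in this sketch because an entry can only exist for a class that has been loaded, which matches the bound described in the comment.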
OK. Should I implement it on top of the previous commit, or take a different approach?
That's something you can decide. Please read my previous message and draw your conclusions.
Yes. We can work on it together, @walkinggo.
I implemented a minimal version. Could you review it and give me some suggestions? Thanks for your earlier guidance.
This is very close to the minimal implementation. I added a few minor comments.
...lient/src/main/java/org/apache/pulsar/client/impl/PulsarClientImplementationBindingImpl.java
Please also update the PR description to match the minimal implementation.
@lhotari I have done this.
@yunmaoQu I don't see that the PR description has been updated to match the minimal implementation. For example, the "modifications" part hasn't been updated.
checkstyle error:
@yunmaoQu I'd recommend running CI builds in your fork so that you get CI feedback while working on the changes. Some of that is explained in the contribution guide, https://pulsar.apache.org/contribute/personal-ci/ . You will also need to enable GitHub Actions for your fork of apache/pulsar in the GitHub UI.
Thanks a lot.
- add a weak reference cache for caching a schema instance for Schema.AVRO, Schema.JSON, Schema.PROTOBUF.
941beb6 to fd24bd1
@lhotari I've fixed the errors that CI reported, but when I run my personal CI, it still reports an error.
CI isn't the only choice. You can also reproduce the errors locally. For CI feedback, I'd recommend creating a PR in your own fork so that the PR appears at https://github.com/yunmaoQu/pulsar/pulls . When you push changes to the branch, the CI will trigger and you won't have to depend on CI feedback from apache/pulsar CI runs, which will only run after someone approves the workflow run.
For locally running a sanity check, you can use this command: mvn -Pcore-modules,-main -T 1C clean install -DskipTests -Dspotbugs.skip=true -DnarPluginPhase=none
There are multiple checkstyle errors:
I'd recommend following the contribution guide to properly configure IntelliJ/IDEA for Pulsar development.
The unnecessary cleanup strategy code is back. We don't need those types of features.
Very sorry. That was a small Git mistake on my side.
9bfbe8a to 1cf7b61
@lhotari I ran CI in my personal repo, and the part I changed passes. You can see it at https://github.com/yunmaoQu/pulsar/pulls
The changes are now fine for the minimal implementation. I did some manual checks and noticed that adding this cache won't resolve the performance issue. For the Avro schema, the problem is that the .clone() method doesn't pass the already parsed Avro schema instance on to the new cloned instance. Since such issues can be hard to detect, it would be necessary to have a micro benchmark that could be used to detect them. In Pulsar, we have the module microbench where such a benchmark could be added. The module has JMH configured. In JMH, it's also possible to enable profiling using AsyncProfiler or Java Flight Recorder to find the performance hotspots. I won't be able to guide you through all the steps required to handle this. It would be necessary to add a micro benchmark and then address the performance issues that show up.
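The clone problem described in this review comment can be illustrated in isolation. The sketch below is not Pulsar's code: `CloneReparseSketch`, `ReparsingSchema`, `SharingSchema`, `parse`, and `PARSE_COUNT` are hypothetical names, and a trivial string operation stands in for the expensive Avro parse. It contrasts a copy that repeats the parse with a copy that hands the already parsed result to the new instance, which is why a cache that returns clones gains nothing in the first case.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CloneReparseSketch {
    // Counts how often the "expensive" parse step runs.
    public static final AtomicInteger PARSE_COUNT = new AtomicInteger();

    // Hypothetical stand-in for an expensive schema parse.
    static String parse(String definition) {
        PARSE_COUNT.incrementAndGet();
        return definition.trim();
    }

    // Copying re-runs the parse, so cached instances save nothing once cloned.
    public static final class ReparsingSchema {
        public final String definition;
        public final String parsed;

        public ReparsingSchema(String definition) {
            this.definition = definition;
            this.parsed = parse(definition);
        }

        public ReparsingSchema copy() {
            return new ReparsingSchema(definition);
        }
    }

    // Copying passes the already parsed result to the new instance.
    public static final class SharingSchema {
        public final String definition;
        public final String parsed;

        public SharingSchema(String definition) {
            this(definition, parse(definition));
        }

        private SharingSchema(String definition, String parsed) {
            this.definition = definition;
            this.parsed = parsed;
        }

        public SharingSchema copy() {
            return new SharingSchema(definition, parsed);
        }
    }

    public static void main(String[] args) {
        ReparsingSchema r = new ReparsingSchema(" record ");
        r.copy();
        r.copy();
        System.out.println(PARSE_COUNT.get()); // 3: one parse per instance

        PARSE_COUNT.set(0);
        SharingSchema s = new SharingSchema(" record ");
        s.copy();
        s.copy();
        System.out.println(PARSE_COUNT.get()); // 1: copies reuse the parsed result
    }
}
```

A counter like this only confirms the behavior in a toy setting. Measuring the real cost would still call for a benchmark in the microbench module, as the comment suggests.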
@lhotari This PR doesn't seem to make sense anymore. Should I close it and focus on another issue?
close #23707
Motivation
Schema creation (e.g., Schema.AVRO(SomeClass.class)) is fairly CPU intensive. It would be useful to have a weak reference cache for caching the schema instance.
Modifications
Add SchemaCache implementation using WeakHashMap for schema instance caching
Documentation
doc
doc-required
doc-not-needed
doc-complete
Matching PR in forked repository
N/A