
feat(Low-Code Concurrent CDK): Add ConcurrentPerPartitionCursor #111

Merged
merged 43 commits
Jan 23, 2025

Conversation

tolik0
Contributor

@tolik0 tolik0 commented Dec 4, 2024

Summary by CodeRabbit

  • New Features

    • Added support for concurrent cursor processing in declarative sources.
    • Enhanced incremental sync capabilities with per-partition and global cursor strategies.
    • Introduced new methods for creating concurrent cursors based on partitioned data streams.
    • Added a method to check if the timer is running in the Timer class.
  • Improvements

    • Streamlined record filtering and cursor handling.
    • Improved error handling for cursor and record management.
    • Updated request option retrieval in retrievers.
  • Technical Updates

    • Introduced new cursor classes for more flexible incremental synchronization.
    • Expanded test coverage for concurrent and incremental stream processing.
    • Updated tests to reflect changes in stream categorization and expected behaviors.
    • Added detailed unit tests for low-code connector streams and hierarchical substreams.
    • Enhanced the testing framework for ConcurrentDeclarativeSource and improved error handling in tests.
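The Timer change called out in the summary is small enough to sketch. The snippet below is a hedged approximation, not the CDK's actual implementation (the real class lives in `global_substream_cursor.py`; the internals here are assumed):

```python
import time
from typing import Optional


class Timer:
    """Approximation of the Timer used around a global-cursor sync window."""

    def __init__(self) -> None:
        self._start: Optional[float] = None

    def start(self) -> None:
        self._start = time.perf_counter()

    def finish(self) -> int:
        if self._start is None:
            raise RuntimeError("Timer was not started")
        elapsed_ms = int((time.perf_counter() - self._start) * 1000)
        self._start = None
        return elapsed_ms

    def is_running(self) -> bool:
        # The new helper: True between start() and finish().
        return self._start is not None
```

The point of `is_running` is that callers can branch on timer state instead of inspecting private attributes.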

@github-actions github-actions bot added the enhancement New feature or request label Dec 4, 2024
Contributor

coderabbitai bot commented Dec 4, 2024

📝 Walkthrough

Walkthrough

The pull request introduces significant enhancements to the Airbyte CDK's concurrent declarative source handling, focusing on improving incremental sync capabilities for low-code connectors. The changes primarily revolve around extending support for different cursor types, particularly PerPartitionWithGlobalCursor, and refining the way streams are processed concurrently. The modifications span multiple files, introducing new classes, methods, and updating existing logic to support more complex incremental synchronization scenarios.

Changes

File Change Summary

  • airbyte_cdk/sources/declarative/concurrent_declarative_source.py: Added an import for PerPartitionWithGlobalCursor and updated the _group_streams method to handle the new cursor type. New method _get_retriever introduced.
  • airbyte_cdk/sources/declarative/incremental/__init__.py: Added exports for ConcurrentCursorFactory and ConcurrentPerPartitionCursor.
  • airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py: Introduced new classes for managing concurrent cursors with advanced state tracking, plus several new methods for cursor management.
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py: Added methods for creating concurrent cursors from different cursor types. Updated an existing method to accept additional parameters.
  • airbyte_cdk/sources/declarative/retrievers/simple_retriever.py: Updated request handling to use request_option_provider.
  • airbyte_cdk/sources/declarative/extractors/record_filter.py: Simplified cursor handling in ClientSideIncrementalRecordFilterDecorator.
  • airbyte_cdk/sources/declarative/incremental/per_partition_cursor.py: Added a should_be_synced method and enhanced record validation.
  • unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py: Expanded the test suite for ConcurrentDeclarativeSource, adding tests for incremental sync scenarios.
  • unit_tests/sources/declarative/extractors/test_record_filter.py: Updated tests to reflect cursor parameter consolidation in ClientSideIncrementalRecordFilterDecorator.
  • unit_tests/sources/declarative/parsers/test_model_to_component_factory.py: Added tests for error handling in incremental sync configurations.
  • unit_tests/sources/declarative/test_concurrent_declarative_source.py: Adjusted tests for expected concurrent stream counts and classifications.
  • unit_tests/sources/declarative/retrievers/test_simple_retriever.py: Updated tests to reflect changes in request handling logic.
  • airbyte_cdk/sources/streams/concurrent/cursor.py: Modified state management in the ConcurrentCursor class for better encapsulation.
  • airbyte_cdk/sources/declarative/incremental/global_substream_cursor.py: Added an is_running method to the Timer class.


Hey there! 👋 I noticed some interesting improvements in the concurrent declarative source handling. A few questions for you:

  1. The changes introduce PerPartitionWithGlobalCursor support - wdyt about how this might impact existing low-code connectors?

  2. I see you've consolidated some cursor handling logic. Do you think this makes the code more readable or potentially introduces any subtle complexity?

  3. The new ConcurrentPerPartitionCursor seems quite powerful. Any specific use cases that inspired its development?

Would love to hear your thoughts! 🚀


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

🧹 Outside diff range and nitpick comments (5)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1)

59-66: Consider simplifying the constructor parameters

The __init__ method of ConcurrentPerPartitionCursor has many parameters, which can make it harder to maintain and understand. Would it be beneficial to encapsulate related parameters into data classes or reduce the number of parameters if possible? Wdyt?
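As an illustration of the dataclass suggestion, a hypothetical StreamContext could absorb the stream-identity arguments so the constructor takes one object. All names below are invented for the sketch, not the CDK's API:

```python
from dataclasses import dataclass, field
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class StreamContext:
    """Hypothetical grouping of the stream-identity constructor arguments."""

    name: str
    namespace: Optional[str] = None
    state: Mapping[str, Any] = field(default_factory=dict)


class ConcurrentPerPartitionCursorSketch:
    """Stand-in showing how the parameter list shrinks once related args are grouped."""

    def __init__(self, cursor_factory: Any, partition_router: Any, stream: StreamContext) -> None:
        self._cursor_factory = cursor_factory
        self._partition_router = partition_router
        self._stream = stream
```

Grouping like this also lets the same context object be passed to other collaborators without re-listing each field.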

airbyte_cdk/sources/declarative/concurrent_declarative_source.py (2)

23-25: Unused import statement

We import PerPartitionWithGlobalCursor, but it's not used elsewhere in the code. Should we remove this import to keep the code clean? Wdyt?


309-361: Refactor duplicated code in _group_streams method

There seems to be duplicated code in the conditional blocks handling DatetimeBasedCursorModel streams. Could we refactor these blocks into a helper function to reduce redundancy and improve readability? Wdyt?

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

915-970: Consider reducing code duplication

The methods create_concurrent_cursor_from_datetime_based_cursor and create_concurrent_cursor_from_perpartition_cursor share similar logic. Could we extract the common parts into a shared helper function to simplify maintenance? Wdyt?
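A minimal version of the proposed shared helper might look like this. The helper name and the stand-in model class are assumptions, not code from the PR:

```python
from typing import Any, Mapping, Type


def validate_component_type(component_definition: Mapping[str, Any], model_type: Type) -> None:
    """Shared guard both cursor-factory methods could call before building a cursor."""
    component_type = component_definition.get("type")
    if component_type != model_type.__name__:
        raise ValueError(
            f"Expected manifest component of type {model_type.__name__}, "
            f"but received {component_type} instead"
        )


class DatetimeBasedCursorModel:
    """Stand-in for the real pydantic model; only the class name matters here."""
```

Both factory methods would then start with a single `validate_component_type(...)` call instead of repeating the check.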

airbyte_cdk/sources/streams/concurrent/cursor.py (1)

243-251: Consider extracting common logic between close methods?

The new close_partition_without_emit shares a lot of logic with close_partition. What do you think about extracting the common logic into a private method to reduce duplication? Something like this, wdyt?

+    def _close_partition_internal(self, partition: Partition) -> bool:
+        slice_count_before = len(self.state.get("slices", []))
+        self._add_slice_to_state(partition)
+        should_merge = slice_count_before < len(self.state["slices"])
+        if should_merge:
+            self._merge_partitions()
+        self._has_closed_at_least_one_slice = True
+        return should_merge
+
     def close_partition_without_emit(self, partition: Partition) -> None:
-        slice_count_before = len(self.state.get("slices", []))
-        self._add_slice_to_state(partition)
-        if slice_count_before < len(
-            self.state["slices"]
-        ):  # only emit if at least one slice has been processed
-            self._merge_partitions()
-        self._has_closed_at_least_one_slice = True
+        self._close_partition_internal(partition)

     def close_partition(self, partition: Partition) -> None:
-        slice_count_before = len(self.state.get("slices", []))
-        self._add_slice_to_state(partition)
-        if slice_count_before < len(
-            self.state["slices"]
-        ):  # only emit if at least one slice has been processed
-            self._merge_partitions()
+        if self._close_partition_internal(partition):
             self._emit_state_message()
-        self._has_closed_at_least_one_slice = True
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between acb6630 and a3304b9.

📒 Files selected for processing (7)
  • airbyte_cdk/sources/declarative/concurrent_declarative_source.py (2 hunks)
  • airbyte_cdk/sources/declarative/incremental/__init__.py (2 hunks)
  • airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3 hunks)
  • airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (1 hunks)
  • airbyte_cdk/sources/declarative/stream_slicers/declarative_partition_generator.py (1 hunks)
  • airbyte_cdk/sources/streams/concurrent/cursor.py (1 hunks)
🔇 Additional comments (5)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (2)

25-29: Question about returning only the first cursor

In the ConcurrentCursorFactory, the create method returns only the first cursor from _create_function. Since _create_function returns a tuple of cursors, should we ensure that we're not discarding any necessary cursors? Perhaps we should handle all cursors returned. Wdyt?


221-224: Ensure partition key serialization handles edge cases

In _to_partition_key and _to_dict, we use self._partition_serializer. Do we need to ensure that partition keys are properly sanitized or encoded to handle special characters and prevent serialization issues? Wdyt?
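One common way to make partition keys canonical is to serialize with sorted keys, so two dicts describing the same partition always produce the same key. A sketch (illustrative only, not the CDK's `_partition_serializer`):

```python
import json
from typing import Any, Mapping


def to_partition_key(partition: Mapping[str, Any]) -> str:
    # sort_keys yields one canonical key per logical partition regardless of
    # dict insertion order; ensure_ascii=False keeps non-ASCII values intact.
    return json.dumps(partition, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def from_partition_key(key: str) -> Mapping[str, Any]:
    return json.loads(key)
```

JSON escaping handles quotes and control characters, which covers most of the special-character concerns raised above.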

airbyte_cdk/sources/declarative/concurrent_declarative_source.py (1)

317-318: Type checking with isinstance

We use isinstance(declarative_stream.retriever.stream_slicer, PerPartitionWithGlobalCursor) to check the type. Should we consider using duck typing or interface checks to allow for subclasses or alternative implementations that fulfill the same protocol? Wdyt?
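The duck-typing alternative could be expressed with a runtime-checkable Protocol. The member names below are assumed for illustration:

```python
from typing import Any, Iterable, Protocol, runtime_checkable


@runtime_checkable
class PerPartitionSlicerLike(Protocol):
    """Structural stand-in: any slicer exposing these members matches."""

    def stream_slices(self) -> Iterable[Any]: ...

    def get_stream_state(self) -> Any: ...


def supports_per_partition(slicer: Any) -> bool:
    # runtime_checkable isinstance only verifies method *presence*, not signatures.
    return isinstance(slicer, PerPartitionSlicerLike)


class FakeSlicer:
    def stream_slices(self):
        return []

    def get_stream_state(self):
        return {}
```

The trade-off: a Protocol accepts alternative implementations, but a concrete `isinstance` check guarantees the exact semantics of PerPartitionWithGlobalCursor.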

airbyte_cdk/sources/declarative/stream_slicers/declarative_partition_generator.py (1)

41-66: Clarify the purpose of DeclarativePartitionFactory1

The new class DeclarativePartitionFactory1 appears to be similar to DeclarativePartitionFactory but with a different approach to handling the retriever. Is this intended to replace the original factory or serve a different purpose? Providing more context or renaming the class for clarity might help. Wdyt?

airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (1)

181-181: LGTM!

The change to use request_option_provider instead of stream_slicer improves modularity by separating concerns.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (5)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1)

152-156: Could we simplify the partition limit condition in _ensure_partition_limit method?

Currently, the condition is len(self._cursor_per_partition) > self.DEFAULT_MAX_PARTITIONS_NUMBER - 1. Would it improve readability to use len(self._cursor_per_partition) >= self.DEFAULT_MAX_PARTITIONS_NUMBER instead? This change might make the maximum partition limit clearer. Wdyt?
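The two conditions are equivalent for integer counts, so the suggested change is purely about readability. A quick demonstration (the constant's value is assumed; the real one may differ):

```python
DEFAULT_MAX_PARTITIONS_NUMBER = 10_000  # assumed value for illustration


def over_limit_current(count: int) -> bool:
    # Condition as written in the PR.
    return count > DEFAULT_MAX_PARTITIONS_NUMBER - 1


def over_limit_suggested(count: int) -> bool:
    # Suggested rewrite: same truth table, clearer intent.
    return count >= DEFAULT_MAX_PARTITIONS_NUMBER
```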

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

910-965: Consider extracting common validation logic and fixing indentation.

A few suggestions to improve this new method:

  1. The component type validation logic (lines 922-926) is duplicated from create_concurrent_cursor_from_datetime_based_cursor. Should we extract it into a helper method to follow DRY?
  2. The indentation in the return statement (lines 955-964) seems inconsistent with the rest of the file. Should we align it with the standard indentation?
  3. Instead of using type ignore on line 961, we could properly type message_repository. Wdyt?
     def create_concurrent_cursor_from_perpartition_cursor(
         self,
         state_manager: ConnectorStateManager,
         model_type: Type[BaseModel],
         component_definition: ComponentDefinition,
         stream_name: str,
         stream_namespace: Optional[str],
         config: Config,
         stream_state: MutableMapping[str, Any],
         partition_router,
             **kwargs: Any,
     ) -> ConcurrentPerPartitionCursor:
-        component_type = component_definition.get("type")
-        if component_definition.get("type") != model_type.__name__:
-            raise ValueError(
-                f"Expected manifest component of type {model_type.__name__}, but received {component_type} instead"
-            )
+        self._validate_component_type(component_definition, model_type)

         # ... rest of the method ...

         return ConcurrentPerPartitionCursor(
-                cursor_factory=cursor_factory,
-                partition_router=partition_router,
-                stream_name=stream_name,
-                stream_namespace=stream_namespace,
-                stream_state=stream_state,
-                message_repository=self._message_repository,  # type: ignore
-                connector_state_manager=state_manager,
-                cursor_field=cursor_field,
-            )
+            cursor_factory=cursor_factory,
+            partition_router=partition_router,
+            stream_name=stream_name,
+            stream_namespace=stream_namespace,
+            stream_state=stream_state,
+            message_repository=self._message_repository,
+            connector_state_manager=state_manager,
+            cursor_field=cursor_field,
+        )

Helper method to add:

def _validate_component_type(self, component_definition: ComponentDefinition, expected_type: Type[BaseModel]) -> None:
    component_type = component_definition.get("type")
    if component_definition.get("type") != expected_type.__name__:
        raise ValueError(
            f"Expected manifest component of type {expected_type.__name__}, but received {component_type} instead"
        )
unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (3)

276-276: Should the test case name match the function name for consistency?

In the parameterization at line 276, the test name is "test_incremental_parent_state", but the test function is named test_incremental_parent_state_no_incremental_dependency. Aligning the test case name with the function name could enhance clarity. Wdyt?


483-495: Could we improve the docstring formatting for better readability?

The docstring contains a detailed explanation of the test, but it's indented inconsistently. Adjusting the indentation could enhance readability and maintain consistency with PEP 257 guidelines. Wdyt?


519-523: Is comparing only the last state sufficient to validate the final state?

In the assertion at lines 519-523, we compare final_state[-1] with expected_state. Should we consider comparing all elements of final_state to ensure that all state messages throughout the sync match the expectations? Wdyt?

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 4ddbb84 and dfcf17f.

📒 Files selected for processing (7)
  • airbyte_cdk/sources/declarative/concurrent_declarative_source.py (2 hunks)
  • airbyte_cdk/sources/declarative/extractors/record_filter.py (2 hunks)
  • airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
  • airbyte_cdk/sources/declarative/incremental/per_partition_cursor.py (1 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4 hunks)
  • airbyte_cdk/sources/streams/concurrent/cursor.py (1 hunks)
  • unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte_cdk/sources/declarative/concurrent_declarative_source.py
🔇 Additional comments (2)
airbyte_cdk/sources/declarative/extractors/record_filter.py (1)

78-81: Is it safe to use an empty stream_name when creating Record instances?

In the filter_records method, we're creating Record instances with an empty stream_name. Could this lead to issues if other parts of the codebase expect stream_name to be non-empty? Should we consider passing the actual stream name to prevent potential problems? Wdyt?

airbyte_cdk/sources/streams/concurrent/cursor.py (1)

243-251: Could we clarify the behavior of close_partition_without_emit method?

In the close_partition_without_emit method, we update _has_closed_at_least_one_slice to True, but we don't emit a state message. Is this intended? Should we ensure that the state is eventually emitted elsewhere to keep the state consistent? Wdyt?

@tolik0
Contributor Author

tolik0 commented Dec 6, 2024

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formatting issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

❌ Job failed.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
unit_tests/sources/declarative/extractors/test_record_filter.py (1)

Line range hint 341-432: Comprehensive test coverage for cursor types! 👍

The parameterized tests cover all important cursor scenarios:

  • Basic datetime cursor
  • Global substream cursor (with and without state)
  • Per-partition cursor with various state configurations

One thought: would it be valuable to add a test case for error handling when an invalid cursor type is passed? wdyt?
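A hedged sketch of what such a negative test could look like, using a toy stand-in class, since the real decorator's behavior on an invalid cursor is exactly what the test would pin down:

```python
class ClientSideIncrementalRecordFilterDecoratorStub:
    """Toy stand-in for the real decorator, which now takes a single ``cursor`` argument.

    The validation here is hypothetical: the actual decorator may accept or
    reject invalid cursors differently.
    """

    def __init__(self, cursor):
        if not hasattr(cursor, "should_be_synced"):
            raise ValueError(f"Unsupported cursor type: {type(cursor).__name__}")
        self._cursor = cursor


def test_invalid_cursor_type_raises():
    try:
        ClientSideIncrementalRecordFilterDecoratorStub(cursor=object())
    except ValueError as err:
        assert "Unsupported cursor type" in str(err)
    else:
        raise AssertionError("expected ValueError for invalid cursor")
```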

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between dfcf17f and b84e68a.

📒 Files selected for processing (1)
  • unit_tests/sources/declarative/extractors/test_record_filter.py (2 hunks)
🔇 Additional comments (1)
unit_tests/sources/declarative/extractors/test_record_filter.py (1)

293-293: Nice API simplification! 🎉

I see you've consolidated the cursor parameters (date_time_based_cursor and substream_cursor) into a single cursor parameter. This makes the API cleaner and more flexible. The tests demonstrate it works well with all cursor types (DatetimeBasedCursor, GlobalSubstreamCursor, and PerPartitionWithGlobalCursor).

Also applies to: 431-431

@tolik0
Contributor Author

tolik0 commented Dec 6, 2024

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formatting issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)
airbyte_cdk/sources/declarative/concurrent_declarative_source.py (1)

305-316: Consider extracting condition to a descriptive method

The nested conditions make the code harder to read and maintain. Would you consider extracting these conditions into a separate method with a descriptive name? Wdyt?

Something like:

+def _supports_concurrent_perpartition_cursor(
+    self,
+    declarative_stream: DeclarativeStream,
+    incremental_sync_component_definition: Mapping[str, Any]
+) -> bool:
+    return (
+        incremental_sync_component_definition
+        and incremental_sync_component_definition.get("type", "") == DatetimeBasedCursorModel.__name__
+        and self._stream_supports_concurrent_partition_processing(declarative_stream=declarative_stream)
+        and hasattr(declarative_stream.retriever, "stream_slicer")
+        and isinstance(declarative_stream.retriever.stream_slicer, PerPartitionWithGlobalCursor)
+    )
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between b84e68a and a6b5daa.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • airbyte_cdk/sources/declarative/concurrent_declarative_source.py (2 hunks)
  • airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
  • airbyte_cdk/sources/declarative/incremental/per_partition_cursor.py (1 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4 hunks)
  • unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py
  • airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py
🔇 Additional comments (4)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

84-85: ⚠️ Potential issue

Add missing comma between imports

There's a missing comma between ConcurrentCursorFactory and ConcurrentPerPartitionCursor in the import statement.

Apply this diff:

-    ConcurrentCursorFactory
-    ConcurrentPerPartitionCursor,
+    ConcurrentCursorFactory,
+    ConcurrentPerPartitionCursor,
airbyte_cdk/sources/declarative/incremental/per_partition_cursor.py (2)

Line range hint 320-358: LGTM! Well-structured comparison logic

The implementation includes proper null checks and validation for partition matching. The error messages are clear and descriptive.


306-319: ⚠️ Potential issue

Add null check for record.associated_slice

The code accesses record.associated_slice.partition without verifying that record.associated_slice is not None. This could raise an AttributeError.

Consider adding a null check like this:

def should_be_synced(self, record: Record) -> bool:
+   if not record.associated_slice:
+       raise ValueError("Record must have an associated slice")
    if (
        self._to_partition_key(record.associated_slice.partition)
        not in self._cursor_per_partition
    ):
airbyte_cdk/sources/declarative/concurrent_declarative_source.py (1)

317-331: Consider handling potential None stream_state

When retrieving stream_state, we might want to handle the case where it could be None to prevent unexpected errors during cursor creation. Wdyt?

Let's verify the current handling of stream_state:

#!/bin/bash
# Check how stream_state is handled in other parts of the codebase
rg -A 3 "get_stream_state" | grep -v "test"
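A small defensive normalization along these lines would guarantee cursor creation never receives None (illustrative only; the function name is invented):

```python
from typing import Any, Mapping, Optional


def normalize_stream_state(raw_state: Optional[Mapping[str, Any]]) -> Mapping[str, Any]:
    # Missing or falsy state is treated as an empty mapping, so downstream
    # cursor construction never has to branch on None.
    return dict(raw_state) if raw_state else {}
```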

@tolik0
Contributor Author

tolik0 commented Dec 6, 2024

Contributor

@maxi297 maxi297 left a comment


I'm very happy with the progress on this! I've added a couple of comments more in terms of code maintenance and structure but the functional part seemed fine. I want to check the tests eventually (tomorrow hopefully) but I can still leave a couple of comments here

@tolik0 tolik0 force-pushed the tolik0/concurrent-perpartitioncursor branch from a6b5daa to 79ffb77 Compare December 18, 2024 16:18
@tolik0
Contributor Author

tolik0 commented Dec 18, 2024

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formatting issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
airbyte_cdk/sources/declarative/incremental/per_partition_cursor.py (1)

306-319: ⚠️ Potential issue

Should we add a null check for record.associated_slice?

The code accesses record.associated_slice.partition without verifying if record.associated_slice is not None. This could raise an AttributeError. Consider adding a null check, wdyt?

 def should_be_synced(self, record: Record) -> bool:
+    if not record.associated_slice:
+        raise ValueError("Record must have an associated slice")
     if (
         self._to_partition_key(record.associated_slice.partition)
         not in self._cursor_per_partition
     ):
unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (1)

244-268: ⚠️ Potential issue

Should we pass the state parameter to the read method?

The state parameter is passed to ConcurrentDeclarativeSource initialization but not to the read method (empty list is passed instead). This might affect state handling. Consider using the state parameter in both places, wdyt?

-    messages = list(source.read(logger=source.logger, config=config, catalog=catalog, state=[]))
+    messages = list(source.read(logger=source.logger, config=config, catalog=catalog, state=state))
🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

962-1018: Consider adding error handling for cursor creation

The cursor creation logic could benefit from additional error handling similar to create_concurrent_cursor_from_datetime_based_cursor. For instance, what happens if the cursor factory creation fails? Wdyt?

    def create_concurrent_cursor_from_perpartition_cursor(
        self,
        state_manager: ConnectorStateManager,
        model_type: Type[BaseModel],
        component_definition: ComponentDefinition,
        stream_name: str,
        stream_namespace: Optional[str],
        config: Config,
        stream_state: MutableMapping[str, Any],
        partition_router,
        **kwargs: Any,
    ) -> ConcurrentPerPartitionCursor:
+       try:
            component_type = component_definition.get("type")
            if component_definition.get("type") != model_type.__name__:
                raise ValueError(
                    f"Expected manifest component of type {model_type.__name__}, but received {component_type} instead"
                )

            # Rest of the implementation...

            return ConcurrentPerPartitionCursor(
                cursor_factory=cursor_factory,
                partition_router=partition_router,
                stream_name=stream_name,
                stream_namespace=stream_namespace,
                stream_state=stream_state,
                message_repository=self._message_repository,
                connector_state_manager=state_manager,
                cursor_field=cursor_field,
            )
+       except Exception as e:
+           raise ValueError(
+               f"Failed to create concurrent cursor for stream '{stream_name}': {str(e)}"
+           ) from e
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between a6b5daa and a36726b.

📒 Files selected for processing (6)
  • airbyte_cdk/sources/declarative/concurrent_declarative_source.py (2 hunks)
  • airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
  • airbyte_cdk/sources/declarative/incremental/per_partition_cursor.py (1 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (7 hunks)
  • airbyte_cdk/sources/streams/concurrent/cursor.py (1 hunks)
  • unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • airbyte_cdk/sources/streams/concurrent/cursor.py
  • airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py
🔇 Additional comments (5)
unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (3)

482-528: LGTM! Well-structured test with clear documentation.

The test case is comprehensive and effectively validates the parent state behavior when incremental dependency is disabled.


530-630: LGTM! Excellent test helper implementation.

The helper function is well-documented and implements thorough validation steps for testing incremental parent state behavior.


1124-1556: LGTM! Comprehensive test coverage.

The test functions thoroughly cover different scenarios including state migration and empty record sets. The test cases are well-structured and documented.

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)

Line range hint 814-961: LGTM! Well-structured implementation with proper error handling.

The implementation of create_concurrent_cursor_from_datetime_based_cursor is thorough and handles all edge cases appropriately.


84-85: ⚠️ Potential issue

Add missing comma between imports

There seems to be a missing comma between ConcurrentCursorFactory and ConcurrentPerPartitionCursor. Wdyt?

-    ConcurrentCursorFactory
-    ConcurrentPerPartitionCursor,
+    ConcurrentCursorFactory,
+    ConcurrentPerPartitionCursor,

Likely invalid or redundant comment.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (5)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (5)

31-32: Consider adding more specific type hints for create_function.

The create_function parameter could benefit from more specific type hints. Would you consider updating it to explicitly show the expected parameter types? Wdyt?

-    def __init__(self, create_function: Callable[..., ConcurrentCursor]):
+    def __init__(
+        self,
+        create_function: Callable[[Mapping[str, Any], Optional[Any]], ConcurrentCursor]
+    ):

42-54: Consider enhancing documentation with thread safety details.

The class uses locks and semaphores, but the documentation doesn't mention thread safety considerations. Would you like to add a section about thread safety and concurrent access patterns? Wdyt?


57-58: Consider using frozen dictionaries for constants.

The _NO_STATE and _NO_CURSOR_STATE are mutable mappings. To prevent accidental modifications, would you consider using MappingProxyType or frozendict? Wdyt?

-    _NO_STATE: Mapping[str, Any] = {}
-    _NO_CURSOR_STATE: Mapping[str, Any] = {}
+    _NO_STATE: Mapping[str, Any] = MappingProxyType({})
+    _NO_CURSOR_STATE: Mapping[str, Any] = MappingProxyType({})

102-121: Consider optimizing state copying strategy.

The state property performs deep copies of cursor states for each partition. For better performance with large states, would you consider:

  1. Using shallow copies where safe
  2. Implementing copy-on-write
  3. Caching the state if it hasn't changed

Wdyt?


306-314: Enhance error messages for better debugging.

The error messages could be more specific about what caused the invalid state. Would you consider including more context in the error messages? Wdyt?

-                "Invalid state as stream slices that are emitted should refer to an existing cursor"
+                f"Invalid state: Partition key '{partition_key}' not found in cursor_per_partition. This might indicate a race condition or premature record emission."
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 9574f8c and 36c4992.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Analyze (python)
🔇 Additional comments (1)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1)

216-218: ⚠️ Potential issue

Review partition limit logic.

The limit_reached method returns true when _over_limit > DEFAULT_MAX_PARTITIONS_NUMBER, but _over_limit is incremented each time we remove a partition. This means we'll only return true after exceeding the limit by the full amount. Should this be >= instead? Wdyt?

-        return self._over_limit > self.DEFAULT_MAX_PARTITIONS_NUMBER
+        return self._over_limit >= self.DEFAULT_MAX_PARTITIONS_NUMBER

Likely invalid or redundant comment.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (4)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (3)

64-98: Consider adding type hints for class variables.

The initialization is well-structured, but adding type hints for class variables would improve code clarity and help catch potential type-related issues early. Wdyt about adding them like this?

+    _cursor_per_partition: OrderedDict[str, ConcurrentCursor] = OrderedDict()
+    _semaphore_per_partition: OrderedDict[str, threading.Semaphore] = OrderedDict()
+    _finished_partitions: set[str] = set()
+    _global_cursor: Optional[StreamState] = {}
+    _new_global_cursor: Optional[StreamState] = None
+    _lookback_window: int = 0
+    _parent_state: Optional[StreamState] = None
+    _over_limit: int = 0

104-123: Consider optimizing state generation for large number of partitions.

The state property performs deep copies of cursor states for each partition. This could become a performance bottleneck with many partitions. Would it make sense to optimize this by:

  1. Processing states incrementally?
  2. Using shallow copies where possible?
  3. Implementing a caching mechanism?

Wdyt?


224-291: Consider adding error handling for invalid state format.

The _set_initial_state method has good documentation but could benefit from explicit error handling for invalid state formats. For example:

     if not stream_state:
         return

+    if not isinstance(stream_state, dict):
+        raise ValueError(f"Invalid state format. Expected dict, got {type(stream_state)}")
+
     if self._PERPARTITION_STATE_KEY not in stream_state:
airbyte_cdk/sources/declarative/concurrent_declarative_source.py (1)

296-349: Consider extracting complex condition to a method.

The condition for checking if a stream supports concurrent partition processing with global cursor is quite complex. Would it make sense to extract it to a separate method for better readability? Something like:

+    def _supports_concurrent_partition_with_global_cursor(
+        self,
+        declarative_stream: DeclarativeStream,
+        incremental_sync_component_definition: Mapping[str, Any],
+    ) -> bool:
+        return (
+            incremental_sync_component_definition
+            and incremental_sync_component_definition.get("type", "")
+            == DatetimeBasedCursorModel.__name__
+            and self._stream_supports_concurrent_partition_processing(
+                declarative_stream=declarative_stream
+            )
+            and hasattr(declarative_stream.retriever, "stream_slicer")
+            and isinstance(
+                declarative_stream.retriever.stream_slicer, PerPartitionWithGlobalCursor
+            )
+        )
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 36c4992 and daa6873.

📒 Files selected for processing (3)
  • airbyte_cdk/sources/declarative/concurrent_declarative_source.py (5 hunks)
  • airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
  • airbyte_cdk/sources/declarative/incremental/global_substream_cursor.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Analyze (python)
🔇 Additional comments (5)
airbyte_cdk/sources/declarative/incremental/global_substream_cursor.py (1)

67-69: LGTM! Simple and effective implementation.

The is_running method provides a clean way to check the timer's state. The implementation is thread-safe and follows the single responsibility principle.

airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (3)

30-39: LGTM! Clean factory implementation.

The factory pattern is well implemented with a clear separation of concerns. The create method signature is explicit about its parameters and return type.


208-220: Verify partition limit logic.

In _ensure_partition_limit, we increment _over_limit every time we remove a partition when the limit is reached. However, this could lead to edge cases where partitions are unnecessarily dropped. Should we consider:

  1. Resetting _over_limit when partitions are successfully processed?
  2. Using a different metric for tracking over-limit conditions?

Wdyt?
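For illustration, here is a minimal, standalone sketch of the eviction pattern being discussed: an ordered mapping capped at a maximum partition count, evicting the oldest entry and counting drops (the class and attribute names here are hypothetical simplifications, not the CDK's actual implementation):

```python
from collections import OrderedDict

DEFAULT_MAX_PARTITIONS_NUMBER = 3  # kept small for demonstration


class PartitionLimiter:
    """Sketch of LRU-style partition eviction that counts dropped partitions."""

    def __init__(self) -> None:
        self.cursor_per_partition: "OrderedDict[str, dict]" = OrderedDict()
        self.dropped_partitions = 0

    def _ensure_partition_limit(self) -> None:
        # Make room for one new partition by evicting the oldest entries.
        while len(self.cursor_per_partition) > DEFAULT_MAX_PARTITIONS_NUMBER - 1:
            self.dropped_partitions += 1
            oldest_key, _ = self.cursor_per_partition.popitem(last=False)

    def add(self, key: str, cursor: dict) -> None:
        self._ensure_partition_limit()
        self.cursor_per_partition[key] = cursor
```

Under this shape, `dropped_partitions` only ever grows, which matches the concern above: without a reset, a long-running sync that cycles through many partitions will eventually trip any `limit_reached` check based on the raw counter.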


221-223: Potential off-by-one error in limit check.

The limit_reached method returns True when _over_limit > DEFAULT_MAX_PARTITIONS_NUMBER. Should this be >= instead? The current logic means we need to exceed the limit by more than one partition before considering it reached. Wdyt?

airbyte_cdk/sources/declarative/concurrent_declarative_source.py (1)

440-459: LGTM! Well-documented optimization.

The _get_retriever method contains important optimizations for cursor and state management. The comments clearly explain the rationale behind the changes.

@tolik0
Copy link
Contributor Author

tolik0 commented Jan 17, 2025

Regression tests for Jira: GitHub Action

Failure: test_all_records_are_the_same_with_state

Issues:

  1. project_avatars: projectId is added by a transformation, so id does not uniquely identify the record.
  2. screen_tab_fields: Similar to project_avatars, screenId and tabId are added by a transformation, so id does not uniquely identify the record.

Second Jira run - no errors
Third Jira run

Regression tests for Snapchat-marketing - 1, 2.
The error with delivery status is because it is a list with elements in random order; I got the same result locally.

Regression tests for harvest - 1, 2.

Regression tests for Klaviyo: Github Action

Regression tests for Chargebee (no incremental substreams): GitHub Action

Failure: The contact stream read test failed because the stream was updated to run concurrently. It no longer uses resumable full refresh state, which resulted in an increased number of records.

Catalog Changes:
4 streams were updated: attached_item, contact, quote_line_group, and subscription_with_scheduled_changes. These are full refresh streams with parent streams.

Contributor

@maxi297 maxi297 left a comment


When you say It no longer uses resumable full refresh state, which resulted in an increased number of records. for source-chargebee contact stream, it is when executed with a state value as an input, correct?

If the above is true, I'm all good with this change. Thanks a lot for the due diligence on this one.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (2)

139-141: 🛠️ Refactor suggestion

Avoid accessing private attributes of Semaphore

In this section, the code accesses the private attribute _value of threading.Semaphore, which is intended to be internal to the class. Direct access to private attributes may lead to maintenance issues and unexpected behavior. Could we use an alternative approach to check the semaphore's state without accessing _value? For instance, we might track the semaphore count separately or restructure the logic to avoid the need for this check. Wdyt?


154-156: 🛠️ Refactor suggestion

Avoid accessing private attributes of Semaphore

Similar to a previous comment, accessing the _value attribute of threading.Semaphore directly is not recommended as it is meant to be private. Perhaps we could find another way to determine if all semaphores have a value of zero without accessing private attributes. This could involve maintaining a separate counter or using higher-level methods provided by the semaphore. Wdyt?
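One concrete way to follow both of these suggestions is a small wrapper that maintains its own count next to the semaphore, exposing it through a public property instead of reading `Semaphore._value` (this wrapper is a hypothetical sketch, not an existing CDK class):

```python
import threading


class CountingSemaphore:
    """Semaphore wrapper that exposes its current value without
    touching the private ``_value`` attribute of threading.Semaphore."""

    def __init__(self, value: int = 0) -> None:
        self._semaphore = threading.Semaphore(value)
        self._count = value
        self._lock = threading.Lock()

    def acquire(self) -> None:
        self._semaphore.acquire()
        with self._lock:
            self._count -= 1

    def release(self) -> None:
        with self._lock:
            self._count += 1
        self._semaphore.release()

    @property
    def value(self) -> int:
        # Public, lock-protected read replaces `semaphore._value == 0` checks.
        with self._lock:
            return self._count
```

A check like `all(sem._value == 0 for sem in semaphores)` then becomes `all(sem.value == 0 for sem in semaphores)` with no reliance on CPython internals.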

🧹 Nitpick comments (3)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (2)

93-94: Consider typing _lookback_window as Optional[int]

Currently, _lookback_window is typed as int and initialized to 0, but elsewhere in the code, it is compared against None, implying that it can be None. Perhaps updating the type annotation to Optional[int] and initializing it to None would improve clarity and prevent potential type inconsistencies. Wdyt?


213-220: Reevaluate the necessity of the _over_limit variable

The _over_limit variable is incremented whenever the partition limit is exceeded but appears to be used only in the log message. If it's not contributing to the functionality beyond logging, perhaps we could remove it or consider logging the count of dropped partitions directly. This might simplify the code and reduce unnecessary state management. Wdyt?

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

944-955: Consider copying stream_state before modification

In this segment, stream_state is modified in place when applying the runtime_lookback_window. Modifying input parameters directly can lead to unintended side effects elsewhere in the code if stream_state is used after this function call. Perhaps creating a copy of stream_state before applying the modification would prevent potential issues. Wdyt?
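The defensive-copy idea can be as small as one `dict()` call at the top of the function. A minimal sketch (the helper name and parameters are illustrative, not the factory's actual signature):

```python
from typing import Any, Mapping, MutableMapping


def apply_lookback_window(
    stream_state: Mapping[str, Any],
    cursor_field: str,
    adjusted_value: Any,
) -> MutableMapping[str, Any]:
    # Shallow-copy the incoming state so the caller's mapping is untouched.
    adjusted_state = dict(stream_state)
    adjusted_state[cursor_field] = adjusted_value
    return adjusted_state
```

The caller keeps its original mapping intact and receives the adjusted copy, so any later reads of the input `stream_state` see the pre-adjustment values.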

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between daa6873 and c827d82.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (8 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (2)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)

1026-1026: Verify the fallback for message_repository

Here, message_repository defaults to self._message_repository if not provided. Could there be cases where self._message_repository is not initialized, leading to potential AttributeErrors? Perhaps adding a check to ensure self._message_repository is available or providing a sensible default would enhance robustness. Wdyt?


2201-2204: Simplify the type checking condition

The current condition for checking the type of stream_slicer is a bit complex and might not handle subclasses correctly:

if (
    not isinstance(stream_slicer, DatetimeBasedCursor)
    or type(stream_slicer) is not DatetimeBasedCursor
) and not isinstance(stream_slicer, PerPartitionWithGlobalCursor):

Would it be clearer and more robust to simplify the condition using isinstance with a tuple of allowed types?

if not isinstance(stream_slicer, (DatetimeBasedCursor, PerPartitionWithGlobalCursor)):

This approach ensures that any subclass of the allowed types is also accepted. Wdyt?

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (5)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1)

66-100: Consider documenting thread safety expectations.

The class uses threading primitives like Lock and Semaphore, but the thread safety expectations and guarantees aren't documented. Would it be helpful to add a section in the docstring about thread safety considerations and any assumptions made? This could help future maintainers understand the concurrency model better.

airbyte_cdk/sources/declarative/concurrent_declarative_source.py (2)

441-460: Consider enhancing method documentation.

The _get_retriever method contains important optimizations for cursor and state management. Would it be helpful to add more detailed docstring explaining:

  • The purpose of zeroing out the cursor
  • Why state management is handled through ConcurrentCursor
  • The implications for components like StopConditionPaginationStrategyDecorator
    Wdyt?

297-350: Consider extracting cursor creation logic.

The stream grouping logic contains complex nested conditions for cursor creation. Would it make sense to extract this into a separate method like _create_perpartition_cursor to improve readability and maintainability? This would also make the logic easier to test. Wdyt?

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)

990-1000: Consider adding error handling for state parsing.

When adjusting the stream state with runtime lookback window, we parse and format timestamps without explicit error handling. Should we add try-catch blocks to handle potential parsing errors and provide meaningful error messages? Wdyt?


1084-1139: Document NoopMessageRepository usage.

The method uses NoopMessageRepository when creating the cursor factory. Would it be helpful to add a comment explaining why we use a no-op repository here instead of passing through the original message repository? This design choice might not be immediately obvious to other developers. Wdyt?

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between c827d82 and 471ff7e.

📒 Files selected for processing (3)
  • airbyte_cdk/sources/declarative/concurrent_declarative_source.py (5 hunks)
  • airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (8 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (6)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (6)

30-40: LGTM! Clean factory implementation.

The factory pattern is well implemented with clear separation of concerns.


106-125: Potential performance issue when generating state.

The state property method iterates over self._cursor_per_partition.items() and performs deep copies of cursor states. This could become a performance bottleneck with a large number of partitions. Should we consider optimizing this by avoiding deep copies or processing states incrementally? Wdyt?


127-150: Avoid accessing private attributes like _value of Semaphore.

In the check if self._semaphore_per_partition[partition_key]._value == 0, we're accessing the internal _value attribute of threading.Semaphore, which is intended to be private. Should we consider an alternative approach to track the semaphore state without accessing private attributes? Wdyt?


175-182: LGTM! Good error handling for timer state.

The check for already running timer prevents potential issues with multiple executions.


210-222: Clarification on partition limit logic.

In _ensure_partition_limit, we increment _over_limit every time we remove a partition when the limit is reached. However, in limit_reached, we check if _over_limit > DEFAULT_MAX_PARTITIONS_NUMBER. Is this the intended behavior? Should the condition be adjusted to properly reflect when the limit is truly exceeded? Wdyt?


295-303: Should we check if record.associated_slice is None before accessing partition?

In the observe method, we access record.associated_slice.partition, but if record.associated_slice is None, this could raise an error. Would it make sense to include a check to handle cases where associated_slice might be None? Wdyt?
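A guard like the one proposed could look like this; the `Record`/`StreamSlice` dataclasses below are simplified stand-ins for the CDK's types, included only so the sketch is self-contained:

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass
class StreamSlice:
    partition: Mapping[str, Any]


@dataclass
class Record:
    data: Mapping[str, Any]
    associated_slice: Optional[StreamSlice]


def partition_for_record(record: Record) -> Mapping[str, Any]:
    # Fail with a descriptive error instead of an AttributeError on None.
    if record.associated_slice is None:
        raise ValueError(
            f"Record {record.data!r} has no associated slice; "
            "cannot resolve its partition"
        )
    return record.associated_slice.partition
```

This converts a bare `AttributeError: 'NoneType' object has no attribute 'partition'` into an error message that names the offending record.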

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (5)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (3)

89-91: Should we add type annotations to these instance variables?

Adding type hints would improve code clarity and help catch potential type-related issues early:

-        self._cursor_per_partition = OrderedDict()
-        self._semaphore_per_partition = OrderedDict()
-        self._finished_partitions = set()
+        self._cursor_per_partition: OrderedDict[str, ConcurrentCursor] = OrderedDict()
+        self._semaphore_per_partition: OrderedDict[str, threading.Semaphore] = OrderedDict()
+        self._finished_partitions: set[str] = set()

107-125: Consider optimizing state generation to avoid deep copies.

The state property performs deep copies of cursor states for each partition. For large numbers of partitions, this could become a performance bottleneck. Should we consider:

  1. Using shallow copies where possible?
  2. Implementing incremental state updates?
  3. Caching the state if it hasn't changed?

Wdyt?
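Option 3 (caching) can be sketched as an invalidate-on-write pattern: rebuild the serialized state only when a partition cursor has changed since the last read. The class below is a hypothetical, self-contained illustration, not the `ConcurrentPerPartitionCursor` itself:

```python
import threading
from typing import Any, Dict, Mapping


class CachedState:
    """Sketch of invalidate-on-write caching for an expensive state property."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._dirty = True
        self._cached: Mapping[str, Any] = {}
        self._cursor_per_partition: Dict[str, Dict[str, Any]] = {}

    def update_partition(self, key: str, cursor_state: Dict[str, Any]) -> None:
        with self._lock:
            self._cursor_per_partition[key] = cursor_state
            self._dirty = True  # invalidate the cache on every write

    @property
    def state(self) -> Mapping[str, Any]:
        with self._lock:
            if self._dirty:
                # Rebuild (and copy) only when something changed since last read.
                self._cached = {
                    "states": [
                        {"partition": key, "cursor": dict(cursor)}
                        for key, cursor in self._cursor_per_partition.items()
                    ]
                }
                self._dirty = False
            return self._cached
```

Repeated reads between writes return the cached object, so the copy cost is paid once per change rather than once per `state` access.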


211-223: Should we clarify the partition limit logic?

In _ensure_partition_limit, we increment _over_limit when removing partitions that exceed the limit. However, this counter seems to be used differently in other parts of the code. Would it be helpful to:

  1. Add a comment explaining the purpose of _over_limit?
  2. Consider renaming it to better reflect its role (e.g., _dropped_partitions_count)?

Wdyt?

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)

999-1010: Consider adding error handling for state adjustment.

When adjusting the stream state with runtime lookback window, we should consider:

  1. Adding validation for negative lookback windows
  2. Handling potential parsing errors from parse_timestamp
  3. Adding logging for state adjustments

Wdyt?
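All three suggestions can be combined in one small helper: validate the window, wrap the parse in a try/except, and log the adjustment. The function name, format string, and seconds-based window below are assumptions for the sketch, not the factory's actual API:

```python
import logging
from datetime import datetime, timedelta

logger = logging.getLogger(__name__)


def adjust_cursor_with_lookback(
    cursor_value: str,
    lookback_window: int,
    fmt: str = "%Y-%m-%dT%H:%M:%SZ",
) -> str:
    """Shift a cursor timestamp back by ``lookback_window`` seconds,
    validating inputs and logging the adjustment."""
    if lookback_window < 0:
        raise ValueError(
            f"Lookback window must be non-negative, got {lookback_window}"
        )
    try:
        parsed = datetime.strptime(cursor_value, fmt)
    except (TypeError, ValueError) as exc:
        raise ValueError(
            f"Unparseable cursor value {cursor_value!r} for format {fmt!r}"
        ) from exc
    adjusted = parsed - timedelta(seconds=lookback_window)
    logger.debug("Adjusted cursor from %s to %s", cursor_value, adjusted)
    return adjusted.strftime(fmt)
```

Invalid windows and malformed timestamps both surface as `ValueError` with context, instead of an unhandled exception deep inside cursor creation.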


2377-2377: Should we simplify the type checking condition?

The current condition:

not isinstance(stream_slicer, DatetimeBasedCursor) or type(stream_slicer) is not DatetimeBasedCursor

Could be simplified to:

not isinstance(stream_slicer, (DatetimeBasedCursor, PerPartitionWithGlobalCursor))

This would be more maintainable and handle subclasses correctly. Wdyt?

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 471ff7e and 19fe9c4.

📒 Files selected for processing (4)
  • airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (8 hunks)
  • airbyte_cdk/sources/streams/concurrent/cursor.py (7 hunks)
  • unit_tests/sources/declarative/parsers/test_model_to_component_factory.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • airbyte_cdk/sources/streams/concurrent/cursor.py
  • unit_tests/sources/declarative/parsers/test_model_to_component_factory.py
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (2)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1)

42-56: LGTM! Well-documented class with clear docstrings.

The class documentation clearly explains the purpose, attributes, and key concepts of partition limitation logic and global cursor fallback. The docstring follows best practices with a clear structure.

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

932-933: LGTM! Good addition of optional parameters.

The addition of message_repository and runtime_lookback_window parameters enhances flexibility in cursor creation.

@tolik0 tolik0 merged commit 4459243 into main Jan 23, 2025
19 of 23 checks passed
@tolik0 tolik0 deleted the tolik0/concurrent-perpartitioncursor branch January 23, 2025 08:53
rpopov pushed a commit to rpopov/airbyte-python-cdk that referenced this pull request Jan 23, 2025
* remotes/airbyte/main:
  fix(airbyte-cdk): Fix RequestOptionsProvider for PerPartitionWithGlobalCursor (airbytehq#254)
  feat(low-code): add profile assertion flow to oauth authenticator component (airbytehq#236)
  feat(Low-Code Concurrent CDK): Add ConcurrentPerPartitionCursor (airbytehq#111)
  fix: don't mypy unit_tests (airbytehq#241)
  fix: handle backoff_strategies in CompositeErrorHandler (airbytehq#225)
  feat(concurrent cursor): attempt at clamping datetime (airbytehq#234)
  ci: use `ubuntu-24.04` explicitly (resolves CI warnings) (airbytehq#244)
  Fix(sdm): module ref issue in python components import (airbytehq#243)
  feat(source-declarative-manifest): add support for custom Python components from dynamic text input (airbytehq#174)
  chore(deps): bump avro from 1.11.3 to 1.12.0 (airbytehq#133)
  docs: comments on what the `Dockerfile` is for (airbytehq#240)
  chore: move ruff configuration to dedicated ruff.toml file (airbytehq#237)
Labels
enhancement New feature or request