feat(Low-Code Concurrent CDK): Add ConcurrentPerPartitionCursor #111
Conversation
📝 Walkthrough
The pull request introduces significant enhancements to the Airbyte CDK's concurrent declarative source handling, focusing on improving incremental sync capabilities for low-code connectors. The changes primarily revolve around extending support for different cursor types, particularly the new ConcurrentPerPartitionCursor.
Changes
Possibly related PRs
Suggested reviewers
Hey there! 👋 I noticed some interesting improvements in the concurrent declarative source handling. A few questions for you:
Would love to hear your thoughts! 🚀
✨ Finishing Touches
Actionable comments posted: 9
🧹 Outside diff range and nitpick comments (5)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1)
59-66: Consider simplifying the constructor parameters
The __init__ method of ConcurrentPerPartitionCursor has many parameters, which can make it harder to maintain and understand. Would it be beneficial to encapsulate related parameters into data classes or reduce the number of parameters if possible? Wdyt?
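If it helps, here is a minimal sketch of the data-class idea — the field names are illustrative and not taken from the actual signature:

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class CursorStreamContext:
    """Bundles the stream-level arguments that always travel together (illustrative names)."""

    stream_name: str
    stream_namespace: Optional[str]
    stream_state: Mapping[str, Any]
    cursor_field: str


class PerPartitionCursorSketch:
    """Stand-in for the real cursor; it takes one context object plus the collaborators."""

    def __init__(self, context: CursorStreamContext, cursor_factory: Any, partition_router: Any) -> None:
        self._context = context
        self._cursor_factory = cursor_factory
        self._partition_router = partition_router
```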
airbyte_cdk/sources/declarative/concurrent_declarative_source.py (2)
23-25: Unused import statement
We import PerPartitionWithGlobalCursor, but it's not used elsewhere in the code. Should we remove this import to keep the code clean? Wdyt?
309-361: Refactor duplicated code in the _group_streams method
There seems to be duplicated code in the conditional blocks handling DatetimeBasedCursorModel streams. Could we refactor these blocks into a helper function to reduce redundancy and improve readability? Wdyt?
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
915-970: Consider reducing code duplication
The methods create_concurrent_cursor_from_datetime_based_cursor and create_concurrent_cursor_from_perpartition_cursor share similar logic. Could we extract the common parts into a shared helper function to simplify maintenance? Wdyt?
airbyte_cdk/sources/streams/concurrent/cursor.py (1)
243-251: Consider extracting common logic between close methods?
The new close_partition_without_emit shares a lot of logic with close_partition. What do you think about extracting the common logic into a private method to reduce duplication? Something like this, wdyt?

```diff
+    def _close_partition_internal(self, partition: Partition) -> bool:
+        slice_count_before = len(self.state.get("slices", []))
+        self._add_slice_to_state(partition)
+        should_merge = slice_count_before < len(self.state["slices"])
+        if should_merge:
+            self._merge_partitions()
+        self._has_closed_at_least_one_slice = True
+        return should_merge
+
     def close_partition_without_emit(self, partition: Partition) -> None:
-        slice_count_before = len(self.state.get("slices", []))
-        self._add_slice_to_state(partition)
-        if slice_count_before < len(
-            self.state["slices"]
-        ):  # only emit if at least one slice has been processed
-            self._merge_partitions()
-        self._has_closed_at_least_one_slice = True
+        self._close_partition_internal(partition)

     def close_partition(self, partition: Partition) -> None:
-        slice_count_before = len(self.state.get("slices", []))
-        self._add_slice_to_state(partition)
-        if slice_count_before < len(
-            self.state["slices"]
-        ):  # only emit if at least one slice has been processed
-            self._merge_partitions()
+        if self._close_partition_internal(partition):
             self._emit_state_message()
-        self._has_closed_at_least_one_slice = True
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (7)
- airbyte_cdk/sources/declarative/concurrent_declarative_source.py (2 hunks)
- airbyte_cdk/sources/declarative/incremental/__init__.py (2 hunks)
- airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
- airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3 hunks)
- airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (1 hunks)
- airbyte_cdk/sources/declarative/stream_slicers/declarative_partition_generator.py (1 hunks)
- airbyte_cdk/sources/streams/concurrent/cursor.py (1 hunks)
🔇 Additional comments (5)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (2)
25-29: Question about returning only the first cursor
In the ConcurrentCursorFactory, the create method returns only the first cursor from _create_function. Since _create_function returns a tuple of cursors, should we ensure that we're not discarding any necessary cursors? Perhaps we should handle all cursors returned. Wdyt?
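To make the trade-off explicit, here is a sketch of the two options — the tuple shape returned by _create_function is an assumption here, not taken from the actual code:

```python
from typing import Any, Callable, Mapping, Tuple


class CursorFactorySketch:
    """Illustrative only: either document the discard or return the whole tuple."""

    def __init__(self, create_function: Callable[..., Tuple[Any, ...]]) -> None:
        self._create_function = create_function

    def create_first_only(self, stream_state: Mapping[str, Any]) -> Any:
        cursor, *_discarded = self._create_function(stream_state)
        # the unpacking makes it obvious that everything beyond the first element is dropped
        return cursor

    def create_all(self, stream_state: Mapping[str, Any]) -> Tuple[Any, ...]:
        # alternative: hand the full tuple back so the caller decides what to keep
        return self._create_function(stream_state)
```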
221-224: Ensure partition key serialization handles edge cases
In _to_partition_key and _to_dict, we use self._partition_serializer. Do we need to ensure that partition keys are properly sanitized or encoded to handle special characters and prevent serialization issues? Wdyt?
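For reference, a small sketch of one way to get stable, collision-free keys regardless of key order or special characters (the real _partition_serializer may already behave like this):

```python
import json
from typing import Any, Mapping


def to_partition_key(partition: Mapping[str, Any]) -> str:
    # sort_keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} serialize identically,
    # and ensure_ascii escapes special characters deterministically
    return json.dumps(partition, sort_keys=True, ensure_ascii=True)


def from_partition_key(key: str) -> Mapping[str, Any]:
    return json.loads(key)


assert to_partition_key({"a": 1, "b": "é"}) == to_partition_key({"b": "é", "a": 1})
```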
airbyte_cdk/sources/declarative/concurrent_declarative_source.py (1)
317-318: Type checking with isinstance
We use isinstance(declarative_stream.retriever.stream_slicer, PerPartitionWithGlobalCursor) to check the type. Should we consider using duck typing or interface checks to allow for subclasses or alternative implementations that fulfill the same protocol? Wdyt?
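If duck typing is preferred, a runtime-checkable Protocol could express the requirement without naming a concrete class — a sketch, with an illustrative method name:

```python
from typing import Any, Iterable, Mapping, Protocol, runtime_checkable


@runtime_checkable
class SupportsPerPartitionSlicing(Protocol):
    """Anything that can produce partition slices qualifies, not just PerPartitionWithGlobalCursor."""

    def stream_slices(self) -> Iterable[Mapping[str, Any]]:
        ...


def supports_perpartition_cursor(stream_slicer: Any) -> bool:
    # isinstance on a runtime_checkable Protocol only checks that the method exists
    return isinstance(stream_slicer, SupportsPerPartitionSlicing)
```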
airbyte_cdk/sources/declarative/stream_slicers/declarative_partition_generator.py (1)
41-66: Clarify the purpose of DeclarativePartitionFactory1
The new class DeclarativePartitionFactory1 appears to be similar to DeclarativePartitionFactory but with a different approach to handling the retriever. Is this intended to replace the original factory or serve a different purpose? Providing more context or renaming the class for clarity might help. Wdyt?
airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (1)
181-181: LGTM!
The change to use request_option_provider instead of stream_slicer improves modularity by separating concerns.
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (5)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1)
152-156: Could we simplify the partition limit condition in the _ensure_partition_limit method?
Currently, the condition is len(self._cursor_per_partition) > self.DEFAULT_MAX_PARTITIONS_NUMBER - 1. Would it improve readability to use len(self._cursor_per_partition) >= self.DEFAULT_MAX_PARTITIONS_NUMBER instead? This change might make the maximum partition limit clearer. Wdyt?
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
910-965: Consider extracting common validation logic and fixing indentation.
A few suggestions to improve this new method:
- The component type validation logic (lines 922-926) is duplicated from create_concurrent_cursor_from_datetime_based_cursor. Should we extract it into a helper method to follow DRY?
- The indentation in the return statement (lines 955-964) seems inconsistent with the rest of the file. Should we align it with the standard indentation?
- Instead of using type ignore on line 961, we could properly type message_repository. Wdyt?

```diff
 def create_concurrent_cursor_from_perpartition_cursor(
     self,
     state_manager: ConnectorStateManager,
     model_type: Type[BaseModel],
     component_definition: ComponentDefinition,
     stream_name: str,
     stream_namespace: Optional[str],
     config: Config,
     stream_state: MutableMapping[str, Any],
     partition_router,
     **kwargs: Any,
 ) -> ConcurrentPerPartitionCursor:
-    component_type = component_definition.get("type")
-    if component_definition.get("type") != model_type.__name__:
-        raise ValueError(
-            f"Expected manifest component of type {model_type.__name__}, but received {component_type} instead"
-        )
+    self._validate_component_type(component_definition, model_type)
     # ... rest of the method ...
     return ConcurrentPerPartitionCursor(
-        cursor_factory=cursor_factory,
-        partition_router=partition_router,
-        stream_name=stream_name,
-        stream_namespace=stream_namespace,
-        stream_state=stream_state,
-        message_repository=self._message_repository,  # type: ignore
-        connector_state_manager=state_manager,
-        cursor_field=cursor_field,
-    )
+        cursor_factory=cursor_factory,
+        partition_router=partition_router,
+        stream_name=stream_name,
+        stream_namespace=stream_namespace,
+        stream_state=stream_state,
+        message_repository=self._message_repository,
+        connector_state_manager=state_manager,
+        cursor_field=cursor_field,
+    )
```

Helper method to add:

```python
def _validate_component_type(
    self, component_definition: ComponentDefinition, expected_type: Type[BaseModel]
) -> None:
    component_type = component_definition.get("type")
    if component_definition.get("type") != expected_type.__name__:
        raise ValueError(
            f"Expected manifest component of type {expected_type.__name__}, but received {component_type} instead"
        )
```
unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (3)
276-276: Should the test case name match the function name for consistency?
In the parameterization at line 276, the test name is "test_incremental_parent_state", but the test function is named test_incremental_parent_state_no_incremental_dependency. Aligning the test case name with the function name could enhance clarity. Wdyt?
483-495: Could we improve the docstring formatting for better readability?
The docstring contains a detailed explanation of the test, but it's indented inconsistently. Adjusting the indentation could enhance readability and maintain consistency with PEP 257 guidelines. Wdyt?
519-523: Is comparing only the last state sufficient to validate the final state?
In the assertion at lines 519-523, we compare final_state[-1] with expected_state. Should we consider comparing all elements of final_state to ensure that all state messages throughout the sync match the expectations? Wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (7)
- airbyte_cdk/sources/declarative/concurrent_declarative_source.py (2 hunks)
- airbyte_cdk/sources/declarative/extractors/record_filter.py (2 hunks)
- airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
- airbyte_cdk/sources/declarative/incremental/per_partition_cursor.py (1 hunks)
- airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4 hunks)
- airbyte_cdk/sources/streams/concurrent/cursor.py (1 hunks)
- unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- airbyte_cdk/sources/declarative/concurrent_declarative_source.py
🔇 Additional comments (2)
airbyte_cdk/sources/declarative/extractors/record_filter.py (1)
78-81: Is it safe to use an empty stream_name when creating Record instances?
In the filter_records method, we're creating Record instances with an empty stream_name. Could this lead to issues if other parts of the codebase expect stream_name to be non-empty? Should we consider passing the actual stream name to prevent potential problems? Wdyt?
airbyte_cdk/sources/streams/concurrent/cursor.py (1)
243-251: Could we clarify the behavior of the close_partition_without_emit method?
In the close_partition_without_emit method, we update _has_closed_at_least_one_slice to True, but we don't emit a state message. Is this intended? Should we ensure that the state is eventually emitted elsewhere to keep the state consistent? Wdyt?
/autofix
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
unit_tests/sources/declarative/extractors/test_record_filter.py (1)
Line range hint 341-432: Comprehensive test coverage for cursor types! 👍
The parameterized tests cover all important cursor scenarios:
- Basic datetime cursor
- Global substream cursor (with and without state)
- Per-partition cursor with various state configurations
One thought: would it be valuable to add a test case for error handling when an invalid cursor type is passed? wdyt?
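Purely as an illustration of the shape such a test could take (the filter class here is a stand-in, not the real RecordFilter API):

```python
import pytest


class _FilterStandIn:
    """Minimal stand-in that only validates the cursor type, mimicking the suggested check."""

    _ALLOWED_CURSOR_TYPES = (dict,)  # placeholder for the real cursor classes

    def __init__(self, cursor):
        if not isinstance(cursor, self._ALLOWED_CURSOR_TYPES):
            raise ValueError(f"Unsupported cursor type: {type(cursor).__name__}")
        self._cursor = cursor


def test_invalid_cursor_type_raises():
    with pytest.raises(ValueError, match="Unsupported cursor type"):
        _FilterStandIn(cursor=object())
```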
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- unit_tests/sources/declarative/extractors/test_record_filter.py (2 hunks)
🔇 Additional comments (1)
unit_tests/sources/declarative/extractors/test_record_filter.py (1)
293-293: Nice API simplification! 🎉
I see you've consolidated the cursor parameters (date_time_based_cursor and substream_cursor) into a single cursor parameter. This makes the API cleaner and more flexible. The tests demonstrate it works well with all cursor types (DatetimeBasedCursor, GlobalSubstreamCursor, and PerPartitionWithGlobalCursor).
Also applies to: 431-431
/autofix
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (1)
airbyte_cdk/sources/declarative/concurrent_declarative_source.py (1)
305-316: Consider extracting condition to a descriptive method
The nested conditions make the code harder to read and maintain. Would you consider extracting these conditions into a separate method with a descriptive name? Wdyt?
Something like:

```diff
+def _supports_concurrent_perpartition_cursor(
+    self,
+    declarative_stream: DeclarativeStream,
+    incremental_sync_component_definition: Mapping[str, Any],
+) -> bool:
+    return (
+        incremental_sync_component_definition
+        and incremental_sync_component_definition.get("type", "") == DatetimeBasedCursorModel.__name__
+        and self._stream_supports_concurrent_partition_processing(declarative_stream=declarative_stream)
+        and hasattr(declarative_stream.retriever, "stream_slicer")
+        and isinstance(declarative_stream.retriever.stream_slicer, PerPartitionWithGlobalCursor)
+    )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (1)
- poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
- airbyte_cdk/sources/declarative/concurrent_declarative_source.py (2 hunks)
- airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
- airbyte_cdk/sources/declarative/incremental/per_partition_cursor.py (1 hunks)
- airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4 hunks)
- unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py
- airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py
🔇 Additional comments (4)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
84-85: Add missing comma between imports
There's a missing comma between ConcurrentCursorFactory and ConcurrentPerPartitionCursor in the import statement.
Apply this diff:

```diff
-    ConcurrentCursorFactory
-    ConcurrentPerPartitionCursor,
+    ConcurrentCursorFactory,
+    ConcurrentPerPartitionCursor,
```
airbyte_cdk/sources/declarative/incremental/per_partition_cursor.py (2)
Line range hint 320-358: LGTM! Well-structured comparison logic
The implementation includes proper null checks and validation for partition matching. The error messages are clear and descriptive.
306-319: Add null check for record.associated_slice
The code accesses record.associated_slice.partition without verifying that record.associated_slice is not None. This could raise an AttributeError.
Consider adding a null check like this:

```diff
 def should_be_synced(self, record: Record) -> bool:
+    if not record.associated_slice:
+        raise ValueError("Record must have an associated slice")
     if (
         self._to_partition_key(record.associated_slice.partition)
         not in self._cursor_per_partition
     ):
```
airbyte_cdk/sources/declarative/concurrent_declarative_source.py (1)
317-331: Consider handling potential None stream_state
When retrieving stream_state, we might want to handle the case where it could be None to prevent unexpected errors during cursor creation. Wdyt?
Let's verify the current handling of stream_state:

```bash
#!/bin/bash
# Check how stream_state is handled in other parts of the codebase
rg -A 3 "get_stream_state" | grep -v "test"
```
I'm very happy with the progress on this! I've added a couple more comments, mostly about code maintenance and structure, but the functional part seems fine. I want to check the tests eventually (tomorrow, hopefully), but I can still leave a couple of comments here.
a6b5daa to 79ffb77
/autofix
Actionable comments posted: 2
♻️ Duplicate comments (2)
airbyte_cdk/sources/declarative/incremental/per_partition_cursor.py (1)
306-319: ⚠️ Potential issue — Should we add a null check for record.associated_slice?
The code accesses record.associated_slice.partition without verifying if record.associated_slice is not None. This could raise an AttributeError. Consider adding a null check, wdyt?

```diff
 def should_be_synced(self, record: Record) -> bool:
+    if not record.associated_slice:
+        raise ValueError("Record must have an associated slice")
     if (
         self._to_partition_key(record.associated_slice.partition)
         not in self._cursor_per_partition
     ):
```
unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (1)
244-268: ⚠️ Potential issue — Should we pass the state parameter to the read method?
The state parameter is passed to ConcurrentDeclarativeSource initialization but not to the read method (an empty list is passed instead). This might affect state handling. Consider using the state parameter in both places, wdyt?

```diff
- messages = list(source.read(logger=source.logger, config=config, catalog=catalog, state=[]))
+ messages = list(source.read(logger=source.logger, config=config, catalog=catalog, state=state))
```
🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
962-1018: Consider adding error handling for cursor creation
The cursor creation logic could benefit from additional error handling similar to create_concurrent_cursor_from_datetime_based_cursor. For instance, what happens if the cursor factory creation fails? Wdyt?

```diff
 def create_concurrent_cursor_from_perpartition_cursor(
     self,
     state_manager: ConnectorStateManager,
     model_type: Type[BaseModel],
     component_definition: ComponentDefinition,
     stream_name: str,
     stream_namespace: Optional[str],
     config: Config,
     stream_state: MutableMapping[str, Any],
     partition_router,
     **kwargs: Any,
 ) -> ConcurrentPerPartitionCursor:
+    try:
         component_type = component_definition.get("type")
         if component_definition.get("type") != model_type.__name__:
             raise ValueError(
                 f"Expected manifest component of type {model_type.__name__}, but received {component_type} instead"
             )
         # Rest of the implementation...
         return ConcurrentPerPartitionCursor(
             cursor_factory=cursor_factory,
             partition_router=partition_router,
             stream_name=stream_name,
             stream_namespace=stream_namespace,
             stream_state=stream_state,
             message_repository=self._message_repository,
             connector_state_manager=state_manager,
             cursor_field=cursor_field,
         )
+    except Exception as e:
+        raise ValueError(
+            f"Failed to create concurrent cursor for stream '{stream_name}': {str(e)}"
+        ) from e
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- airbyte_cdk/sources/declarative/concurrent_declarative_source.py (2 hunks)
- airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
- airbyte_cdk/sources/declarative/incremental/per_partition_cursor.py (1 hunks)
- airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (7 hunks)
- airbyte_cdk/sources/streams/concurrent/cursor.py (1 hunks)
- unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- airbyte_cdk/sources/streams/concurrent/cursor.py
- airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py
🔇 Additional comments (5)
unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (3)
482-528: LGTM! Well-structured test with clear documentation.
The test case is comprehensive and effectively validates the parent state behavior when incremental dependency is disabled.
530-630: LGTM! Excellent test helper implementation.
The helper function is well-documented and implements thorough validation steps for testing incremental parent state behavior.
1124-1556: LGTM! Comprehensive test coverage.
The test functions thoroughly cover different scenarios, including state migration and empty record sets. The test cases are well-structured and documented.
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)
Line range hint 814-961: LGTM! Well-structured implementation with proper error handling.
The implementation of create_concurrent_cursor_from_datetime_based_cursor is thorough and handles all edge cases appropriately.
84-85: Add missing comma between imports
There seems to be a missing comma between ConcurrentCursorFactory and ConcurrentPerPartitionCursor. Wdyt?

```diff
-    ConcurrentCursorFactory
-    ConcurrentPerPartitionCursor,
+    ConcurrentCursorFactory,
+    ConcurrentPerPartitionCursor,
```
Likely invalid or redundant comment.
Actionable comments posted: 0
🧹 Nitpick comments (5)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (5)
31-32: Consider adding more specific type hints for create_function.
The create_function parameter could benefit from more specific type hints. Would you consider updating it to explicitly show the expected parameter types? Wdyt?

```diff
- def __init__(self, create_function: Callable[..., ConcurrentCursor]):
+ def __init__(
+     self,
+     create_function: Callable[[Mapping[str, Any], Optional[Any]], ConcurrentCursor]
+ ):
```
42-54: Consider enhancing documentation with thread safety details.
The class uses locks and semaphores, but the documentation doesn't mention thread safety considerations. Would you like to add a section about thread safety and concurrent access patterns? Wdyt?
57-58: Consider using frozen dictionaries for constants.
The _NO_STATE and _NO_CURSOR_STATE are mutable mappings. To prevent accidental modifications, would you consider using MappingProxyType or frozendict? Wdyt?

```diff
- _NO_STATE: Mapping[str, Any] = {}
- _NO_CURSOR_STATE: Mapping[str, Any] = {}
+ _NO_STATE: Mapping[str, Any] = MappingProxyType({})
+ _NO_CURSOR_STATE: Mapping[str, Any] = MappingProxyType({})
```
102-121: Consider optimizing state copying strategy.
The state property performs deep copies of cursor states for each partition. For better performance with large states, would you consider:
- Using shallow copies where safe
- Implementing copy-on-write
- Caching the state if it hasn't changed
Wdyt? A rough sketch of the caching option is below.
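Something along these lines could work — a sketch only, assuming the per-partition cursor states are exposed as plain mappings (the real class stores ConcurrentCursor objects):

```python
import copy
import threading
from typing import Any, Mapping, MutableMapping


class CachedStateSketch:
    """Recompute the aggregated state only after a partition cursor actually changed."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._cursor_state_per_partition: MutableMapping[str, Mapping[str, Any]] = {}
        self._state_dirty = True
        self._state_cache: Mapping[str, Any] = {"states": []}

    def update_partition(self, partition_key: str, cursor_state: Mapping[str, Any]) -> None:
        with self._lock:
            self._cursor_state_per_partition[partition_key] = cursor_state
            self._state_dirty = True

    @property
    def state(self) -> Mapping[str, Any]:
        with self._lock:
            if self._state_dirty:
                # pay for the deep copies only when something changed since the last read
                self._state_cache = {
                    "states": [
                        {"partition": key, "cursor": copy.deepcopy(value)}
                        for key, value in self._cursor_state_per_partition.items()
                    ]
                }
                self._state_dirty = False
            return self._state_cache
```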
306-314: Enhance error messages for better debugging.
The error messages could be more specific about what caused the invalid state. Would you consider including more context in the error messages? Wdyt?

```diff
- "Invalid state as stream slices that are emitted should refer to an existing cursor"
+ f"Invalid state: Partition key '{partition_key}' not found in cursor_per_partition. This might indicate a race condition or premature record emission."
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Analyze (python)
🔇 Additional comments (1)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1)
216-218: ⚠️ Potential issue — Review partition limit logic.
The limit_reached method returns true when _over_limit > DEFAULT_MAX_PARTITIONS_NUMBER, but _over_limit is incremented each time we remove a partition. This means we'll only return true after exceeding the limit by the full amount. Should this be >= instead? Wdyt?

```diff
- return self._over_limit > self.DEFAULT_MAX_PARTITIONS_NUMBER
+ return self._over_limit >= self.DEFAULT_MAX_PARTITIONS_NUMBER
```
Likely invalid or redundant comment.
Actionable comments posted: 0
🧹 Nitpick comments (4)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (3)
64-98: Consider adding type hints for class variables.
The initialization is well-structured, but adding type hints for class variables would improve code clarity and help catch potential type-related issues early. Wdyt about adding them like this?

```diff
- _cursor_per_partition: OrderedDict[str, ConcurrentCursor] = OrderedDict()
- _semaphore_per_partition: OrderedDict[str, threading.Semaphore] = OrderedDict()
- _finished_partitions: set[str] = set()
- _global_cursor: Optional[StreamState] = {}
- _new_global_cursor: Optional[StreamState] = None
- _lookback_window: int = 0
- _parent_state: Optional[StreamState] = None
- _over_limit: int = 0
```
104-123: Consider optimizing state generation for a large number of partitions.
The state property performs deep copies of cursor states for each partition. This could become a performance bottleneck with many partitions. Would it make sense to optimize this by:
- Processing states incrementally?
- Using shallow copies where possible?
- Implementing a caching mechanism?
Wdyt?
224-291: Consider adding error handling for invalid state format.
The _set_initial_state method has good documentation but could benefit from explicit error handling for invalid state formats. For example:

```diff
 if not stream_state:
     return
+if not isinstance(stream_state, dict):
+    raise ValueError(f"Invalid state format. Expected dict, got {type(stream_state)}")
 if self._PERPARTITION_STATE_KEY not in stream_state:
```
airbyte_cdk/sources/declarative/concurrent_declarative_source.py (1)
296-349: Consider extracting complex condition to a method.
The condition for checking if a stream supports concurrent partition processing with a global cursor is quite complex. Would it make sense to extract it to a separate method for better readability? Something like:

```diff
+    def _supports_concurrent_partition_with_global_cursor(
+        self,
+        declarative_stream: DeclarativeStream,
+        incremental_sync_component_definition: Mapping[str, Any],
+    ) -> bool:
+        return (
+            incremental_sync_component_definition
+            and incremental_sync_component_definition.get("type", "")
+            == DatetimeBasedCursorModel.__name__
+            and self._stream_supports_concurrent_partition_processing(
+                declarative_stream=declarative_stream
+            )
+            and hasattr(declarative_stream.retriever, "stream_slicer")
+            and isinstance(
+                declarative_stream.retriever.stream_slicer, PerPartitionWithGlobalCursor
+            )
+        )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- airbyte_cdk/sources/declarative/concurrent_declarative_source.py (5 hunks)
- airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
- airbyte_cdk/sources/declarative/incremental/global_substream_cursor.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Analyze (python)
🔇 Additional comments (5)
airbyte_cdk/sources/declarative/incremental/global_substream_cursor.py (1)
67-69: LGTM! Simple and effective implementation.
The is_running method provides a clean way to check the timer's state. The implementation is thread-safe and follows the single responsibility principle.
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (3)
30-39: LGTM! Clean factory implementation.
The factory pattern is well implemented with a clear separation of concerns. The create method signature is explicit about its parameters and return type.
208-220: Verify partition limit logic.
In _ensure_partition_limit, we increment _over_limit every time we remove a partition when the limit is reached. However, this could lead to edge cases where partitions are unnecessarily dropped. Should we consider:
- Resetting _over_limit when partitions are successfully processed?
- Using a different metric for tracking over-limit conditions?
Wdyt?
221-223: Potential off-by-one error in limit check.
The limit_reached method returns True when _over_limit > DEFAULT_MAX_PARTITIONS_NUMBER. Should this be >= instead? The current logic means we need to exceed the limit by more than one partition before considering it reached. Wdyt?
airbyte_cdk/sources/declarative/concurrent_declarative_source.py (1)
440-459: LGTM! Well-documented optimization.
The _get_retriever method contains important optimizations for cursor and state management. The comments clearly explain the rationale behind the changes.
Regression tests for Jira: GitHub Action. Failure: Issues:
Second Jira run - no errors.
Regression tests for Snapchat-marketing - 1, 2.
Regression tests for harvest - 1, 2.
Regression tests for Klaviyo: Github Action.
Regression tests for Chargebee (no incremental substreams): GitHub Action. Failure: The Catalog Changes:
When you say "It no longer uses resumable full refresh state, which resulted in an increased number of records." for the source-chargebee contact stream, that is when it is executed with a state value as an input, correct?
If the above is true, I'm all good with this change. Thanks a lot for the due diligence on this one.
Actionable comments posted: 0
♻️ Duplicate comments (2)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (2)
139-141: 🛠️ Refactor suggestion — Avoid accessing private attributes of Semaphore
In this section, the code accesses the private attribute _value of threading.Semaphore, which is intended to be internal to the class. Direct access to private attributes may lead to maintenance issues and unexpected behavior. Could we use an alternative approach to check the semaphore's state without accessing _value? For instance, we might track the semaphore count separately or restructure the logic to avoid the need for this check. Wdyt?
154-156: 🛠️ Refactor suggestion — Avoid accessing private attributes of Semaphore
Similar to a previous comment, accessing the _value attribute of threading.Semaphore directly is not recommended, as it is meant to be private. Perhaps we could find another way to determine if all semaphores have a value of zero without accessing private attributes. This could involve maintaining a separate counter or using higher-level methods provided by the semaphore. Wdyt?
🧹 Nitpick comments (3)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (2)
93-94: Consider typing _lookback_window as Optional[int]
Currently, _lookback_window is typed as int and initialized to 0, but elsewhere in the code it is compared against None, implying that it can be None. Perhaps updating the type annotation to Optional[int] and initializing it to None would improve clarity and prevent potential type inconsistencies. Wdyt?
213-220: Reevaluate the necessity of the _over_limit variable
The _over_limit variable is incremented whenever the partition limit is exceeded but appears to be used only in the log message. If it's not contributing to the functionality beyond logging, perhaps we could remove it or consider logging the count of dropped partitions directly. This might simplify the code and reduce unnecessary state management. Wdyt?
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
944-955: Consider copying stream_state before modification
In this segment, stream_state is modified in place when applying the runtime_lookback_window. Modifying input parameters directly can lead to unintended side effects elsewhere in the code if stream_state is used after this function call. Perhaps creating a copy of stream_state before applying the modification would prevent potential issues. Wdyt?
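A defensive copy keeps the caller's mapping untouched — a sketch only, with the lookback application reduced to an illustrative placeholder key:

```python
import copy
from typing import Any, Mapping, MutableMapping


def apply_runtime_lookback(
    stream_state: Mapping[str, Any], runtime_lookback_window: int
) -> MutableMapping[str, Any]:
    # work on a copy so the caller's stream_state is not mutated in place
    adjusted_state: MutableMapping[str, Any] = copy.deepcopy(dict(stream_state))
    adjusted_state["lookback_window"] = runtime_lookback_window  # illustrative field name
    return adjusted_state


original = {"updated_at": "2024-01-01T00:00:00Z"}
adjusted = apply_runtime_lookback(original, runtime_lookback_window=86400)
assert "lookback_window" not in original  # the input is untouched
```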
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
- airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (8 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (2)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)
1026-1026: Verify the fallback for message_repository
Here, message_repository defaults to self._message_repository if not provided. Could there be cases where self._message_repository is not initialized, leading to potential AttributeErrors? Perhaps adding a check to ensure self._message_repository is available or providing a sensible default would enhance robustness. Wdyt?
2201-2204: Simplify the type checking condition
The current condition for checking the type of stream_slicer is a bit complex and might not handle subclasses correctly:

```python
if (
    not isinstance(stream_slicer, DatetimeBasedCursor)
    or type(stream_slicer) is not DatetimeBasedCursor
) and not isinstance(stream_slicer, PerPartitionWithGlobalCursor):
```

Would it be clearer and more robust to simplify the condition using isinstance with a tuple of allowed types?

```python
if not isinstance(stream_slicer, (DatetimeBasedCursor, PerPartitionWithGlobalCursor)):
```

This approach ensures that any subclass of the allowed types is also accepted. Wdyt?
Actionable comments posted: 0
🧹 Nitpick comments (5)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1)
66-100: Consider documenting thread safety expectations.
The class uses threading primitives like Lock and Semaphore, but the thread safety expectations and guarantees aren't documented. Would it be helpful to add a section in the docstring about thread safety considerations and any assumptions made? This could help future maintainers understand the concurrency model better.
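Even a short docstring section would go a long way — the wording below is only a suggestion, and the class name here is a stand-in:

```python
class ConcurrentPerPartitionCursorDocSketch:
    """
    ...existing class description...

    Thread safety:
        - The per-partition bookkeeping and the ``state`` property are guarded by an internal lock.
        - One semaphore per partition tracks in-flight slices, so ``close_partition`` may be
          called concurrently from multiple worker threads.
        - Callers must not mutate the mappings returned by ``state``.
    """
```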
airbyte_cdk/sources/declarative/concurrent_declarative_source.py (2)
441-460: Consider enhancing method documentation.
The _get_retriever method contains important optimizations for cursor and state management. Would it be helpful to add a more detailed docstring explaining:
- The purpose of zeroing out the cursor
- Why state management is handled through ConcurrentCursor
- The implications for components like StopConditionPaginationStrategyDecorator
Wdyt?
297-350: Consider extracting cursor creation logic.
The stream grouping logic contains complex nested conditions for cursor creation. Would it make sense to extract this into a separate method like _create_perpartition_cursor to improve readability and maintainability? This would also make the logic easier to test. Wdyt?
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)
990-1000: Consider adding error handling for state parsing.
When adjusting the stream state with the runtime lookback window, we parse and format timestamps without explicit error handling. Should we add try-catch blocks to handle potential parsing errors and provide meaningful error messages? Wdyt?
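For illustration, wrapping the parse with a pointed error message could look like this — datetime.fromisoformat stands in for whatever parser the factory actually uses:

```python
from datetime import datetime
from typing import Any, Mapping


def parse_cursor_timestamp(stream_state: Mapping[str, Any], cursor_field: str) -> datetime:
    """Fail with a descriptive error instead of a bare traceback when the stored cursor is malformed."""
    raw_value = stream_state.get(cursor_field)
    try:
        return datetime.fromisoformat(str(raw_value))
    except (TypeError, ValueError) as exc:
        raise ValueError(
            f"Could not parse cursor value {raw_value!r} for field '{cursor_field}' "
            "while applying the runtime lookback window"
        ) from exc
```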
1084-1139: Document NoopMessageRepository usage.
The method uses NoopMessageRepository when creating the cursor factory. Would it be helpful to add a comment explaining why we use a no-op repository here instead of passing through the original message repository? This design choice might not be immediately obvious to other developers. Wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- airbyte_cdk/sources/declarative/concurrent_declarative_source.py (5 hunks)
- airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
- airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (8 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (6)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (6)
30-40: LGTM! Clean factory implementation.
The factory pattern is well implemented with clear separation of concerns.
106-125: Potential performance issue when generating state.
The state property method iterates over self._cursor_per_partition.items() and performs deep copies of cursor states. This could become a performance bottleneck with a large number of partitions. Should we consider optimizing this by avoiding deep copies or processing states incrementally? Wdyt?
127-150: Avoid accessing private attributes like _value of Semaphore.
In the check if self._semaphore_per_partition[partition_key]._value == 0, we're accessing the internal _value attribute of threading.Semaphore, which is intended to be private. Should we consider an alternative approach to track the semaphore state without accessing private attributes? Wdyt?
175-182: LGTM! Good error handling for timer state.
The check for an already running timer prevents potential issues with multiple executions.
210-222: Clarification on partition limit logic.
In _ensure_partition_limit, we increment _over_limit every time we remove a partition when the limit is reached. However, in limit_reached, we check if _over_limit > DEFAULT_MAX_PARTITIONS_NUMBER. Is this the intended behavior? Should the condition be adjusted to properly reflect when the limit is truly exceeded? Wdyt?
295-303: Should we check if record.associated_slice is None before accessing partition?
In the observe method, we access record.associated_slice.partition, but if record.associated_slice is None, this could raise an error. Would it make sense to include a check to handle cases where associated_slice might be None? Wdyt?
Actionable comments posted: 0
🧹 Nitpick comments (5)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (3)
89-91: Should we add type annotations to these instance variables?
Adding type hints would improve code clarity and help catch potential type-related issues early:

```diff
- self._cursor_per_partition: OrderedDict[str, ConcurrentCursor] = OrderedDict()
- self._semaphore_per_partition: OrderedDict[str, threading.Semaphore] = OrderedDict()
- self._finished_partitions: set[str] = set()
+ self._cursor_per_partition: OrderedDict[str, ConcurrentCursor] = OrderedDict()
+ self._semaphore_per_partition: OrderedDict[str, threading.Semaphore] = OrderedDict()
+ self._finished_partitions: set[str] = set()
```
104-123: Consider optimizing state generation to avoid deep copies.
The state property performs deep copies of cursor states for each partition. For large numbers of partitions, this could become a performance bottleneck. Should we consider:
- Using shallow copies where possible?
- Implementing incremental state updates?
- Caching the state if it hasn't changed?
Wdyt?
211-223: Should we clarify the partition limit logic?
In _ensure_partition_limit, we increment _over_limit when removing partitions that exceed the limit. However, this counter seems to be used differently in other parts of the code. Would it be helpful to:
- Add a comment explaining the purpose of _over_limit?
- Consider renaming it to better reflect its role (e.g., _dropped_partitions_count)?
Wdyt?
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)
999-1010: Consider adding error handling for state adjustment.
When adjusting the stream state with the runtime lookback window, we should consider:
- Adding validation for negative lookback windows
- Handling potential parsing errors from parse_timestamp
- Adding logging for state adjustments
Wdyt? A small sketch of the validation side is below.
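Something small like this might do — the logger name and message wording are made up for illustration:

```python
import logging
from datetime import timedelta

logger = logging.getLogger("airbyte")


def validate_runtime_lookback_window(lookback_in_seconds: int) -> timedelta:
    """Reject negative lookback windows and log the adjustment that will be applied."""
    if lookback_in_seconds < 0:
        raise ValueError(f"runtime_lookback_window must be non-negative, got {lookback_in_seconds}")
    if lookback_in_seconds:
        logger.info("Applying runtime lookback window of %s seconds", lookback_in_seconds)
    return timedelta(seconds=lookback_in_seconds)
```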
2377-2377: Should we simplify the type checking condition?
The current condition:

```python
not isinstance(stream_slicer, DatetimeBasedCursor) or type(stream_slicer) is not DatetimeBasedCursor
```

could be simplified to:

```python
not isinstance(stream_slicer, (DatetimeBasedCursor, PerPartitionWithGlobalCursor))
```

This would be more maintainable and handle subclasses correctly. Wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
- airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (8 hunks)
- airbyte_cdk/sources/streams/concurrent/cursor.py (7 hunks)
- unit_tests/sources/declarative/parsers/test_model_to_component_factory.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- airbyte_cdk/sources/streams/concurrent/cursor.py
- unit_tests/sources/declarative/parsers/test_model_to_component_factory.py
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (2)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1)
42-56: LGTM! Well-documented class with clear docstrings.
The class documentation clearly explains the purpose, attributes, and key concepts of partition limitation logic and global cursor fallback. The docstring follows best practices with a clear structure.
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
932-933: LGTM! Good addition of optional parameters.
The addition of message_repository and runtime_lookback_window parameters enhances flexibility in cursor creation.
* remotes/airbyte/main:
  - fix(airbyte-cdk): Fix RequestOptionsProvider for PerPartitionWithGlobalCursor (airbytehq#254)
  - feat(low-code): add profile assertion flow to oauth authenticator component (airbytehq#236)
  - feat(Low-Code Concurrent CDK): Add ConcurrentPerPartitionCursor (airbytehq#111)
  - fix: don't mypy unit_tests (airbytehq#241)
  - fix: handle backoff_strategies in CompositeErrorHandler (airbytehq#225)
  - feat(concurrent cursor): attempt at clamping datetime (airbytehq#234)
  - ci: use `ubuntu-24.04` explicitly (resolves CI warnings) (airbytehq#244)
  - Fix(sdm): module ref issue in python components import (airbytehq#243)
  - feat(source-declarative-manifest): add support for custom Python components from dynamic text input (airbytehq#174)
  - chore(deps): bump avro from 1.11.3 to 1.12.0 (airbytehq#133)
  - docs: comments on what the `Dockerfile` is for (airbytehq#240)
  - chore: move ruff configuration to dedicated ruff.toml file (airbytehq#237)
Summary by CodeRabbit
New Features
Improvements
Technical Updates
ConcurrentDeclarativeSource and improved error handling in tests.