
refactor!: remove Base prefix from abstract class names #980

Merged (2 commits, Feb 12, 2025)
2 changes: 1 addition & 1 deletion docs/guides/http_clients.mdx
@@ -47,4 +47,4 @@ python -m pip install 'crawlee[all]'

## How HTTP clients work

We provide an abstract base class, <ApiLink to="class/BaseHttpClient">`BaseHttpClient`</ApiLink>, which defines the necessary interface for all HTTP clients. HTTP clients are responsible for sending requests and receiving responses, as well as managing cookies, headers, and proxies. They provide methods that are called from crawlers. To implement your own HTTP client, inherit from the <ApiLink to="class/BaseHttpClient">`BaseHttpClient`</ApiLink> class and implement the required methods.
We provide an abstract base class, <ApiLink to="class/HttpClient">`HttpClient`</ApiLink>, which defines the necessary interface for all HTTP clients. HTTP clients are responsible for sending requests and receiving responses, as well as managing cookies, headers, and proxies. They provide methods that are called from crawlers. To implement your own HTTP client, inherit from the <ApiLink to="class/HttpClient">`HttpClient`</ApiLink> class and implement the required methods.
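For illustration, a minimal sketch of a custom client follows. The abstract method names and signatures shown (`crawl`, `send_request`) are assumptions inferred from this PR, not a verified contract; check the `HttpClient` API reference before relying on them.

```python
from __future__ import annotations

from crawlee.http_clients import HttpClient


class MyHttpClient(HttpClient):
    """A hypothetical client; wire it up to the HTTP library of your choice.

    Method names and signatures below are illustrative assumptions.
    """

    async def crawl(self, request, *, session=None, proxy_info=None, statistics=None):
        # Called by crawlers for each request; build and return the crawling result here.
        raise NotImplementedError('delegate to your HTTP library and wrap the response')

    async def send_request(self, url, *, method='GET', headers=None, payload=None, session=None, proxy_info=None):
        # Backs `context.send_request` inside request handlers.
        raise NotImplementedError
```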
4 changes: 2 additions & 2 deletions docs/guides/storages.mdx
@@ -30,7 +30,7 @@ Crawlee offers multiple storage types for managing and persisting your crawling

## Storage clients

Storage clients in Crawlee are subclasses of <ApiLink to="class/BaseStorageClient">`BaseStorageClient`</ApiLink>. They handle interactions with different storage backends. For instance:
Storage clients in Crawlee are subclasses of <ApiLink to="class/StorageClient">`StorageClient`</ApiLink>. They handle interactions with different storage backends. For instance:

- <ApiLink to="class/MemoryStorageClient">`MemoryStorageClient`</ApiLink>: Stores data in memory and persists it to the local file system.
- [`ApifyStorageClient`](https://docs.apify.com/sdk/python/reference/class/ApifyStorageClient): Manages storage on the [Apify platform](https://apify.com). Apify storage client is implemented in the [Apify SDK](https://github.com/apify/apify-sdk-python).
@@ -52,7 +52,7 @@ where:
- `{STORAGE_ID}`: The ID of the specific storage instance (default: `default`).

:::info NOTE
The current <ApiLink to="class/MemoryStorageClient">`MemoryStorageClient`</ApiLink> and its interface is quite old and not great. We plan to refactor it, together with the whole <ApiLink to="class/BaseStorageClient">`BaseStorageClient`</ApiLink> interface in the near future and it better and and easier to use. We also plan to introduce new storage clients for different storage backends - e.g. for [SQLLite](https://sqlite.org/).
The current <ApiLink to="class/MemoryStorageClient">`MemoryStorageClient`</ApiLink> and its interface are quite old and not great. We plan to refactor it, together with the whole <ApiLink to="class/StorageClient">`StorageClient`</ApiLink> interface, in the near future to make it better and easier to use. We also plan to introduce new storage clients for different storage backends - e.g. for [SQLite](https://sqlite.org/).
:::

You can override default storage IDs using these environment variables: `CRAWLEE_DEFAULT_DATASET_ID`, `CRAWLEE_DEFAULT_KEY_VALUE_STORE_ID`, or `CRAWLEE_DEFAULT_REQUEST_QUEUE_ID`.
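As a usage sketch, a storage client can be chosen explicitly instead of relying on the default; the `from_config()` constructor and the `crawlee.crawlers` import path are assumptions, so adjust them to the actual API:

```python
from crawlee.crawlers import BasicCrawler
from crawlee.storage_clients import MemoryStorageClient

# Explicitly pick the storage backend for this crawler.
crawler = BasicCrawler(storage_client=MemoryStorageClient.from_config())
```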
7 changes: 7 additions & 0 deletions docs/upgrading/upgrading_to_v0x.md
@@ -13,6 +13,13 @@ This section summarizes the breaking changes between v0.5.x and v0.6.0.

The `Configuration` fields `chrome_executable_path`, `xvfb`, and `verbose_log` have been removed. The `chrome_executable_path` and `xvfb` fields were unused, while `verbose_log` can be replaced by setting `log_level` to `DEBUG`.

### Abstract base classes

We decided to move away from [Hungarian notation](https://en.wikipedia.org/wiki/Hungarian_notation) and removed the `Base` prefix from all abstract class names. This affects the following public classes (a rename sketch follows the list):
- `BaseStorageClient` -> `StorageClient`
- `BaseBrowserController` -> `BrowserController`
- `BaseBrowserPlugin` -> `BrowserPlugin`
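For code that imports these classes directly, the upgrade is a mechanical rename. For example, with the storage client (the other renames follow the same pattern):

```python
# Before (v0.5.x)
from crawlee.storage_clients import BaseStorageClient

def configure(client: BaseStorageClient) -> None: ...

# After (v0.6.0)
from crawlee.storage_clients import StorageClient

def configure(client: StorageClient) -> None: ...
```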

## Upgrading to v0.5

This section summarizes the breaking changes between v0.4.x and v0.5.0.
10 changes: 5 additions & 5 deletions src/crawlee/_service_locator.py
@@ -4,7 +4,7 @@
from crawlee.configuration import Configuration
from crawlee.errors import ServiceConflictError
from crawlee.events import EventManager
from crawlee.storage_clients import BaseStorageClient
from crawlee.storage_clients import StorageClient


@docs_group('Classes')
@@ -17,7 +17,7 @@ class ServiceLocator:
    def __init__(self) -> None:
        self._configuration: Configuration | None = None
        self._event_manager: EventManager | None = None
        self._storage_client: BaseStorageClient | None = None
        self._storage_client: StorageClient | None = None

        # Flags to check if the services were already set.
        self._configuration_was_retrieved = False
@@ -74,7 +74,7 @@ def set_event_manager(self, event_manager: EventManager) -> None:

        self._event_manager = event_manager

    def get_storage_client(self) -> BaseStorageClient:
    def get_storage_client(self) -> StorageClient:
        """Get the storage client."""
        if self._storage_client is None:
            from crawlee.storage_clients import MemoryStorageClient
@@ -88,7 +88,7 @@ def get_storage_client(self) -> BaseStorageClient:
        self._storage_client_was_retrieved = True
        return self._storage_client

    def set_storage_client(self, storage_client: BaseStorageClient) -> None:
    def set_storage_client(self, storage_client: StorageClient) -> None:
        """Set the storage client.

        Args:
@@ -98,7 +98,7 @@ def set_storage_client(self, storage_client: BaseStorageClient) -> None:
            ServiceConflictError: If the storage client has already been retrieved before.
        """
        if self._storage_client_was_retrieved:
            raise ServiceConflictError(BaseStorageClient, storage_client, self._storage_client)
            raise ServiceConflictError(StorageClient, storage_client, self._storage_client)

        self._storage_client = storage_client

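The retrieval flags above enforce a set-before-get ordering. A sketch of the intended usage follows; `service_locator` is the module-level instance exported from `crawlee` (as seen in the Playwright plugin diff below), while the `from_config()` constructor is an assumption:

```python
from crawlee import service_locator
from crawlee.storage_clients import MemoryStorageClient

# Register the storage client before anything retrieves it...
service_locator.set_storage_client(MemoryStorageClient.from_config())

# ...because after the first retrieval, another set_storage_client call
# raises ServiceConflictError (guarded by _storage_client_was_retrieved).
storage_client = service_locator.get_storage_client()
```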
22 changes: 15 additions & 7 deletions src/crawlee/browsers/__init__.py
@@ -1,11 +1,19 @@
try:
from crawlee._utils.try_import import install_import_hook as _install_import_hook
from crawlee._utils.try_import import try_import as _try_import

_install_import_hook(__name__)

# The following imports are wrapped in try_import to handle optional dependencies,
# ensuring the module can still function even if these dependencies are missing.
with _try_import(__name__, 'BrowserPool'):
    from ._browser_pool import BrowserPool
with _try_import(__name__, 'PlaywrightBrowserController'):
    from ._playwright_browser_controller import PlaywrightBrowserController
with _try_import(__name__, 'PlaywrightBrowserPlugin'):
    from ._playwright_browser_plugin import PlaywrightBrowserPlugin
except ImportError as exc:
    raise ImportError(
        "To import this, you need to install the 'playwright' extra. "
        "For example, if you use pip, run `pip install 'crawlee[playwright]'`.",
    ) from exc

__all__ = ['BrowserPool', 'PlaywrightBrowserController', 'PlaywrightBrowserPlugin']
__all__ = [
    'BrowserPool',
    'PlaywrightBrowserController',
    'PlaywrightBrowserPlugin',
]
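The `try_import` helper is internal to crawlee; conceptually it swallows the `ImportError` and lets the hook installed by `install_import_hook` raise a targeted "install the extra" message when a missing name is first accessed. A rough sketch of the idea, not the actual implementation:

```python
import contextlib
from collections.abc import Iterator


@contextlib.contextmanager
def try_import(module_name: str, *symbol_names: str) -> Iterator[None]:
    """Suppress ImportError for optional dependencies.

    In crawlee, the failed `symbol_names` are recorded so that the hook set up by
    `install_import_hook(module_name)` can report which extra to install on first access.
    """
    try:
        yield
    except ImportError:
        pass  # the real helper registers `symbol_names` for lazy error reporting
```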
@@ -15,8 +15,8 @@
    from crawlee.proxy_configuration import ProxyInfo


class BaseBrowserController(ABC):
    """An abstract class for managing browser instance and their pages."""
class BrowserController(ABC):
    """An abstract base class for managing browser instances and their pages."""

    AUTOMATION_LIBRARY: str | None = None
    """The name of the automation library that the controller is using."""
@@ -9,11 +9,11 @@
    from collections.abc import Mapping
    from types import TracebackType

    from crawlee.browsers._base_browser_controller import BaseBrowserController
    from crawlee.browsers._browser_controller import BrowserController
    from crawlee.browsers._types import BrowserType


class BaseBrowserPlugin(ABC):
class BrowserPlugin(ABC):
    """An abstract base class for browser plugins.

    Browser plugins act as wrappers around browser automation tools like Playwright,
@@ -59,7 +59,7 @@ def max_open_pages_per_browser(self) -> int:
"""Return the maximum number of pages that can be opened in a single browser."""

    @abstractmethod
    async def __aenter__(self) -> BaseBrowserPlugin:
    async def __aenter__(self) -> BrowserPlugin:
        """Enter the context manager and initialize the browser plugin.

        Raises:
@@ -80,7 +80,7 @@ async def __aexit__(
"""

    @abstractmethod
    async def new_browser(self) -> BaseBrowserController:
    async def new_browser(self) -> BrowserController:
        """Create a new browser instance.

        Returns:
26 changes: 13 additions & 13 deletions src/crawlee/browsers/_browser_pool.py
@@ -14,15 +14,15 @@
from crawlee._utils.crypto import crypto_random_object_id
from crawlee._utils.docs import docs_group
from crawlee._utils.recurring_task import RecurringTask
from crawlee.browsers._base_browser_controller import BaseBrowserController
from crawlee.browsers._browser_controller import BrowserController
from crawlee.browsers._playwright_browser_plugin import PlaywrightBrowserPlugin
from crawlee.browsers._types import BrowserType, CrawleePage

if TYPE_CHECKING:
    from collections.abc import Mapping, Sequence
    from types import TracebackType

    from crawlee.browsers._base_browser_plugin import BaseBrowserPlugin
    from crawlee.browsers._browser_plugin import BrowserPlugin
    from crawlee.fingerprint_suite import FingerprintGenerator
    from crawlee.proxy_configuration import ProxyInfo

@@ -46,7 +46,7 @@ class BrowserPool:

    def __init__(
        self,
        plugins: Sequence[BaseBrowserPlugin] | None = None,
        plugins: Sequence[BrowserPlugin] | None = None,
        *,
        operation_timeout: timedelta = timedelta(seconds=15),
        browser_inactive_threshold: timedelta = timedelta(seconds=10),
@@ -72,10 +72,10 @@ def __init__(
        self._operation_timeout = operation_timeout
        self._browser_inactive_threshold = browser_inactive_threshold

        self._active_browsers = list[BaseBrowserController]()
        self._active_browsers = list[BrowserController]()
        """A list of browsers currently active and being used to open pages."""

        self._inactive_browsers = list[BaseBrowserController]()
        self._inactive_browsers = list[BrowserController]()
        """A list of browsers currently inactive and not being used to open new pages,
        but may still contain open pages."""

@@ -145,17 +145,17 @@ def with_default_plugin(
        return cls(plugins=[plugin], **kwargs)

    @property
    def plugins(self) -> Sequence[BaseBrowserPlugin]:
    def plugins(self) -> Sequence[BrowserPlugin]:
        """Return the browser plugins."""
        return self._plugins

    @property
    def active_browsers(self) -> Sequence[BaseBrowserController]:
    def active_browsers(self) -> Sequence[BrowserController]:
        """Return the active browsers in the pool."""
        return self._active_browsers

    @property
    def inactive_browsers(self) -> Sequence[BaseBrowserController]:
    def inactive_browsers(self) -> Sequence[BrowserController]:
        """Return the inactive browsers in the pool."""
        return self._inactive_browsers

@@ -230,7 +230,7 @@ async def new_page(
        self,
        *,
        page_id: str | None = None,
        browser_plugin: BaseBrowserPlugin | None = None,
        browser_plugin: BrowserPlugin | None = None,
        proxy_info: ProxyInfo | None = None,
    ) -> CrawleePage:
        """Open a new page in a browser using the specified or a random browser plugin.
@@ -272,7 +272,7 @@ async def new_page_with_each_plugin(self) -> Sequence[CrawleePage]:
    async def _get_new_page(
        self,
        page_id: str,
        plugin: BaseBrowserPlugin,
        plugin: BrowserPlugin,
        proxy_info: ProxyInfo | None,
    ) -> CrawleePage:
        """Internal method to initialize a new page in a browser using the specified plugin."""
@@ -301,16 +301,16 @@ async def _get_new_page(

    def _pick_browser_with_free_capacity(
        self,
        browser_plugin: BaseBrowserPlugin,
    ) -> BaseBrowserController | None:
        browser_plugin: BrowserPlugin,
    ) -> BrowserController | None:
        """Pick a browser with free capacity that matches the specified plugin."""
        for browser in self._active_browsers:
            if browser.has_free_capacity and browser.AUTOMATION_LIBRARY == browser_plugin.AUTOMATION_LIBRARY:
                return browser

        return None

    async def _launch_new_browser(self, plugin: BaseBrowserPlugin) -> BaseBrowserController:
    async def _launch_new_browser(self, plugin: BrowserPlugin) -> BrowserController:
        """Launch a new browser instance using the specified plugin."""
        browser = await plugin.new_browser()
        self._active_browsers.append(browser)
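For context, typical pool usage looks like the sketch below. `with_default_plugin` and `new_page` appear in the hunks above; the `headless` keyword and the `page` attribute on `CrawleePage` are assumptions:

```python
import asyncio

from crawlee.browsers import BrowserPool


async def main() -> None:
    # Build a pool with a single PlaywrightBrowserPlugin via the classmethod above.
    async with BrowserPool.with_default_plugin(headless=True) as pool:
        crawlee_page = await pool.new_page()
        await crawlee_page.page.goto('https://example.com')


asyncio.run(main())
```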
4 changes: 2 additions & 2 deletions src/crawlee/browsers/_playwright_browser_controller.py
@@ -10,7 +10,7 @@
from typing_extensions import override

from crawlee._utils.docs import docs_group
from crawlee.browsers._base_browser_controller import BaseBrowserController
from crawlee.browsers._browser_controller import BrowserController
from crawlee.browsers._types import BrowserType
from crawlee.fingerprint_suite import HeaderGenerator

@@ -28,7 +28,7 @@


@docs_group('Classes')
class PlaywrightBrowserController(BaseBrowserController):
class PlaywrightBrowserController(BrowserController):
    """Controller for managing Playwright browser instances and their pages.

    It provides methods to control browser instances, manage their pages, and handle context-specific
4 changes: 2 additions & 2 deletions src/crawlee/browsers/_playwright_browser_plugin.py
@@ -11,7 +11,7 @@
from crawlee import service_locator
from crawlee._utils.context import ensure_context
from crawlee._utils.docs import docs_group
from crawlee.browsers._base_browser_plugin import BaseBrowserPlugin
from crawlee.browsers._browser_plugin import BrowserPlugin
from crawlee.browsers._playwright_browser_controller import PlaywrightBrowserController

if TYPE_CHECKING:
@@ -25,7 +25,7 @@


@docs_group('Classes')
class PlaywrightBrowserPlugin(BaseBrowserPlugin):
class PlaywrightBrowserPlugin(BrowserPlugin):
    """A plugin for managing Playwright automation library.

    It is a plugin designed to manage browser instances using the Playwright automation library. It acts as a factory
@@ -59,7 +59,7 @@ class AbstractHttpCrawler(

    The `AbstractHttpCrawler` builds on top of the `BasicCrawler`, inheriting all its features. Additionally,
    it implements HTTP communication using HTTP clients. The class allows integration with any HTTP client
    that implements the `BaseHttpClient` interface, provided as an input parameter to the constructor.
    that implements the `HttpClient` interface, provided as an input parameter to the constructor.

    `AbstractHttpCrawler` is a generic class intended to be used with a specific parser for parsing HTTP responses
    and the expected type of `TCrawlingContext` available to the user function. Examples of specific versions include
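In practice this means a concrete subclass takes the client as a constructor argument, for example as sketched below; the `BeautifulSoupCrawler` and `HttpxHttpClient` names do not appear in this diff and are assumptions:

```python
from crawlee.crawlers import BeautifulSoupCrawler
from crawlee.http_clients import HttpxHttpClient

# Any implementation of the HttpClient interface can be swapped in here.
crawler = BeautifulSoupCrawler(http_client=HttpxHttpClient())
```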
20 changes: 12 additions & 8 deletions src/crawlee/crawlers/_adaptive_playwright/__init__.py
@@ -1,17 +1,21 @@
try:
    from ._rendering_type_predictor import RenderingType, RenderingTypePrediction, RenderingTypePredictor
except ImportError as exc:
    raise ImportError(
        "To import this, you need to install the 'adaptive-playwright' extra. "
        "For example, if you use pip, run `pip install 'crawlee[adaptive-playwright]'`.",
    ) from exc
from crawlee._utils.try_import import install_import_hook as _install_import_hook
from crawlee._utils.try_import import try_import as _try_import

from ._adaptive_playwright_crawler import AdaptivePlaywrightCrawler
# These imports have only mandatory dependencies, so they are imported directly.
from ._adaptive_playwright_crawling_context import (
    AdaptivePlaywrightCrawlingContext,
    AdaptivePlaywrightPreNavCrawlingContext,
)

_install_import_hook(__name__)

# The following imports are wrapped in try_import to handle optional dependencies,
# ensuring the module can still function even if these dependencies are missing.
with _try_import(__name__, 'RenderingType', 'RenderingTypePrediction', 'RenderingTypePredictor'):
    from ._rendering_type_predictor import RenderingType, RenderingTypePrediction, RenderingTypePredictor
with _try_import(__name__, 'AdaptivePlaywrightCrawler'):
    from ._adaptive_playwright_crawler import AdaptivePlaywrightCrawler

__all__ = [
    'AdaptivePlaywrightCrawler',
    'AdaptivePlaywrightCrawlingContext',
@@ -8,11 +8,7 @@
from crawlee import HttpHeaders
from crawlee._types import BasicCrawlingContext
from crawlee._utils.docs import docs_group
from crawlee.crawlers import (
    AbstractHttpParser,
    ParsedHttpCrawlingContext,
    PlaywrightCrawlingContext,
)
from crawlee.crawlers import AbstractHttpParser, ParsedHttpCrawlingContext, PlaywrightCrawlingContext

if TYPE_CHECKING:
    from collections.abc import Awaitable, Callable
12 changes: 6 additions & 6 deletions src/crawlee/crawlers/_basic/_basic_crawler.py
@@ -56,12 +56,12 @@
    from crawlee._types import ConcurrencySettings, HttpMethod, JsonSerializable
    from crawlee.configuration import Configuration
    from crawlee.events import EventManager
    from crawlee.http_clients import BaseHttpClient, HttpResponse
    from crawlee.http_clients import HttpClient, HttpResponse
    from crawlee.proxy_configuration import ProxyConfiguration, ProxyInfo
    from crawlee.request_loaders import RequestManager
    from crawlee.sessions import Session
    from crawlee.statistics import FinalStatistics
    from crawlee.storage_clients import BaseStorageClient
    from crawlee.storage_clients import StorageClient
    from crawlee.storage_clients.models import DatasetItemsListPage
    from crawlee.storages._dataset import ExportDataCsvKwargs, ExportDataJsonKwargs, GetDataKwargs, PushDataKwargs

@@ -80,7 +80,7 @@ class _BasicCrawlerOptions(TypedDict):
    event_manager: NotRequired[EventManager]
    """The event manager for managing events for the crawler and all its components."""

    storage_client: NotRequired[BaseStorageClient]
    storage_client: NotRequired[StorageClient]
    """The storage client for managing storages for the crawler and all its components."""

    request_manager: NotRequired[RequestManager]
@@ -92,7 +92,7 @@ class _BasicCrawlerOptions(TypedDict):
    proxy_configuration: NotRequired[ProxyConfiguration]
    """HTTP proxy configuration used when making requests."""

    http_client: NotRequired[BaseHttpClient]
    http_client: NotRequired[HttpClient]
    """HTTP client used by `BasicCrawlingContext.send_request` method."""

    max_request_retries: NotRequired[int]
Expand Down Expand Up @@ -202,11 +202,11 @@ def __init__(
        *,
        configuration: Configuration | None = None,
        event_manager: EventManager | None = None,
        storage_client: BaseStorageClient | None = None,
        storage_client: StorageClient | None = None,
        request_manager: RequestManager | None = None,
        session_pool: SessionPool | None = None,
        proxy_configuration: ProxyConfiguration | None = None,
        http_client: BaseHttpClient | None = None,
        http_client: HttpClient | None = None,
        request_handler: Callable[[TCrawlingContext], Awaitable[None]] | None = None,
        max_request_retries: int = 3,
        max_requests_per_crawl: int | None = None,
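A minimal sketch tying the renamed parameters together; the handler body, the untyped `context` parameter, and the `status_code` attribute are illustrative assumptions:

```python
import asyncio

from crawlee.crawlers import BasicCrawler


async def main() -> None:
    crawler = BasicCrawler(max_request_retries=3, max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def handler(context) -> None:
        # `context.send_request` is served by the configured (renamed) HttpClient.
        response = await context.send_request('https://crawlee.dev')
        context.log.info(f'status: {response.status_code}')

    await crawler.run(['https://crawlee.dev'])


asyncio.run(main())
```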