The context.page.wait_for_selector() method does not properly raise a timeout. #976
Comments
Hi, could you please double check that you are importing the correct TimeoutError? As far as I can see, their TimeoutError does not inherit from the normal Python one, so I am guessing that is the problem?
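For reference, the inheritance claim is easy to verify in a REPL; the snippet below assumes only that Playwright is installed:

from playwright.async_api import TimeoutError as PlaywrightTimeoutError

# Playwright's TimeoutError derives from Playwright's own Error base class,
# not from Python's builtin TimeoutError, so `except TimeoutError` misses it.
print(issubclass(PlaywrightTimeoutError, TimeoutError))  # False
print(PlaywrightTimeoutError.__mro__)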
Thanks for the suggestion—I confirmed your theory. The evidence is below.

Observation

Crawlee raises a mix of standard Python exceptions and third-party library exceptions.

Consequences

Does this present an opportunity to make Crawlee easier to use and maintain, or did you just teach me something new about Playwright? I'm unsure whether to close this issue, so I'll leave that decision to Crawlee's designer.

code

# Standard imports
from typing import Optional, override

# 3rd party imports
from camoufox import AsyncNewBrowser
from crawlee.browsers import (
    BrowserPool,
    PlaywrightBrowserController,
    PlaywrightBrowserPlugin,
)
from crawlee import Request
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration
from crawlee.request_loaders import RequestList
# Playwright's own TimeoutError; as discussed above, it does not inherit from
# Python's builtin TimeoutError.
from playwright.sync_api import TimeoutError as PlaywrightTimeoutError

# Local imports
from src.wat_crawlee.pinside_common import pinside_get_start_urls, pinside_extract_data
from src.wat_crawlee.proxies import get_proxies_monosans


class CamoufoxPlugin(PlaywrightBrowserPlugin):
    """A browser plugin that uses the Camoufox browser, but otherwise keeps the
    functionality of PlaywrightBrowserPlugin."""

    @override
    async def new_browser(self) -> PlaywrightBrowserController:
        if not self._playwright:
            raise RuntimeError("Playwright browser plugin is not initialized.")
        return PlaywrightBrowserController(
            browser=await AsyncNewBrowser(
                self._playwright, **self._browser_launch_options
            ),
            max_open_pages_per_browser=1,  # Increase if Camoufox can handle it in your use case.
            header_generator=None,  # Turns off Crawlee's header generation; Camoufox has its own.
            use_incognito_pages=True,
        )


async def crawl_pinside(
    max_requests_per_crawl: Optional[int],
) -> None:
    # PlaywrightCrawler crawls the web using a headless browser controlled by the
    # Playwright library.
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=max_requests_per_crawl,
        browser_pool=BrowserPool(plugins=[CamoufoxPlugin()]),
        proxy_configuration=ProxyConfiguration(proxy_urls=await get_proxies_monosans()),
        request_manager=await RequestList(pinside_get_start_urls()).to_tandem(),
        use_session_pool=False,  # Don't use sessions; start every fetch fresh.
    )

    # Define a request handler to process each crawled page and attach it to the
    # crawler using a decorator.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        try:
            # await context.page.wait_for_load_state("networkidle")
            # await context.page.wait_for_load_state("domcontentloaded")
            # await context.page.locator("label:text('PinsideID')").wait_for(state="visible")
            await context.page.wait_for_selector(
                "div.grouped-label label:text('PinsideID')"
            )
            data = pinside_extract_data(
                await context.page.content(), context.request.url, context.response.url
            )
        except TimeoutError as e:
            # Python's builtin TimeoutError; a Playwright timeout never lands here
            # because Playwright's TimeoutError does not subclass it.
            await context.add_requests(
                [Request.from_url(url=context.request.url, always_enqueue=True)]
            )
            context.log.warning(
                f"{context.request.url} TIMEOUT {e} PROXY {context.proxy_info}"
            )
        except PlaywrightTimeoutError as e:
            # Playwright's TimeoutError; per the discussion, this is the block that
            # catches a wait_for_selector timeout.
            await context.add_requests(
                [Request.from_url(url=context.request.url, always_enqueue=True)]
            )
            context.log.warning(
                f"{context.request.url} PLAYWRIGHT_TIMEOUT {e} PROXY {context.proxy_info}"
            )
        except ValueError as e:
            await context.add_requests(
                [Request.from_url(url=context.request.url, always_enqueue=True)]
            )
            context.log.warning(
                f"{context.request.url} VALUE {e} PROXY {context.proxy_info}"
            )
        except Exception as e:
            # Catch-all; logs EXCEPTION together with the concrete exception type.
            await context.add_requests(
                [Request.from_url(url=context.request.url, always_enqueue=True)]
            )
            context.log.warning(
                f"{context.request.url} EXCEPTION {type(e).__name__}: {e} PROXY {context.proxy_info}"
            )
        else:
            context.log.info(
                f"{context.request.url} PROCESSED as {context.response.url}"
            )
            await context.push_data(data)

    # Start the crawl.
    await crawler.run()
    # Export the entire dataset to a JSON file.
    await crawler.export_data("pinside.json")

log
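If the mixture of exception types is the pain point, one coping pattern is a single except clause that names every timeout flavor. A minimal sketch (the helper name guarded_wait and the 5-second timeout are illustrative, not part of the project code):

from playwright.async_api import Page, TimeoutError as PlaywrightTimeoutError

async def guarded_wait(page: Page, selector: str) -> bool:
    """Return True if the selector appeared, False on any timeout flavor."""
    try:
        await page.wait_for_selector(selector, timeout=5_000)  # milliseconds
    except (TimeoutError, PlaywrightTimeoutError):
        # One clause covers both Python's builtin TimeoutError (asyncio's is an
        # alias of it on 3.11+) and Playwright's own TimeoutError.
        return False
    return True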
That is not exactly the case - Crawlee does not raise that timeout error - in this case, it only acts as a wrapper around Playwright. In the request handler, you call functions on a Playwright Page object directly, so any timeout raised there comes from Playwright itself.
Yes, to work with Playwright, you need to understand it. Same goes for... anything, I guess.
I'm not sure I follow - care to elaborate?
Since we don't intend to wrap the Playwright API, catching Playwright's own exceptions in the request handler is the intended approach.
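In practice, that boils down to a handler shaped roughly like this (a sketch; the selector, timeout, and handler name are placeholders):

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from playwright.async_api import TimeoutError as PlaywrightTimeoutError

crawler = PlaywrightCrawler()

@crawler.router.default_handler
async def handler(context: PlaywrightCrawlingContext) -> None:
    try:
        # context.page is a Playwright Page, so a timeout here is Playwright's.
        await context.page.wait_for_selector("#placeholder", timeout=5_000)
    except PlaywrightTimeoutError:
        context.log.warning(f"{context.request.url} timed out")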
The wrong except block is triggered when a timeout occurs during wait_for_selector.
Note where the all-caps words “EXCEPTION” and “TIMEOUT” appear in the code.
Search for “TIMEOUT” in the log—it is missing! Instead, “EXCEPTION” appears.
A timeout should be handled by the except TimeoutError block, which logs the word TIMEOUT. However, it is incorrectly caught by the generic except Exception block, which logs EXCEPTION instead.
code
log
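The dispatch behavior can be reproduced without Playwright at all; the stand-in classes below are hypothetical and mirror only the inheritance structure described above:

class LibError(Exception):
    """Stand-in for Playwright's Error base class."""

class LibTimeoutError(LibError):
    """Stand-in for Playwright's TimeoutError (no builtin TimeoutError ancestry)."""

try:
    raise LibTimeoutError("Timeout 30000ms exceeded.")
except TimeoutError as e:
    print("TIMEOUT", e)  # never reached: LibTimeoutError is not a builtin TimeoutError
except Exception as e:
    print("EXCEPTION", type(e).__name__, e)  # this prints, matching the reported log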