
The context.page.wait_for_selector() method does not properly raise a timeout. #976

Closed
matecsaj opened this issue Feb 11, 2025 · 3 comments
Labels
t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@matecsaj
Contributor

The wrong except block is triggered when a timeout occurs during wait_for_selector.

Note where the all-caps words “EXCEPTION” and “TIMEOUT” appear in the code.
Search for “TIMEOUT” in the log—it is missing! Instead, “EXCEPTION” appears.

A timeout should be handled by:

except TimeoutError as e:

However, it is incorrectly caught by:

except Exception as e:

code

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        try:
            # await context.page.wait_for_load_state("networkidle")
            # await context.page.wait_for_load_state("domcontentloaded")
            # await context.page.locator("label:text('PinsideID')").wait_for(state="visible")
            await context.page.wait_for_selector("div.grouped-label label:text('PinsideID')")
            data = pinside_extract_data(await context.page.content(), context.request.url, context.response.url)
        except TimeoutError as e:
            await context.add_requests([Request.from_url(url=context.request.url, always_enqueue=True)])
            context.log.warning(f"{context.request.url} TIMEOUT {e} PROXY {context.proxy_info}")
        except ValueError as e:
            await context.add_requests([Request.from_url(url=context.request.url, always_enqueue=True)])
            context.log.warning(f"{context.request.url} VALUE {e} PROXY {context.proxy_info}" )
        except Exception as e:
            await context.add_requests([Request.from_url(url=context.request.url, always_enqueue=True)])
            context.log.warning(f"{context.request.url} EXCEPTION {type(e).__name__}: {e} PROXY {context.proxy_info}")
        else:
            context.log.info(f"{context.request.url} PROCESSED as {context.response.url}")
            await context.push_data(data)

log

[crawlee.crawlers._playwright._playwright_crawler] INFO  https://pinside.com/pinball/machine/7 PROCESSED as https://pinside.com/pinball/machine/agents-777
[crawlee.crawlers._playwright._playwright_crawler] WARN  https://pinside.com/pinball/machine/5 EXCEPTION TimeoutError: Page.wait_for_selector: Timeout 30000ms exceeded.
Call log:
  - waiting for locator("div.grouped-label label:text('PinsideID')") to be visible
 PROXY ProxyInfo(url='http://52.170.253.201:3128', scheme='http', hostname='52.170.253.201', port=3128, username='', password='', session_id=None, proxy_tier=None)
[crawlee.crawlers._playwright._playwright_crawler] INFO  https://pinside.com/pinball/machine/11 PROCESSED as https://pinside.com/pinball/machine/ali
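
One way (a sketch; the exact module path may differ between Playwright versions) to make a log line like the one above unambiguous is to include the exception's module alongside its name, since Playwright's TimeoutError and Python's builtin share the same class name:

from playwright.async_api import TimeoutError as PlaywrightTimeoutError

e = PlaywrightTimeoutError("Timeout 30000ms exceeded.")
# Logging the module as well as the class name tells Playwright's
# TimeoutError apart from the builtin one.
print(f"{type(e).__module__}.{type(e).__name__}: {e}")
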
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Feb 11, 2025
@Pijukatel
Contributor

Hi, could you please double-check that you are importing the correct TimeoutError? It is not the "normal" Python TimeoutError; it should be the Playwright TimeoutError.

As far as I can see, their TimeoutError does not inherit from the normal Python one, so I am guessing that is the problem?
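
A minimal check (a sketch, assuming the playwright package is installed) that shows the hierarchy described here:

from playwright.async_api import Error as PlaywrightError
from playwright.async_api import TimeoutError as PlaywrightTimeoutError

# Playwright's TimeoutError extends Playwright's own Error class...
print(issubclass(PlaywrightTimeoutError, PlaywrightError))  # True
# ...but not Python's builtin TimeoutError, so "except TimeoutError" never fires.
print(issubclass(PlaywrightTimeoutError, TimeoutError))     # False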

@matecsaj
Contributor Author

Thanks for the suggestion—I confirmed your theory. The evidence is below.

Observation

Crawlee raises a mix of standard Python exceptions and third-party library exceptions.

Consequences
1. Users must learn the underlying libraries to properly handle exceptions in their code.
2. Maintainers must consistently use the same third-party libraries to avoid breaking user code.

Does this present an opportunity to make Crawlee easier to use and maintain, or did you just teach me something new about Playwright? I’m unsure whether to close this issue, so I’ll leave that decision to Crawlee’s designer.

code

# Standard imports
from typing import Optional, override

# 3rd party imports
from camoufox import AsyncNewBrowser
from crawlee.browsers import (
    BrowserPool,
    PlaywrightBrowserController,
    PlaywrightBrowserPlugin,
)
from crawlee import Request
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration
from crawlee.request_loaders import RequestList
from playwright.sync_api import TimeoutError as PlaywrightTimeoutError

# Local imports
from src.wat_crawlee.pinside_common import pinside_get_start_urls, pinside_extract_data
from src.wat_crawlee.proxies import get_proxies_monosans


class CamoufoxPlugin(PlaywrightBrowserPlugin):
    """A browser plugin that uses Camoufox browser, but otherwise keeps the functionality of
    PlaywrightBrowserPlugin."""

    @override
    async def new_browser(self) -> PlaywrightBrowserController:
        if not self._playwright:
            raise RuntimeError("Playwright browser plugin is not initialized.")

        return PlaywrightBrowserController(
            browser=await AsyncNewBrowser(
                self._playwright, **self._browser_launch_options
            ),
            max_open_pages_per_browser=1,  # Increase, if camoufox can handle it in your use case.
            header_generator=None,  # This turns off the crawlee header_generation. Camoufox has its own.
            use_incognito_pages=True,
        )


async def crawl_pinside(
    max_requests_per_crawl: Optional[int],
) -> None:

    # PlaywrightCrawler crawls the web using a headless browser controlled by the Playwright library.
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=max_requests_per_crawl,
        browser_pool=BrowserPool(plugins=[CamoufoxPlugin()]),
        proxy_configuration=ProxyConfiguration(proxy_urls=await get_proxies_monosans()),
        request_manager=await RequestList(pinside_get_start_urls()).to_tandem(),
        use_session_pool=False,  # don't use sessions, start every fetch fresh
    )

    # Define a request handler to process each crawled page and attach it to the crawler using a decorator.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        try:
            # await context.page.wait_for_load_state("networkidle")
            # await context.page.wait_for_load_state("domcontentloaded")
            # await context.page.locator("label:text('PinsideID')").wait_for(state="visible")
            await context.page.wait_for_selector(
                "div.grouped-label label:text('PinsideID')"
            )
            data = pinside_extract_data(
                await context.page.content(), context.request.url, context.response.url
            )
        except TimeoutError as e:
            await context.add_requests(
                [Request.from_url(url=context.request.url, always_enqueue=True)]
            )
            context.log.warning(
                f"{context.request.url} TIMEOUT {e} PROXY {context.proxy_info}"
            )
        except PlaywrightTimeoutError as e:
            await context.add_requests(
                [Request.from_url(url=context.request.url, always_enqueue=True)]
            )
            context.log.warning(
                f"{context.request.url} PLAYWRIGHT_TIMEOUT {e} PROXY {context.proxy_info}"
            )
        except ValueError as e:
            await context.add_requests(
                [Request.from_url(url=context.request.url, always_enqueue=True)]
            )
            context.log.warning(
                f"{context.request.url} VALUE {e} PROXY {context.proxy_info}"
            )
        except Exception as e:
            await context.add_requests(
                [Request.from_url(url=context.request.url, always_enqueue=True)]
            )
            context.log.warning(
                f"{context.request.url} EXCEPTION {type(e).__name__}: {e} PROXY {context.proxy_info}"
            )
        else:
            context.log.info(
                f"{context.request.url} PROCESSED as {context.response.url}"
            )
            await context.push_data(data)

    # Start the crawl.
    await crawler.run()

    # Export the entire dataset to a JSON file.
    await crawler.export_data("pinside.json")

log

[crawlee.crawlers._playwright._playwright_crawler] INFO  https://pinside.com/pinball/machine/8 PROCESSED as https://pinside.com/pinball/machine/airborne-avenger
[crawlee.crawlers._playwright._playwright_crawler] WARN  https://pinside.com/pinball/machine/5 PLAYWRIGHT_TIMEOUT Page.wait_for_selector: Timeout 30000ms exceeded.
Call log:
  - waiting for locator("div.grouped-label label:text('PinsideID')") to be visible
 PROXY ProxyInfo(url='http://98.8.195.160:443', scheme='http', hostname='98.8.195.160', port=443, username='', password='', session_id=None, proxy_tier=None)
[crawlee.crawlers._playwright._playwright_crawler] INFO  https://pinside.com/pinball/machine/3 PROCESSED as https://pinside.com/pinball/machine/addams-family-gold-special-collectors-edition

@janbuchar
Collaborator

Observation

Crawlee raises a mix of standard Python exceptions and third-party library exceptions.

That is not exactly the case: Crawlee does not raise that timeout error; in this case, it only acts as a wrapper around Playwright. In the request handler, you call methods on a Page instance owned by Playwright, and those raise Playwright errors.

Consequences
1. Users must learn the underlying libraries to properly handle exceptions in their code.

Yes, to work with Playwright, you need to understand it. Same goes for... anything, I guess.

  2. Maintainers must consistently use the same third-party libraries to avoid breaking user code.

I'm not sure I follow - care to elaborate?

Does this present an opportunity to make Crawlee easier to use and maintain, or did you just teach me something new about Playwright? I’m unsure whether to close this issue, so I’ll leave that decision to Crawlee’s designer.

Since we don't intend to wrap the Page class just to replace its exceptions with our own or with standard ones (which would require some work and might confuse users who expect it to behave like it does now), I'm closing this. But let us know if you have any other questions or insights!
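
For reference, a minimal sketch (not an official Crawlee recommendation) of the pattern this thread converges on, covering both the builtin and the Playwright timeout in a single except clause inside a request handler shaped like the one posted above:

from playwright.async_api import TimeoutError as PlaywrightTimeoutError

try:
    await context.page.wait_for_selector("div.grouped-label label:text('PinsideID')")
except (TimeoutError, PlaywrightTimeoutError) as e:
    # Retry the same URL and record which timeout class was actually raised.
    await context.add_requests([Request.from_url(url=context.request.url, always_enqueue=True)])
    context.log.warning(f"{context.request.url} TIMEOUT {type(e).__name__} {e}")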
