
The context.page.wait_for_selector() method does not properly raise a timeout. #976

Closed
matecsaj opened this issue Feb 11, 2025 · 3 comments
Labels
t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@matecsaj
Contributor

The wrong except block is triggered when a timeout occurs during wait_for_selector.

Note where the all-caps words “EXCEPTION” and “TIMEOUT” appear in the code.
Search for “TIMEOUT” in the log—it is missing! Instead, “EXCEPTION” appears.

A timeout should be handled by:

except TimeoutError as e:

However, it is incorrectly caught by:

except Exception as e:

code

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        try:
            # await context.page.wait_for_load_state("networkidle")
            # await context.page.wait_for_load_state("domcontentloaded")
            # await context.page.locator("label:text('PinsideID')").wait_for(state="visible")
            await context.page.wait_for_selector("div.grouped-label label:text('PinsideID')")
            data = pinside_extract_data(await context.page.content(), context.request.url, context.response.url)
        except TimeoutError as e:
            await context.add_requests([Request.from_url(url=context.request.url, always_enqueue=True)])
            context.log.warning(f"{context.request.url} TIMEOUT {e} PROXY {context.proxy_info}")
        except ValueError as e:
            await context.add_requests([Request.from_url(url=context.request.url, always_enqueue=True)])
            context.log.warning(f"{context.request.url} VALUE {e} PROXY {context.proxy_info}" )
        except Exception as e:
            await context.add_requests([Request.from_url(url=context.request.url, always_enqueue=True)])
            context.log.warning(f"{context.request.url} EXCEPTION {type(e).__name__}: {e} PROXY {context.proxy_info}")
        else:
            context.log.info(f"{context.request.url} PROCESSED as {context.response.url}")
            await context.push_data(data)

log

[crawlee.crawlers._playwright._playwright_crawler] INFO  https://pinside.com/pinball/machine/7 PROCESSED as https://pinside.com/pinball/machine/agents-777
[crawlee.crawlers._playwright._playwright_crawler] WARN  https://pinside.com/pinball/machine/5 EXCEPTION TimeoutError: Page.wait_for_selector: Timeout 30000ms exceeded.
Call log:
  - waiting for locator("div.grouped-label label:text('PinsideID')") to be visible
 PROXY ProxyInfo(url='http://52.170.253.201:3128', scheme='http', hostname='52.170.253.201', port=3128, username='', password='', session_id=None, proxy_tier=None)
[crawlee.crawlers._playwright._playwright_crawler] INFO  https://pinside.com/pinball/machine/11 PROCESSED as https://pinside.com/pinball/machine/ali
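
One way (a sketch; the exact module path may differ between Playwright versions) to make a log line like the one above unambiguous is to include the exception's module alongside its name, since Playwright's TimeoutError and Python's builtin share the same class name:

from playwright.async_api import TimeoutError as PlaywrightTimeoutError

e = PlaywrightTimeoutError("Timeout 30000ms exceeded.")
# Logging the module as well as the class name tells Playwright's
# TimeoutError apart from the builtin one.
print(f"{type(e).__module__}.{type(e).__name__}: {e}")
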
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Feb 11, 2025
@Pijukatel
Contributor

Hi, could you please double-check that you are importing the correct TimeoutError? It is not the "normal" Python TimeoutError; it should be the Playwright TimeoutError.

As far as I can see, their TimeoutError does not inherit from the normal Python one, so I am guessing that is the problem?
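
A minimal check (a sketch, assuming the playwright package is installed) that shows the hierarchy described here:

from playwright.async_api import Error as PlaywrightError
from playwright.async_api import TimeoutError as PlaywrightTimeoutError

# Playwright's TimeoutError extends Playwright's own Error class...
print(issubclass(PlaywrightTimeoutError, PlaywrightError))  # True
# ...but not Python's builtin TimeoutError, so "except TimeoutError" never fires.
print(issubclass(PlaywrightTimeoutError, TimeoutError))     # False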

@matecsaj
Contributor Author

Thanks for the suggestion—I confirmed your theory. The evidence is below.

Observation

Crawlee raises a mix of standard Python exceptions and third-party library exceptions.

Consequences
1. Users must learn the underlying libraries to properly handle exceptions in their code.
2. Maintainers must consistently use the same third-party libraries to avoid breaking user code.

Does this present an opportunity to make Crawlee easier to use and maintain, or did you just teach me something new about Playwright? I’m unsure whether to close this issue, so I’ll leave that decision to Crawlee’s designer.

code

# Standard imports
from typing import Optional, override

# 3rd party imports
from camoufox import AsyncNewBrowser
from crawlee.browsers import (
    BrowserPool,
    PlaywrightBrowserController,
    PlaywrightBrowserPlugin,
)
from crawlee import Request
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration
from crawlee.request_loaders import RequestList
from playwright.sync_api import TimeoutError as PlaywrightTimeoutError

# Local imports
from src.wat_crawlee.pinside_common import pinside_get_start_urls, pinside_extract_data
from src.wat_crawlee.proxies import get_proxies_monosans


class CamoufoxPlugin(PlaywrightBrowserPlugin):
    """A browser plugin that uses Camoufox browser, but otherwise keeps the functionality of
    PlaywrightBrowserPlugin."""

    @override
    async def new_browser(self) -> PlaywrightBrowserController:
        if not self._playwright:
            raise RuntimeError("Playwright browser plugin is not initialized.")

        return PlaywrightBrowserController(
            browser=await AsyncNewBrowser(
                self._playwright, **self._browser_launch_options
            ),
            max_open_pages_per_browser=1,  # Increase, if camoufox can handle it in your use case.
            header_generator=None,  # This turns off the crawlee header_generation. Camoufox has its own.
            use_incognito_pages=True,
        )


async def crawl_pinside(
    max_requests_per_crawl: Optional[int],
) -> None:

    # PlaywrightCrawler crawls the web using a headless browser controlled by the Playwright library.
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=max_requests_per_crawl,
        browser_pool=BrowserPool(plugins=[CamoufoxPlugin()]),
        proxy_configuration=ProxyConfiguration(proxy_urls=await get_proxies_monosans()),
        request_manager=await RequestList(pinside_get_start_urls()).to_tandem(),
        use_session_pool=False,  # don't use sessions, start every fetch fresh
    )

    # Define a request handler to process each crawled page and attach it to the crawler using a decorator.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        try:
            # await context.page.wait_for_load_state("networkidle")
            # await context.page.wait_for_load_state("domcontentloaded")
            # await context.page.locator("label:text('PinsideID')").wait_for(state="visible")
            await context.page.wait_for_selector(
                "div.grouped-label label:text('PinsideID')"
            )
            data = pinside_extract_data(
                await context.page.content(), context.request.url, context.response.url
            )
        except TimeoutError as e:
            await context.add_requests(
                [Request.from_url(url=context.request.url, always_enqueue=True)]
            )
            context.log.warning(
                f"{context.request.url} TIMEOUT {e} PROXY {context.proxy_info}"
            )
        except PlaywrightTimeoutError as e:
            await context.add_requests(
                [Request.from_url(url=context.request.url, always_enqueue=True)]
            )
            context.log.warning(
                f"{context.request.url} PLAYWRIGHT_TIMEOUT {e} PROXY {context.proxy_info}"
            )
        except ValueError as e:
            await context.add_requests(
                [Request.from_url(url=context.request.url, always_enqueue=True)]
            )
            context.log.warning(
                f"{context.request.url} VALUE {e} PROXY {context.proxy_info}"
            )
        except Exception as e:
            await context.add_requests(
                [Request.from_url(url=context.request.url, always_enqueue=True)]
            )
            context.log.warning(
                f"{context.request.url} EXCEPTION {type(e).__name__}: {e} PROXY {context.proxy_info}"
            )
        else:
            context.log.info(
                f"{context.request.url} PROCESSED as {context.response.url}"
            )
            await context.push_data(data)

    # Start the crawl.
    await crawler.run()

    # Export the entire dataset to a JSON file.
    await crawler.export_data("pinside.json")

log

[crawlee.crawlers._playwright._playwright_crawler] INFO  https://pinside.com/pinball/machine/8 PROCESSED as https://pinside.com/pinball/machine/airborne-avenger
[crawlee.crawlers._playwright._playwright_crawler] WARN  https://pinside.com/pinball/machine/5 PLAYWRIGHT_TIMEOUT Page.wait_for_selector: Timeout 30000ms exceeded.
Call log:
  - waiting for locator("div.grouped-label label:text('PinsideID')") to be visible
 PROXY ProxyInfo(url='http://98.8.195.160:443', scheme='http', hostname='98.8.195.160', port=443, username='', password='', session_id=None, proxy_tier=None)
[crawlee.crawlers._playwright._playwright_crawler] INFO  https://pinside.com/pinball/machine/3 PROCESSED as https://pinside.com/pinball/machine/addams-family-gold-special-collectors-edition

@janbuchar
Collaborator

Observation

Crawlee raises a mix of standard Python exceptions and third-party library exceptions.

That is not exactly the case: Crawlee does not raise that timeout error; in this case, it only acts as a wrapper around Playwright. In the request handler, you call methods on a Page instance owned by Playwright, and those raise Playwright errors.

Consequences
1. Users must learn the underlying libraries to properly handle exceptions in their code.

Yes, to work with Playwright, you need to understand it. Same goes for... anything, I guess.

  2. Maintainers must consistently use the same third-party libraries to avoid breaking user code.

I'm not sure I follow - care to elaborate?

Does this present an opportunity to make Crawlee easier to use and maintain, or did you just teach me something new about Playwright? I’m unsure whether to close this issue, so I’ll leave that decision to Crawlee’s designer.

Since we don't intend to wrap the Page class just to replace its exceptions with our own or with standard ones (which would require some work and might confuse users who expect it to behave like it does now), I'm closing this. But let us know if you have any other questions or insights!
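
For reference, a minimal sketch (not an official Crawlee recommendation) of the pattern this thread converges on, covering both the builtin and the Playwright timeout in a single except clause inside a request handler shaped like the one posted above:

from playwright.async_api import TimeoutError as PlaywrightTimeoutError

try:
    await context.page.wait_for_selector("div.grouped-label label:text('PinsideID')")
except (TimeoutError, PlaywrightTimeoutError) as e:
    # Retry the same URL and record which timeout class was actually raised.
    await context.add_requests([Request.from_url(url=context.request.url, always_enqueue=True)])
    context.log.warning(f"{context.request.url} TIMEOUT {type(e).__name__} {e}")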
