When using `EnqueueStrategy.SAME_HOSTNAME`, I noticed it does not work properly on non-www URLs.
In the debugger I noticed that `origin` is passed to `_check_enqueue_strategy`, but `origin` is taken from `context.request.loaded_url` if available.
So every URL that is checked will mismatch because its hostname differs from the hostname of `origin`.
I tested this with multiple URLs, with and without the www prefix, and got the same behaviour.
Changing the line to `origin = context.request.url` fixes this issue, but I have no idea what implications this would have on the other code.
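To illustrate the mismatch, here is a minimal sketch. It is not Crawlee's actual implementation: the helper `same_hostname` and the `example.com` URLs are hypothetical, and it assumes the non-www request was redirected to the www host, so that `loaded_url` differs from `url`.

```python
from urllib.parse import urlparse


def same_hostname(origin_url: str, target_url: str) -> bool:
    """Return True when both URLs have exactly the same hostname."""
    return urlparse(origin_url).hostname == urlparse(target_url).hostname


# Hypothetical values: the request was made without "www" and the server
# redirected it to the "www" host.
request_url = "https://example.com/"
loaded_url = "https://www.example.com/"
link = "https://example.com/about"

# Using the redirected URL as origin: the hostnames differ, so the link is rejected.
mismatch = same_hostname(loaded_url, link)

# Using the original request URL as origin: the hostnames are equal.
match = same_hostname(request_url, link)

print(mismatch, match)
```

Under these assumptions the first check fails and the second passes, which matches what I see when switching `origin` to `context.request.url`.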
I use the `PlaywrightCrawler` in my code with `context.enqueue_links`.