[Fix site]: WSJ gets a captcha since January; no links #486
Comments
I'm aware of this bug, but I haven't even begun to think about how to fix it. I'm open to patches, and to connections with people at WSJ I could consult.
Fair enough. Curious why you're gathering the pages yourself rather than getting them from the Internet Archive (which appears to have circumvented the WSJ's limitations).
It's probably a longer story than anyone wants to hear. I launched the site in 2012, independent of archive.org, as a self-hosted service funded by a Kickstarter campaign. At that time the Wayback Machine was not archiving the homepages of major sites with much frequency. There have been several evolutions since, with the current one hosting assets for free through IA's generous "collections" system. It would be possible to re-engineer the site to act as a supplement to Wayback's page captures, and perhaps I should move towards such a system. In the 12 years since I started, IA has ramped up its capture rate for big sites, though that's not always the case for lower-traffic sites.
No, that's very useful context, thank you! It feels like the idea I'd be most open to implementing, scraping IA, would be a larger re-engineering of the project than I can really take on.
Just flagging that Reuters has the same issue as of 2023-12-04 03:59:00.
Screenshot via https://palewi.re/docs/news-homepages/sites/wsj.html. AFAIK this has been going on since 2024-01-16 16:11:00 (based on the last non-empty links JSON). I wonder if this is something that could be rectified (and backfilled) by scraping Internet Archive screenshots, or by asking WSJ to allowlist you. I also wonder if the captcha is part of WSJ's efforts to block scraping for AI training data.
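For what it's worth, here is a rough sketch of what the Internet Archive backfill lookup might look like. It uses the Wayback Machine's public availability endpoint to find the capture closest to a given timestamp. The function names are hypothetical, nothing here comes from this project's code, and it only locates a capture URL; it does not fetch screenshots or extract links.

```python
from __future__ import annotations

import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Wayback Machine availability endpoint (public; returns JSON).
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_url(site_url: str, timestamp: str) -> str:
    """Build the lookup URL. `timestamp` is YYYYMMDDhhmmss (hypothetical helper)."""
    query = urlencode({"url": site_url, "timestamp": timestamp})
    return f"{AVAILABILITY_ENDPOINT}?{query}"


def closest_snapshot(payload: dict) -> str | None:
    """Return the closest capture's URL from a parsed response, or None if absent."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def fetch_closest_snapshot(site_url: str, timestamp: str) -> str | None:
    """Query the endpoint over the network and return the closest capture URL."""
    with urlopen(build_availability_url(site_url, timestamp), timeout=30) as resp:
        return closest_snapshot(json.load(resp))
```

Something like `fetch_closest_snapshot("wsj.com", "20240116161100")` would then be the starting point for each missing capture, assuming IA actually has a capture near that time.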
Have you circumvented captchas in other sites?
retaining this boilerplate from the template :)