
Disambiguate http clients from crawlers/bots #374

Open
srstsavage opened this issue Oct 4, 2024 · 2 comments

srstsavage commented Oct 4, 2024

I was surprised to find HTTP clients like python-requests, Go-http-client, wget, curl, etc. included in the crawler list. While I understand that these tools can be abused, in our case a large portion of our legitimate web traffic consists of API requests made with HTTP clients like these.

For now I think I'll need to create an overriding allow list of patterns and remove matches from agents.Crawlers before processing, but it would be great to be able to disambiguate client tools/libraries based on a field in crawler-user-agents.json. Maybe just an is_client boolean, or a more generic tags string array which could contain "client" or similar? Any thoughts?

srstsavage commented:

I'm sure I missed a few, but it looks like the list isn't too long:

aiohttp
Apache-HttpClient
^curl
Go-http-client
http_get
httpx
libwww-perl
node-fetch
okhttp
python-requests
Python-urllib
[wW]get
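A minimal sketch of the overriding allow list described above, assuming the known crawler-user-agents.json shape where each entry carries a regex in its "pattern" field (the helper names here are illustrative, not part of any library):

```python
import re

# Patterns from the list above, as they appear in crawler-user-agents.json
# (each entry's "pattern" field is a regular expression).
CLIENT_PATTERNS = {
    "aiohttp", "Apache-HttpClient", "^curl", "Go-http-client", "http_get",
    "httpx", "libwww-perl", "node-fetch", "okhttp", "python-requests",
    "Python-urllib", "[wW]get",
}

def filter_clients(crawlers):
    """Drop entries whose pattern is a known generic HTTP client."""
    return [c for c in crawlers if c["pattern"] not in CLIENT_PATTERNS]

def is_crawler(user_agent, crawlers):
    """True if the user agent matches any remaining crawler pattern."""
    return any(re.search(c["pattern"], user_agent) for c in crawlers)

# Tiny illustrative subset of crawler-user-agents.json:
sample = [
    {"pattern": "Googlebot"},
    {"pattern": "python-requests"},
    {"pattern": "^curl"},
]
kept = filter_clients(sample)  # only the Googlebot entry survives
```

The downside is that this allow list has to be maintained separately and kept in sync with the upstream file, which is what a field in the JSON itself would avoid.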

monperrus (Owner) commented:

Completely see your point. I like the idea of having optional tags:

"tags": ["generic-client"]

Would you do a pull-request? Thanks!
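A sketch of how a consumer could use the proposed optional tags field; note the field is only proposed in this thread and does not exist in crawler-user-agents.json yet, so the data below is hypothetical:

```python
import json

# Hypothetical excerpt of crawler-user-agents.json once the proposed
# optional "tags" field exists (the second entry is marked as a client):
data = json.loads("""
[
  {"pattern": "Googlebot"},
  {"pattern": "python-requests", "tags": ["generic-client"]}
]
""")

def crawlers_only(entries):
    """Keep entries that are not tagged as generic HTTP clients."""
    return [e for e in entries if "generic-client" not in e.get("tags", [])]
```

Making the field optional keeps every existing entry valid, and a string array leaves room for other tags later without another schema change.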
