Update user-agent string to a descriptive one for the tool #242
Thanks @TechnologyClassroom - I'm going to run a couple of local tests as well - if all goes well, I'll merge over the next couple of days.
goneall
left a comment
With this change, an additional 64 links are recorded as live - looks like it works as intended.
That's a great result!
Thanks for the fix! 🙏 The website still shows unavailable links, e.g. https://spdx.org/licenses/AGPL-3.0-only.html. I guess some update still needs to run 🤷
It takes a few days for the automated bans to go away. The old CI/CD will have to stop for a few days before this will work.
It will be updated on the website on the next release of the license list. I created this issue to track the update: spdx/license-list-XML#2982
This should fix #220, at least for the sites that I admin. Instead of tricking the server into thinking LicenseListPublisher is a browser, this would clearly identify the tool and lead admins to this issue tracker if they run into a problem.
Reasoning: Imperva recommends blocking all Chrome user-agents more than 3 years old on page 34 of their 2025 Bad Bot Report, and based on the data I am seeing, that is sound advice. AI startups with botnets running broken vibe-coded crawlers use Chrome user-agents with randomized version numbers. To continue scraping, this tool would either need to continually update the version number every 2-3 years or change strategy as this pull request suggests.
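For readers unfamiliar with the approach, here is a minimal sketch of what a descriptive user-agent looks like in practice. It is not the code from this pull request: the version number, the contact URL, and the helper names `build_user_agent` and `build_request` are hypothetical, and the `name/version (+URL)` layout is only a common crawler convention assumed here for illustration. The sketch builds the request object without sending anything over the network.

```python
import urllib.request

# Hypothetical values for illustration only; the actual string used by
# this pull request is not reproduced here.
TOOL_NAME = "LicenseListPublisher"
TOOL_VERSION = "0.0.0"  # placeholder version
CONTACT_URL = "https://example.org/issue-tracker"  # placeholder contact URL


def build_user_agent(name: str, version: str, contact_url: str) -> str:
    """Return a descriptive user-agent that names the tool and a contact URL."""
    return f"{name}/{version} (+{contact_url})"


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Prepare (but do not send) a HEAD request carrying the given user-agent."""
    return urllib.request.Request(
        url,
        headers={"User-Agent": user_agent},
        method="HEAD",
    )


USER_AGENT = build_user_agent(TOOL_NAME, TOOL_VERSION, CONTACT_URL)
request = build_request("https://spdx.org/licenses/AGPL-3.0-only.html", USER_AGENT)

# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```

An admin who sees such a string in server logs can tell which tool made the request and where to report a problem, which a spoofed browser string does not allow.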