As a flood of web scrapers descended on sites to collect LLM training data in early 2025, website administrators have been forced to strengthen their anti-scraping measures. These bots predominantly use outdated browser user agents, particularly Chrome versions, placing immense strain on website servers. The article details how the author identifies and blocks these scrapers by detecting suspicious browser versions, specifically highlighting issues with archival sites like archive.* that employ fake user agents and IP addresses. The author recommends using the more standardized archival service, archive.org. The piece reveals the real-world impact of AI training data collection on website operations, offering the tech community frontline experience in dealing with LLM training data scrapers.
Original Link:Hacker News









评论前必须登录!
立即登录 注册