Website Anti-Scraping Upgrades: Combating the Surge in LLM Training Data Bots

In early 2025, a flood of web scrapers collecting LLM training data descended on websites, forcing administrators to strengthen their anti-scraping measures. These bots predominantly present outdated browser user agents, particularly old Chrome versions, and place immense strain on web servers. The article details how the author identifies and blocks these scrapers by detecting suspicious browser versions, and it specifically highlights problems with archival sites such as archive.* that use fake user agents and IP addresses. The author recommends using the more standardized archival service, archive.org, instead. The piece shows the real-world impact of AI training data collection on website operations and offers the tech community frontline experience in dealing with LLM training data scrapers.
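The idea of flagging suspicious browser versions can be illustrated with a minimal sketch. This is not the author's actual implementation, which the summary does not describe; the function names and the version cutoff `MIN_PLAUSIBLE_CHROME_MAJOR` are hypothetical and would need tuning against real traffic.

```python
import re
from typing import Optional

# Hypothetical cutoff (an assumption, not from the article): Chrome major
# versions below this are treated as implausibly old for a real visitor.
MIN_PLAUSIBLE_CHROME_MAJOR = 100

# Matches the major version in tokens such as "Chrome/70.0.3538.77".
# It also matches "HeadlessChrome/..." because no word boundary is required.
_CHROME_RE = re.compile(r"Chrome/(\d+)\.")


def chrome_major_version(user_agent: str) -> Optional[int]:
    """Return the Chrome major version claimed by a User-Agent, or None."""
    match = _CHROME_RE.search(user_agent)
    return int(match.group(1)) if match else None


def looks_like_outdated_chrome(
    user_agent: str, min_major: int = MIN_PLAUSIBLE_CHROME_MAJOR
) -> bool:
    """True if the User-Agent claims a Chrome version older than min_major."""
    major = chrome_major_version(user_agent)
    return major is not None and major < min_major


if __name__ == "__main__":
    old_ua = (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
    )
    # A server could answer such requests with an HTTP 403 or a challenge page.
    print(looks_like_outdated_chrome(old_ua))
```

A check like this only inspects what the client claims about itself, so it would catch careless scrapers rather than ones that send a current user agent string; it is best treated as one signal among several.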

Original link: Hacker News
