It's such madness that all this resource usage is for LLMs that are barely useful (at least for me). So many billions in resources wasted on machine learning systems that merely mimic human speech; they aren't intelligent in any sense.
Now we see models showing the user that they're "thinking", as if they were intelligent agents, when really ploys like that exist to prop up the AI sector's stock price.
Surely LLMs in this form can't be the future of AGI?
LLMs are much better at summarizing textual content, or extracting specific pieces of information from it, than at answering complicated/niche queries from their weights alone, and that's likely (part of) what is happening here (i.e. fetching Wikipedia articles and cross-checking before answering).
Arguably this is using Wikipedia exactly for what it's designed for, although in an unexpectedly resource-intensive way. I bet just adding a web query cache for the most frequently visited URLs on the LLM provider's side could mitigate most of the negative effects here.
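To make the cache idea concrete, here's a minimal sketch: a TTL + LRU cache sitting in front of whatever fetcher the provider uses, so hot pages (popular Wikipedia articles, say) are served locally instead of re-fetched on every request. All names here are illustrative, not any provider's actual API.

```python
# Hypothetical sketch of a TTL'd, size-bounded URL cache.
import time
import urllib.request
from collections import OrderedDict


def _default_fetcher(url: str) -> bytes:
    with urllib.request.urlopen(url) as resp:
        return resp.read()


class UrlCache:
    def __init__(self, fetcher=_default_fetcher, max_entries=10_000,
                 ttl_seconds=3600):
        self._fetcher = fetcher
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._store = OrderedDict()  # url -> (fetched_at, body)

    def fetch(self, url: str) -> bytes:
        now = time.time()
        hit = self._store.get(url)
        if hit is not None and now - hit[0] < self._ttl:
            self._store.move_to_end(url)  # mark as recently used
            return hit[1]
        body = self._fetcher(url)         # cache miss (or stale): go upstream
        self._store[url] = (now, body)
        self._store.move_to_end(url)
        if len(self._store) > self._max_entries:
            self._store.popitem(last=False)  # evict least recently used
        return body
```

The TTL matters because Wikipedia content changes; a short TTL on hot articles would still absorb the bulk of repeated hits without serving stale pages for long.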
I find LLMs to be very useful at doing things I could do myself, but don't really want to.
Like making charts out of CSV data, writing cute postcard messages to my girlfriend, or passive aggressive complaints to my landlord.
I really don't get how we got here. The big 5 providers have enough money to hire/poach some scraping/indexing experts. They could avoid lots of grief by just indexing in a non-stupid way. Even if they still ignored robots.txt, they could fly under the radar with a few well-placed exceptions. But now they're risking losing easy access to a lot of the internet by doing a really bad job of a thing they really rely on.
I get that it could be incompetence in one or two places, but as far as I know all the big providers fail here... so am I missing something obvious? Is it really just greed and a VC-money-burning scrapefest?
>The big 5 providers have enough money to hire/poach some scraping/indexing experts. [...]
>I get that it could be incompetence in one or two places, but as far as I know all the big providers fail here...
Is there any indication that the scraping traffic is from the "the big providers"? The article doesn't mention this.
I am also interested in this! I really want to know who is doing this scraping, and why they are doing it so badly relative to search engines. The argument I've heard is that search engines have a vested interest in keeping sites up, whereas ML companies do not, but that seems a little too mechanistic: some of these load patterns waste the scrapers' own resources, not just the scraped sites'.
My working theory is that rather than the big AI companies, this is mainly a rash of small AI startups (globally) who don't know any better and are just writing poor tools, and also think that "unique data" gives them an advantage. So this is more about a flood of capital to (frankly) incompetent groups, which will empty out when the boom busts. But I can see the arguments on both sides. It would be great to know the facts!
Discussions:
(87 points, 1 day ago, 97 comments) https://news.ycombinator.com/item?id=43555898
(47 points, 1 day ago, 45 comments) https://news.ycombinator.com/item?id=43562005
Thanks!
Which companies are doing this? Why aren't they just downloading the Wikipedia databases as per https://en.wikipedia.org/wiki/Wikipedia:Database_download ?
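For reference, those dumps follow a predictable URL layout, so "just download the database" is genuinely a few lines of code. A sketch, assuming the standard dumps.wikimedia.org naming scheme (verify against the live dump index before depending on it):

```python
# Sketch: build and stream the latest full-article XML dump for a wiki,
# rather than crawling live pages. URL pattern per dumps.wikimedia.org.
import shutil
import urllib.request

DUMP_HOST = "https://dumps.wikimedia.org"


def latest_dump_url(wiki: str = "enwiki",
                    flavor: str = "pages-articles") -> str:
    """Build the URL of the latest XML dump for a given wiki."""
    return f"{DUMP_HOST}/{wiki}/latest/{wiki}-latest-{flavor}.xml.bz2"


def download_dump(url: str, dest: str) -> None:
    """Stream the (multi-GB) dump to disk instead of holding it in memory."""
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
```

One dump download replaces millions of per-page requests, which is presumably why the question keeps coming up.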
From the Wikipedia OKRs linked from the Wikipedia article that is the source of the Ars article:
https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_...
> Key result FA2.1: Generate 5,000,000 views from short-form video content across all owned channels by the end of H1.
Wikipedia is actively saying it should pivot to short-form video to engage young people...
Why would anyone scrape Wikipedia? Isn't there a full archive available via torrent?
These are presumably LLMs with access to web search answering user queries, not crawlers scraping data for training.
It's probably both cheaper and more accurate (for fast-changing content) for them to just hit Wikipedia's servers every time than to special-case search results pointing there and keep a local copy.
> Automated bots seeking AI model training data for LLMs
Second sentence in the article.
They talk about how a lot of the problem is bot requests hitting cold pages, and especially multimedia, that aren't cached because they're rarely accessed. So I don't think this activity is likely to be in direct response to user queries. I don't know why using an LLM would make people any more interested in very boring and obscure geographic and historical trivia than they would be via search engines.
Also discussed here: https://news.ycombinator.com/item?id=43555898
Eventually even things we took for granted like Wikipedia will be gone.
Welcome to the new dark ages.