It's such madness that all this resource usage is for LLMs that are barely useful (at least for me). So many billions in resources wasted on machine learning systems that merely mimic human speech; they aren't intelligent in any sense.
Now we see models showing the user that they're "thinking", as if they were intelligent agents, when really ploys like that exist to prop up the AI sector's stock price.
Surely LLMs in this form can't be the future of AGI?
LLMs are much better at summarizing textual content, or extracting specific pieces of information from it, than at answering complicated/niche queries from their weights alone, and that's likely (part of) what is happening here (i.e. fetching Wikipedia articles and cross-checking before answering).
Arguably this is using Wikipedia exactly for what it's designed for, although in an unexpectedly resource-intensive way. I bet just adding a web query cache for the most frequently visited URLs on the LLM provider's side could mitigate most of the negative effects here.
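To make the cache idea concrete, here's a minimal sketch: a TTL + LRU cache sitting in front of whatever fetcher the provider uses, so hot pages (popular Wikipedia articles, say) are served locally instead of re-fetched on every request. All names here are illustrative, not any provider's actual API.

```python
# Hypothetical sketch of a TTL'd, size-bounded URL cache.
import time
import urllib.request
from collections import OrderedDict


def _default_fetcher(url: str) -> bytes:
    with urllib.request.urlopen(url) as resp:
        return resp.read()


class UrlCache:
    def __init__(self, fetcher=_default_fetcher, max_entries=10_000,
                 ttl_seconds=3600):
        self._fetcher = fetcher
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._store = OrderedDict()  # url -> (fetched_at, body)

    def fetch(self, url: str) -> bytes:
        now = time.time()
        hit = self._store.get(url)
        if hit is not None and now - hit[0] < self._ttl:
            self._store.move_to_end(url)  # mark as recently used
            return hit[1]
        body = self._fetcher(url)         # cache miss (or stale): go upstream
        self._store[url] = (now, body)
        self._store.move_to_end(url)
        if len(self._store) > self._max_entries:
            self._store.popitem(last=False)  # evict least recently used
        return body
```

The TTL matters because Wikipedia content changes; a short TTL on hot articles would still absorb the bulk of repeated hits without serving stale pages for long.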
I find LLMs to be very useful at doing things I could do myself, but don't really want to.
Like making charts out of CSV data, writing cute postcard messages to my girlfriend, or passive aggressive complaints to my landlord.
I really don't get how we got here. The big 5 providers have enough money to hire/poach some scraping/indexing experts. They could avoid lots of grief by just indexing in a non-stupid way. Even if they still ignored robots.txt, they could fly under the radar with a few well-placed exceptions. But now they're risking losing easy access to a lot of the internet by doing a really bad job of a thing they really rely on.
I get that it could be incompetence in one or two places, but as far as I know all the big providers fail here... so am I missing something obvious? Is it really just greed and a VC-money-burning scrapefest?
>The big 5 providers have enough money to hire/poach some scraping/indexing experts. [...]
>I get that it could be incompetence in one or two places, but as far as I know all the big providers fail here...
Is there any indication that the scraping traffic is from the "the big providers"? The article doesn't mention this.
I am also interested in this! I really want to know who is doing this scraping, and why they are doing it so badly relative to search engines. The argument I've heard is that search engines have a vested interest in keeping sites up, whereas ML companies do not, but that seems a little too mechanistic: some of these load patterns waste the scrapers' own resources, not just the scraped sites'.
My working theory is that rather than the big AI companies, this is mainly a rash of small AI startups (globally) who don't know any better and are just writing poor tools, and also think that "unique data" gives them an advantage. So this is more about a flood of capital to (frankly) incompetent groups, which will empty out when the boom busts. But I can see the arguments on both sides. It would be great to know the facts!
Discussions:
(87 points, 1 day ago, 97 comments) https://news.ycombinator.com/item?id=43555898
(47 points, 1 day ago, 45 comments) https://news.ycombinator.com/item?id=43562005
Thanks!
Which companies are doing this? Why aren't they just downloading the Wikipedia databases as per https://en.wikipedia.org/wiki/Wikipedia:Database_download ?
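For reference, those dumps follow a predictable URL layout, so "just download the database" is genuinely a few lines of code. A sketch, assuming the standard dumps.wikimedia.org naming scheme (verify against the live dump index before depending on it):

```python
# Sketch: build and stream the latest full-article XML dump for a wiki,
# rather than crawling live pages. URL pattern per dumps.wikimedia.org.
import shutil
import urllib.request

DUMP_HOST = "https://dumps.wikimedia.org"


def latest_dump_url(wiki: str = "enwiki",
                    flavor: str = "pages-articles") -> str:
    """Build the URL of the latest XML dump for a given wiki."""
    return f"{DUMP_HOST}/{wiki}/latest/{wiki}-latest-{flavor}.xml.bz2"


def download_dump(url: str, dest: str) -> None:
    """Stream the (multi-GB) dump to disk instead of holding it in memory."""
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
```

One dump download replaces millions of per-page requests, which is presumably why the question keeps coming up.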
From the Wikipedia OKRs linked from the Wikipedia article that is the source of the Ars article:
https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_...
> Key result FA2.1: Generate 5,000,000 views from short-form video content across all owned channels by the end of H1.
Wikipedia is actively saying it should pivot to short-form video to engage young people...
Why would anyone scrape Wikipedia? Isn't there a full archive available via torrent?
These are presumably LLMs with access to web search answering user queries, not crawlers scraping data for training.
It's probably both cheaper and more accurate (for fast-changing content) for them to just hit Wikipedia's servers every time than to special-case search results pointing there and keep a local copy.
> Automated bots seeking AI model training data for LLMs
Second sentence in the article.
They talk about how a lot of the problem is bot requests hitting cold pages, and especially multimedia, that aren't cached because they're rarely accessed. So I don't think this activity is likely to be in direct response to user queries. I don't know why using an LLM would make people any more interested in very boring and obscure geographic and historical trivia than they would be via search engines.
Also discussed here: https://news.ycombinator.com/item?id=43555898
Eventually even things we took for granted like Wikipedia will be gone.
Welcome to the new dark ages.