"As a condition of accessing this website, you agree to abide by the following content signals ..."
which means robot operators are now better off if they never download the robots.txt file, because then you know for sure that you won't accidentally encounter these conditions.
This creates a legal risk if you try to adhere to the robots.txt, so it'll make future AI bots less controllable and less restricted.
That would be an interesting court case. I'm doubtful that companies will be held to agreements that they didn't even see and even their bots didn't explicitly agree to?
This isn't even like a shrinkwrap license where the bot would have to press the "I agree" button.
Cloudflare's other initiative where they help websites to block bots seems more likely to work.
On the other hand if an LLM sees this then maybe it will be confused by it.
If nothing else, it might provide a more consistent way to signal to the crawlers which do respect robots.txt. For example Google, Bing and Apple already offer ways to signal that your site approves of search indexing but not training, but they each require a different non-standard signal for the same thing.
For the crawlers that ignore robots.txt nothing changes of course, and for the ones which claim to support a training opt-out you just have to take them at their word that it actually does anything.
that definition of free was always about liberty, not cost! serving information costs money. requesters aren't entitled to the bandwidth costs!
the centralized walled gardens we have today are a direct result of people confusing "free" to mean "no cost" instead of "I can do whatever I want with it"
They also suggest you add this line to the robots.txt file:
> # ANY RESTRICTIONS EXPRESSED VIA CONTENT SIGNALS ARE EXPRESS RESERVATIONS OF RIGHTS UNDER ARTICLE 4 OF THE EUROPEAN UNION DIRECTIVE 2019/790 ON COPYRIGHT AND RELATED RIGHTS IN THE DIGITAL SINGLE MARKET.
But that legal restriction only applies to Europeans, so this only prevents European AI companies from developing models on par with American and Chinese AI. Correct me if I'm wrong.
Also open-source AI models like Llama and DeepSeek will be trained on all websites and released for free anyway, and those models can be used by Europeans, so in practice this policy won't really keep your website out of AI use anyway.
Ultimately it's just serving to prevent Europeans from developing equally good AI models and AI search apps.
This is massively counterproductive!
They add to the robots.txt file:
"As a condition of accessing this website, you agree to abide by the following content signals ..."
which means robot operators are now better off if they never download the robots.txt file, because then they know for sure that they won't accidentally encounter these conditions.
This creates a legal risk if you try to adhere to the robots.txt, so it'll make future AI bots less controllable and less restricted.
That would be an interesting court case. I'm doubtful that companies will be held to agreements that they never saw and that even their bots didn't explicitly agree to.
This isn't even like a shrinkwrap license where the bot would have to press the "I agree" button.
Cloudflare's other initiative where they help websites to block bots seems more likely to work.
On the other hand, if an LLM sees this, it might be confused by it.
It is my impression (feel free to correct me, I'm no lawyer) that in the USA a case along these lines has already happened: https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn
That is interesting, but the implications seem complicated. I wonder how a court would rule in a different case?
As this is a robots.txt, it's still not enforced. So how much good can this really do?
If nothing else, it might provide a more consistent way to signal to the crawlers which do respect robots.txt. For example Google, Bing and Apple already offer ways to signal that your site approves of search indexing but not training, but they each require a different non-standard signal for the same thing.
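For example (to my understanding: Google and Apple each use a differently named user agent token for the same opt-out, while Bing's equivalent control is a page meta tag rather than a robots.txt token):

```
# Google: stay in search, opt out of AI training
User-agent: Google-Extended
Disallow: /

# Apple: the same intent requires a different token
User-agent: Applebot-Extended
Disallow: /
```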
For the crawlers that ignore robots.txt nothing changes of course, and for the ones which claim to support a training opt-out you just have to take them at their word that it actually does anything.
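To illustrate the "not enforced" point: standard robots.txt parsers simply skip directives they don't recognize, so content signals are invisible to a compliant crawler unless it explicitly implements them. A minimal sketch with Python's stdlib (`ExampleBot` and the URLs are made up; the Content-Signal syntax follows Cloudflare's proposal):

```python
from urllib import robotparser

# The stdlib parser silently ignores directives it does not recognize,
# so the Content-Signal line below has no effect on its decisions.
rp = robotparser.RobotFileParser()
rp.parse([
    "Content-Signal: search=yes, ai-train=no",
    "User-agent: *",
    "Disallow: /private/",
])

# Only the standard Disallow rule is enforced.
print(rp.can_fetch("ExampleBot", "https://example.com/private/page"))  # False
print(rp.can_fetch("ExampleBot", "https://example.com/index.html"))    # True
```

The same applies to any crawler built on an off-the-shelf robots.txt library: honoring the signals requires deliberate extra work.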
Feels like if any bot doesn't respect it, then inevitably the data ends up in every bot's training data.
The web has fallen so far from "information wants to be free".
The good parts still exist.
They just don't grow as fast as the bad parts of the web.
But like email, we're perhaps drowning in spam.
"Information wants to be monetized."
that definition of free was always about liberty, not cost! serving information costs money. requesters aren't entitled to have someone else cover the bandwidth costs!
the centralized walled gardens we have today are a direct result of people confusing "free" to mean "no cost" instead of "I can do whatever I want with it"
They also suggest you add this line to the robots.txt file:
> # ANY RESTRICTIONS EXPRESSED VIA CONTENT SIGNALS ARE EXPRESS RESERVATIONS OF RIGHTS UNDER ARTICLE 4 OF THE EUROPEAN UNION DIRECTIVE 2019/790 ON COPYRIGHT AND RELATED RIGHTS IN THE DIGITAL SINGLE MARKET.
But that legal restriction only applies to Europeans, so this only prevents European AI companies from developing models on par with American and Chinese AI. Correct me if I'm wrong.
Also open-source AI models like Llama and DeepSeek will be trained on all websites and released for free anyway, and those models can be used by Europeans, so in practice this policy won't really keep your website out of AI use anyway.
Ultimately it's just serving to prevent Europeans from developing equally good AI models and AI search apps. This is massively counterproductive!
If the EU had balls it would block American AI just as China told Google to fuck off.
It would also stop the brain drain.
From the title, I thought this was about giving (end) users choice, not giving website owners the choice to restrict their users.
Not that it matters much anyway, since thankfully users always have the choice to just ignore it.
I’m really curious how this plays out when you start using an AI browser like Comet.