Buy some cheap computer, like an X99 board with a Xeon from AliExpress, add some cheap GPU like a Tesla K80, and "train" your LLM models on it. Now you can pirate whatever you want and you are untouchable, because every big AI company will give you lawyers free of charge: if a judge ruled against you, the precedent would be against them as well.
The FSF commissioned a series of whitepapers on Copilot when it was first released. One of them, "Copyright Implications of the Use of Code Repositories to Train a Machine Learning Model", had a professor of law as its lead author; its analysis concluded that the use of copyrighted works for training was legally defensible. The paper is here: https://www.fsf.org/licensing/copilot/copyright-implications...
They are the same; however, one of them has the money to defend against the accusations.
Recall the "Golden Rule": those who have the gold make the rules.
Western politics is all about constructing these narratives that hide the hypocrisy and self-serving nature of the dominant political factions. You can see it everywhere, but this is one clear example of it.
Because ordinary people don't get to make the calls.
I do not take strong views on what "should" be; the following is merely my opinion on what "is".
The legal judgement in the Anthropic case may answer your question, albeit with the caveat that I'm not a lawyer, that I have no legal training, and that I may be misreading what looks like plain language but has an importantly different meaning in law.
The judgement is here: https://cases.justia.com/federal/district-courts/california/...
To quote parts of the section "overall analysis" (page 30):
The copies used to train specific LLMs were justified as a fair use. Every factor but the nature of the copyrighted work favors this result. The technology at issue was among the most transformative many of us will see in our lifetimes.

…

The downloaded pirated copies used to build a central library were not justified by a fair use. Every factor points against fair use. Anthropic employees said copies of works (pirated ones, too) would be retained “forever” for “general purpose” even after Anthropic determined they would never be used for training LLMs. A separate justification was required for each use. None is even offered here except for Anthropic’s pocketbook and convenience.

In a way, this seems to be a repeat of the "The 'L' in 'ML' is 'learning'" argument:

You are not allowed to use the photocopier in the library to make a copy of the entire book. If your local library is anything like the ones I remember back in the UK, there's even a sign right next to the photocopier telling you this.
You are, in fact, allowed to go to a public library, learn things from the books within, and apply that knowledge without paying anything to any copyright holder. Likewise if you buy a book: once it's been bought, you don't owe the copyright holder anything for having learned something from it. This is the point of a library, of education, and indeed of copyright: the word is literally the right to make a copy, as in giving authors control over who may make copies; it is not the right to an eternal rent on whatever is learned by reading a copy.
(If you then over-train a model so that it does emit verbatim copies, that is bad for both legal and technical reasons: legal, because the output is a copy; technical, because using a neural net as a lossy compressor for documents is a terrible waste of resources, and it resembles humans in exactly the way nobody has any interest in reproducing in silicon.)
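To make the "lossy compression" aside concrete, here is a minimal sketch; everything in it (PyTorch, the tiny character-level LSTM, the one-sentence corpus) is my own illustrative assumption, not anything from the ruling or the papers above. Over-train a net on a single passage and greedy decoding hands the passage back verbatim, because the weights have become nothing more than an inefficient copy of it:

```python
# Minimal sketch of the "over-trained net = lossy copy" point. Assumptions:
# PyTorch is installed, a tiny character-level LSTM stands in for "a model",
# and the "corpus" is a single sentence.
import torch
import torch.nn as nn

torch.manual_seed(0)
text = ("You are allowed to learn from a book; "
        "you are not allowed to photocopy the entire book.")
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
data = torch.tensor([stoi[c] for c in text])  # shape: (len(text),)

class TinyLM(nn.Module):
    def __init__(self, vocab, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, 32)
        self.rnn = nn.LSTM(32, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, x, state=None):
        h, state = self.rnn(self.emb(x), state)
        return self.out(h), state

model = TinyLM(len(chars))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
x, y = data[:-1].unsqueeze(0), data[1:].unsqueeze(0)

# "Over-train": thousands of epochs over a single ~90-character document.
for _ in range(2000):
    logits, _ = model(x)
    loss = nn.functional.cross_entropy(logits.transpose(1, 2), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Greedy decoding from the first character: the weights now amount to an
# extremely inefficient compressed copy of the input, so it prints it back.
with torch.no_grad():
    idx, state, out = data[:1].unsqueeze(0), None, [text[0]]
    for _ in range(len(text) - 1):
        logits, state = model(idx, state)
        idx = logits[:, -1:].argmax(-1)
        out.append(chars[idx.item()])
print("".join(out))
```

At a sane scale, with a corpus vastly larger than the parameter budget, a model cannot pull this trick and has to generalize instead, which is the distinction the parenthetical above is pointing at.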
Time to "train" on Marshall McLuhan:
See The Gutenberg Galaxy (book) (1962)
McLuhan's Wake (documentary movie, narrated by Laurie Anderson) (2002)
Re Wake: Listen to the accompanying full interviews with McLuhan's colleagues from which the documentary is drawn.
Same reason that when a person lies (sometimes even by omission) it's called "fraud", but when a company does it, it's just business as usual or, at worst, a "mistake" resolved by employee training.