While illegal downloading isn’t technically classified as “theft” under the law, since it is handled as copyright infringement instead, it is still theft in practice: you’re taking something of value without paying for it or having permission. Essentially, you’re depriving the creator of potential earnings, much like stealing physical goods. This article contains quite a lot of links to further sources so that the claims made here can be verified. You can of course google the claims that are not sourced, and they are backed up as well.
When it comes to training the models, that most likely falls under fair use. The models learn a little like a person learns from looking at paintings in an art gallery or museum: their aim is to adapt the data, not replicate it. But if you do not pay the fee to enter the museum in the first place, you are stealing the experience. This is the case with AI: the “theft” occurs in the process of creating or acquiring a dataset that contains copyrighted data without proper licensing or permission from its owner. Also, if the AI outright replicates the copyrighted material, it could be argued to have committed copyright infringement as well (though in these cases the AI training has usually failed, except for text).
Downloading images or other copyrighted data from the internet is copyright infringement even if they are publicly available: https://ogc.harvard.edu/pages/copyright-and-fair-use
People also need to understand that when these companies (or you) download an image from the publicly available internet, you are making a separate copy outside the medium for which the creator licensed it (that is, beyond the copy that resides in your RAM or your web browser’s cache).
The exception, of course, is fair use, but that rarely applies.
“fair use law allows someone or a company to use copyrighted material without consent as long as certain conditions are met – for instance, if it’s used for teaching or research or criticism or news reporting. You know, this law is intended to encourage freedom of expression, but there are real limits on it. For instance, the Supreme Court has said that if copyrighted material is used to make something new and that new thing competes with the original copyrighted work, that is not fair use.” –BOBBY ALLYN
Source: The Effect of the Use Upon the Potential Market
OpenAI is becoming a for-profit company.
source: reuters.com/openai-remove-non-profit-control
What does this have to do with AI companies?
Data needs to be locally accessible in order to train an AI model. This means the copyrighted data must be downloaded.
Cases where OpenAI has been sued for illegally using copyrighted data:
- https://www.businessinsider.com/openai-lawsuit-copyrighted-data-train-chatgpt-court-tech-ai-news-2024-6
- https://www.johnpobrienesq.com/openai-sued-for-using-copyrighted-material-to-train-chatgpt/
- https://www.npr.org/2023/08/18/1194562272/openai-is-facing-lawsuits-over-copyrighted-materials-it-uses-to-train-chatgpt (source of the Bobby Allyn quote)
OpenAI admitting wide use of copyrighted data:
https://www.theguardian.com/technology/2024/jan/08/ai-tools-chatgpt-copyrighted-material-openai
The same article notes that other AI companies are also being sued:
“Getty Images, which owns one of the largest photo libraries in the world, is suing the creator of Stable Diffusion, Stability AI, in the US and in England and Wales for alleged copyright breaches. In the US, a group of music publishers including Universal Music are suing Anthropic, the Amazon-backed company behind the Claude chatbot, accusing it of misusing “innumerable” copyrighted song lyrics to train its model.”
Many AI companies use data laundering to hide their copyright infringements:
- https://waxy.org/2022/09/ai-data-laundering-how-academic-and-nonprofit-researchers-shield-tech-companies-from-accountability/
- https://medium.com/discourse/is-big-tech-using-data-laundering-to-cheat-artists-ccf1a8c87b91
- https://waxy.org/2022/08/exploring-12-million-of-the-images-used-to-train-stable-diffusions-image-generator/
Some of the datasets only contain links to the data. But in order to verify the links, their creators have needed to download and process the copyrighted material. Some of them claim that downloading the files only briefly is fair use, as they delete them right away. But considering that they often receive funding from parties that profit from these datasets, this is just pure data laundering. More importantly, when AI companies use these links, they themselves have to download the files.
More about scraping, data laundering, lawsuits, and generative AI:
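To make the point about link-only datasets concrete, here is a minimal sketch of what "using the links" involves. Everything in it is an assumption made for illustration: the `LinkRecord` type, the `materialize` function, the example URLs, and the `fake_fetch` stand-in that replaces real network access so the sketch runs offline. It does not describe any particular company's pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class LinkRecord:
    """One row of a hypothetical link-only dataset: a URL plus a caption."""
    url: str
    caption: str


def materialize(records: List[LinkRecord],
                fetch: Callable[[str], bytes]) -> Dict[str, bytes]:
    """Resolve every link to bytes held locally.

    Each successful fetch leaves a full copy of the file on the machine
    running this code; dead links are skipped.
    """
    local_copies: Dict[str, bytes] = {}
    for record in records:
        try:
            local_copies[record.url] = fetch(record.url)
        except KeyError:
            continue  # the stand-in fetcher raises KeyError for dead links
    return local_copies


# Stand-in for the network so the sketch runs offline (hypothetical data).
FAKE_WEB = {
    "https://example.com/a.png": b"bytes of image a",
    "https://example.com/b.png": b"bytes of image b",
}


def fake_fetch(url: str) -> bytes:
    return FAKE_WEB[url]


records = [
    LinkRecord("https://example.com/a.png", "a painting"),
    LinkRecord("https://example.com/b.png", "a photo"),
    LinkRecord("https://example.com/gone.png", "a dead link"),
]

local_copies = materialize(records, fake_fetch)
print(len(local_copies), "of", len(records), "links became local copies")
```

The dataset itself holds no image data, yet nothing can be verified or trained on until `materialize` has produced a local copy of each file.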
https://www.createdontscrape.com/
https://www.createdontscrape.com/pretrainingfine-tuning-why-you-need-to-know
Some people make the excuse that the individual images are not worth anything, and thus companies should not be required to compensate people for the time and effort they have spent creating these images. If these images were not valuable to the companies, they would not bother to use them, but they do, even at the risk of legal consequences.
There are AI companies that seek ethical and correctly licensed datasets to train their models. But it would be ridiculously naive to think that the makers of the most successful models are not illegally or maliciously pulling copyrighted works into their datasets in order to gain an advantage in model performance.
Most of the companies hide behind non-profits (OpenAI is looking to restructure to become a for-profit company).
(The makers of Stable Diffusion share older models for free while making a profit by selling access to newer models. They also launder the profits so that they appear to come from selling the infrastructure to run their models.)
The problems for owners of copyrighted works
Works being downloaded against their owners’ consent and laundered into AIs is a serious problem.
1: People are not compensated when their works are downloaded without proper licensing and used to create commercial products against their will. These copyrighted works are a de facto requirement for training an AI model that will be used commercially to make a profit, and thus should be licensed accordingly. Downloading, i.e. acquiring, copyrighted data for this use is not fair use, even if the training itself might be.
2: The works produced by AIs compete with the original pieces. When a person (even one who would never have been a customer for the original pieces) generates data from the copyrighted works and makes it available to others, they flood and saturate the market while also skewing the algorithms that would otherwise lead paying customers to the original owner of the copyrighted material.
Now the original owner and creator of the copyrighted work must compete with works that adapt and imitate their own. The profit model that previously let others enjoy their works for free (with income coming from ads and other non-linear sources) has been destroyed by exploitation, and they need to find another income model and another way to advertise.
This results in stagnation and a loss of innovation, as people no longer get compensated for their work, nor see a point in doing it for free when it gets lost in the void.
These unethical methods of using copyrighted data without consent lead to short-term gains with negative long-term effects for both humans and probably the whole AI field.
Even if one might find loopholes in current legal frameworks, the practice is obviously unethical.
Just ask yourself: Why do you think people should not be compensated for the time they spent making these works? Why do you think you or these companies should be entitled to benefit financially from other people’s work without any compensation to those people?
As Kantian ethics suggests, we have a duty to treat others’ contributions with respect and fairness. Therefore, bypassing rightful compensation not only undermines the creator’s rights but also erodes the moral foundation of trust and mutual respect upon which all fair societies are built.