AI in the dock for copyright violations
China Daily | Updated: 2024-01-17 08:15
The New York Times filed a lawsuit against OpenAI and Microsoft in December, alleging that the companies illegally used millions of its articles to train their large language artificial intelligence models. To support its case, The New York Times provided over 100 examples in which ChatGPT's output was highly similar to its articles.
In response, OpenAI issued a statement on Jan 8 saying that using publicly available internet materials to train AI models is reasonable, and that OpenAI provides publishers the option to opt out. It suggested that the "copying" and regurgitation of original text demonstrated by The New York Times in its lawsuit resulted from the newspaper's deliberate manipulation of prompts, including the use of lengthy summaries of articles, in order to make the models reproduce entire parts of specific pieces of content. OpenAI nonetheless acknowledged that such regurgitation "is a rare bug that we are working to drive to zero".
In a deeper sense, their disagreement is about the ethics of large language models. AI companies such as OpenAI argue that the training of LLMs, which are models that can generate humanlike responses to natural language queries based on massive data sets, is fundamentally different from copying. They say the learning and training process of AI models should be understood by analogy with human development: learning from public information, building a reserve of knowledge, and improving through interaction with the people they serve.
Media organizations such as The New York Times, on the other hand, see the technology as a competitor and a threat, and believe that LLMs are plagiaristic and violate media ethics.
Whatever the lawsuit's outcome, it will not only set a precedent on whether companies developing LLMs must pay high copyright fees for their data sources, but also decide which characterization of LLM training will legally prevail.