OpenAI admits it’s impossible to train generative AI without copyrighted materials

by Shivendra Singh · January 9, 2024

OpenAI and its biggest backer, Microsoft, are facing several lawsuits accusing them of using other people’s copyrighted works without permission to train the former’s large language models (LLMs). And based on what OpenAI told the House of Lords Communications and Digital Select Committee, we might see more lawsuits against the companies in the future. It would be “impossible to train today’s leading AI models without using copyrighted materials,” OpenAI wrote in its written evidence (PDF) submission for the committee’s inquiry into LLMs, as first reported by the The Guardian.

The company explained that it’s because copyright today “covers virtually every sort of human expression — including blog posts, photographs, forum posts, scraps of software code, and government documents.” It added that “[l]imiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.” OpenAI also insisted that it complies with copyright laws when it trains its models. In a new post on its blog made in response to the The New York Times’ lawsuit, it said the use of publicly available internet …read more