Is Copyright's Protection Gone?

Over the past couple of years, AI has exploded, particularly in the form of large language models (LLMs) such as GPT and Gemini, which have spread into nearly every aspect of our digital lives. However, as these applications multiply, so does an underlying concern: the growing risk of copyright infringement.
We're witnessing a deluge of lawsuits against AI providers and developers, spanning everything from news articles to video content on media giants like YouTube. Remember the viral moment when OpenAI's top management couldn't definitively say whether GPT was trained on YouTube data? That hesitation perfectly encapsulates the murky waters we're navigating. This isn't just theory; it's a debate unfolding in real time, and I believe it will continue for a long time to come.
A Familiar Playbook: The Social Media Parallel
To be honest, this whole situation gives me a profound sense of déjà vu. It strongly reminds me of an earlier moment we all lived through with social media platforms like Facebook and Twitter.
Think about it: these platforms started as open spaces, allowing users to share anything and everything. Over time, as content grew and controversies mounted, so did the need for "censorship" or content moderation. Yet, from a commercial perspective, these platforms often indirectly monetized the very "infringements" or prohibited actions they claimed to fight. A controversial post, even if later removed, often went viral, generated immense attention, and drove traffic, which means ad revenue. And once content is out on the internet, it tends to stay there forever, regardless of a platform's attempts to delete it.
AI's Parallel Predicament
And frankly, I believe we are facing the exact same problem with large language models today. From a 10,000-foot perspective, the similarities between how social media platforms evolved and how LLMs are developed, upgraded, and deployed are striking.
One key parallel is the "platform" concept itself. On social media, users share content that goes public instantly. With LLMs, the output doesn't necessarily go public in the same way. However, these models are trained on the vast amounts of data users share with them, directly or indirectly.
This brings us to the core dilemma: why can't AI providers simply "censor" the training data? What kind of data are we even talking about? We're talking about copyright-protected material: news articles, books, media, music, and more.
In theory, censoring this data sounds straightforward. In reality, it's almost impossible. How can an LLM differentiate between:
- Authentic, original content created by the user?
- Copyrighted content a user is lawfully interacting with for private consumption, analysis, or where they hold a license (e.g., they bought the ebook, streamed the song, or have permission to analyze a video)?
The LLM doesn't really differentiate. It processes whatever it's fed, not just by one user, but by millions across the globe.
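To make the difficulty concrete, here is a minimal sketch, in Python, of what a naive training-data filter could look like. It is purely illustrative, not something any provider is known to run: the function names, the overlap threshold, and the tiny registry of known copyrighted passages are all assumptions invented for this example.

```python
import re

# A naive "training-data filter" sketch: flag submitted text that overlaps
# heavily with a (hypothetical) registry of known copyrighted passages.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Normalize to lowercase words and return overlapping word n-grams."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

# Hypothetical registry of excerpts we *know* are copyrighted.
# In reality, no complete registry like this exists; that is the core problem.
KNOWN_COPYRIGHTED = [
    "It was the best of times, it was the worst of times, "
    "it was the age of wisdom, it was the age of foolishness",
]
REGISTRY = [ngrams(passage) for passage in KNOWN_COPYRIGHTED]

def looks_copyrighted(submission: str, threshold: float = 0.5) -> bool:
    """Flag a submission whose n-grams overlap heavily with a registered passage."""
    grams = ngrams(submission)
    if not grams:
        return False
    return any(
        len(grams & registered) / len(grams) >= threshold
        for registered in REGISTRY
    )

# A match only tells us the text *looks like* a known work. It cannot tell us
# whether the user wrote it, bought the ebook, or holds a license to analyze it.
print(looks_copyrighted("it was the best of times, it was the worst of times, "
                        "it was the age of wisdom"))  # True
print(looks_copyrighted("an original poem the user wrote this morning"))  # False
```

Even this toy version only catches works that already sit in its registry; scaling it to every news article, book, song, and video ever published, and then attaching each user's licensing status to every match, is where "almost impossible" comes from.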