EU orders AI companies to clean up their act, stop using pirated data


The European Commission wants AI companies to stop using pirated data and to allow creators to withhold their copyrighted material. This comes amid the rise of a massive global workforce of remote workers from poor countries, who provide bespoke data via third-party brokers. We take a closer look in this edition of Tech 24.

On Thursday, the European Commission released a highly anticipated set of guidelines for companies developing advanced artificial intelligence chatbots, laying out how to do so responsibly.

The General-Purpose AI Code of Practice is voluntary, but it is seen as a handbook for companies seeking to abide by the EU's landmark regulation, the AI Act.

The guidelines cover AI safety, copyright and transparency, and apply to the companies making advanced, generalist AI apps like ChatGPT, Claude, Gemini and Le Chat, developed by OpenAI, Anthropic, Google and Mistral respectively.

Tech lobbies say it goes too far, and civil society groups say it's been watered down by the very same tech lobbies.

Industry lobby CCIA Europe said in a statement that the Code of Practice "imposes a disproportionate burden on AI providers".

Meanwhile, The Future Society think tank said the guide "means that potentially dangerous models get to European users without receiving any meaningful scrutiny from the AI Office (responsible for enforcing the AI Act)."

The Future Society argues that the EU wants to look innovation-friendly and not annoy US President Donald Trump, who's criticised the AI Act. They say tech lobbyists were given exclusive access to the final version of the Code of Practice, because the EU was keen to make sure that as many of these companies as possible sign up to the Code.

Indeed, the European Commission might well argue there's not much point in a Code of Practice if no one signs up to it.

Data drama

Data is the lifeblood of these AI models: what you feed into them is crucial to how they work. Until now, most AI companies haven't made it very clear what data they're using, or how.

This is about to change.

Signatories will have to report on their training methods and data – how the data was collected, what kind of data it is, and evidence that they obtained the rights to any third-party data.
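The Code doesn't prescribe a file format for this reporting, but the disclosure it describes amounts to a structured provenance record for each data source. The sketch below is purely illustrative – the field names are our own invention, not the Code's official documentation template:

```python
# Hypothetical illustration only: these field names are invented for
# this article, not taken from the Code of Practice's own template.
training_data_record = {
    "source": "licensed news archive",               # where the data came from
    "acquisition_method": "commercial licence",      # how it was obtained
    "data_type": "text (news articles)",             # what kind of data it is
    "rights_evidence": "licence agreement on file",  # proof of third-party rights
}
```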

They'll also need to let independent external evaluators inspect their models, including letting them look at relevant training data.

One particularly thorny issue is copyrighted data.

Web crawlers have picked up everything online – even copyrighted content – and fed it into the machine, and many artists and authors feel their work has been stolen for profit.

Recent court documents show Anthropic has attempted to compile a library of every book in the world, going so far as to order second-hand books en masse and scan them page by page.

That's not all: vast swathes of deliberately pirated material have also been used to train these models.

For the first time, the Code of Practice asks companies to commit to not using databases of pirated content to train their models, and to allow rights holders to opt out of having their work used for training.
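In practice, such an opt-out rests on machine-readable signals, and robots.txt is the mechanism most often cited for reserving rights against text and data mining. Here is a minimal sketch of how a crawler might honour it – assuming a hypothetical crawler token, "ExampleAIBot", since each company publishes its own:

```python
# Minimal sketch: check a site's robots.txt before using a page for
# training. "ExampleAIBot" is a hypothetical user-agent token.
from urllib.parse import urlsplit
from urllib import robotparser

def may_train_on(url: str, agent: str = "ExampleAIBot") -> bool:
    """True only if the site's robots.txt allows this crawler to fetch the page."""
    parts = urlsplit(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt
    return rp.can_fetch(agent, url)

# A rights holder who adds "User-agent: ExampleAIBot" / "Disallow: /" to
# robots.txt would make this return False for every page on the site.
print(may_train_on("https://example.com/essay.html"))
```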

This comes hot on the heels of some important rulings in the United States, which provide the first pieces of legal precedent on the use of copyrighted material.

Three decisions from judges in California in the last few weeks have leaned towards treating the use of copyrighted material to train models as "fair use", without giving a free pass on pirated content. But given that copyright trials are by nature decided case by case, expect a lot more lawsuits still to come.

Top dollar for top data

Meanwhile, the market for high-quality, proprietary data has exploded overnight.

This is not just about avoiding copyright lawsuits: AI companies also want to push their models to be competitive and give the best answers possible.

Last month, Meta invested more than $14 billion in Scale AI. The startup provides bespoke training data to several AI companies, but the deal has some of them flocking to Scale AI's rivals.

Turing is one of these data providers. CEO Jonathan Siddharth told FRANCE 24 that business has been booming in the last couple of weeks.

His business model is based on millions of freelance software engineers and experts in poor countries – a massive new gig economy for the AI age.

"Decent pay, no micromanagement, flexible hours," said one of these workers from India, who was recently laid off and to whom we granted anonymity. He added: "Only problem was zero job security, no paid leaves, often not enough work which led to less pay."

We asked Siddharth about the data and piracy issue earlier this week.

"Our clients pay for data which they basically own, which is different from scraping content from the internet," he said. "I do think we have to figure out new models. What does the world look like when you are creating content that would be ingested by an agent or an LLM in the future? I think we're still figuring that out."
