What Is ‘Model Collapse’? An Expert Explains the Rumors About an Impending AI Doom

By Aaron J. Snoswell

Published 21 August 2024

Artificial intelligence (AI) prophets and newsmongers are forecasting the end of the generative AI hype, with talk of an impending catastrophic “model collapse”. But how realistic are these predictions? And what is model collapse anyway?

First discussed in 2023 but popularised more recently, “model collapse” refers to a hypothetical scenario where future AI systems get progressively dumber as AI-generated data accumulates on the internet.

The Need for Data
Modern AI systems are built using machine learning. Programmers set up the underlying mathematical structure, but the actual “intelligence” comes from training the system to mimic patterns in data.
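
As a concrete illustration, here is a minimal sketch in Python of what “training” means, using an invented toy dataset: the programmer fixes the structure (here, a straight line), and fitting finds the parameters that mimic the pattern in the examples. Nothing in this sketch describes any particular production system.

```python
# Minimal sketch of machine learning: the programmer fixes the structure
# (a line, y = w*x + b); "training" finds the parameters that mimic the data.
# The data and the "true pattern" below are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)

# Training data: noisy examples of an underlying pattern (y = 3x + 1).
x = rng.uniform(-1, 1, size=200)
y = 3 * x + 1 + rng.normal(scale=0.1, size=200)

# Training here is least-squares fitting of w and b to the examples.
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"learned w={w:.2f}, b={b:.2f}")  # recovers roughly (3, 1)
```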

But not just any data. The current crop of generative AI systems needs high-quality data, and lots of it.

To source this data, big tech companies such as OpenAI, Google, Meta and Nvidia continually scour the internet, scooping up terabytes of content to feed the machines. But since the advent of widely available and useful generative AI systems in 2022, people are increasingly uploading and sharing content that is made, in part or whole, by AI.

In 2023, researchers started wondering if they could get away with only relying on AI-created data for training, instead of human-generated data.

There are huge incentives to make this work. In addition to proliferating on the internet, AI-made content is much cheaper to source than human data. Collecting it en masse also isn’t ethically and legally questionable in the way that collecting human data is.

However, researchers found that without high-quality human data, AI systems trained on AI-made data get dumber and dumber as each model learns from the previous one. It’s like a digital version of the problem of inbreeding.

This “regurgitive training” seems to lead to a reduction in the quality and diversity of model behaviour. Quality here roughly means some combination of being helpful, harmless and honest. Diversity refers to the variation in responses, and to whose cultural and social perspectives are represented in AI outputs.
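
The dynamic is easy to reproduce with a toy model. In the hypothetical sketch below, each “generation” fits a very simple statistical model (just a mean and a spread) to data generated by the previous generation. With no fresh human data entering the loop, estimation errors compound and the spread, a crude stand-in for diversity, tends to dwindle.

```python
# Toy simulation of regurgitive training. Generation 0 is "human data" from
# a true distribution; each later generation is trained only on samples
# produced by the previous generation's model. Estimation errors compound,
# and the data's spread (its diversity) tends to shrink towards zero.
# An illustrative sketch only, not a model of any real AI system.
import numpy as np

rng = np.random.default_rng(0)

data = rng.normal(loc=0.0, scale=1.0, size=25)    # generation 0: human data

for generation in range(101):
    mu, sigma = data.mean(), data.std()           # "train" on current data
    if generation % 20 == 0:
        print(f"gen {generation:3d}: mean = {mu:+.3f}, spread = {sigma:.3f}")
    data = rng.normal(mu, sigma, size=25)         # next model sees only AI output
```

In this kind of loop, rare values in the tails of the distribution vanish first, which is why diversity can degrade even while typical outputs still look plausible.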

In short: by using AI systems so much, we could be polluting the very data source we need to make them useful in the first place.

Avoiding Collapse
Can’t big tech just filter out AI-generated content? Not really. Tech companies already spend a lot of time and money cleaning and filtering the data they scrape. One industry insider recently shared that they sometimes discard as much as 90% of the data initially collected for training models.
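
For a sense of what that cleaning involves, here is a hypothetical sketch of the kind of heuristic rules such pipelines apply. The function and its thresholds are invented for illustration; the point is that rules like these catch spam and boilerplate, but none of them can reliably tell AI-generated text from human text.

```python
# Hypothetical sketch of heuristic data cleaning for scraped web text.
# The rules and thresholds are invented for illustration. Note that none
# of these checks can identify AI-generated text: they filter on surface
# quality, not on authorship.
def keep_document(text: str) -> bool:
    words = text.split()
    if len(words) < 50:                        # too short to be useful
        return False
    if len(set(words)) / len(words) < 0.3:     # highly repetitive or spammy
        return False
    letters = sum(ch.isalpha() for ch in text)
    if letters / max(len(text), 1) < 0.6:      # mostly markup or noise
        return False
    return True

print(keep_document("buy now " * 100))  # False: repetitive spam is discarded
```

Real pipelines chain many more rules of this kind, which is how such large fractions of scraped data end up being discarded.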