AI Model Collapse

by Timothy Coleman - 07/Dec/2024

If you have been following AI for any length of time, you might have come across the concept of “model collapse”, sometimes called a “synthetic data feedback loop”.

Let's look at a relatively harmless example in AI image generation...

A Simple Analogy

Before AI, there were A LOT of pictures of cats on the internet.

AI comes along, gathers pictures of said cats, and 'learns what a cat looks like' from these images.

Image generation AI allows us to create new pictures of cats, and people make LOTS of pictures of cats.

One day, there are more AI-generated pictures of cats on the internet than there are pictures of real cats.

AI continues to gather new pictures of cats so it can keep refining its understanding of what a cat looks like, and generate new (improved?) images of cats.

The issue here is not that AI is learning how to make better images of cats; the issue is that we are training AI image generation tools on AI-generated images, because there are (or will eventually be) more AI-generated images of cats out there than images of real cats.
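This feedback loop can be sketched in a few lines of code. The following toy simulation is my own illustration (the Gaussian distribution standing in for "cat pictures", the sample size, and the generation count are all assumptions): each generation, a simple statistical "model" is refit on samples drawn from the previous generation's model, and the diversity of its output steadily collapses.

```python
import random
import statistics

# Toy sketch of a synthetic-data feedback loop: fit a Gaussian "model"
# to samples drawn from the previous generation's model, then train the
# next generation on that output. The learned spread (sigma) collapses.

random.seed(42)

mu, sigma = 0.0, 1.0   # generation 0: the distribution of "real cats"
n_samples = 20         # a small scrape of training data per generation

for generation in range(1000):
    # Draw training data from the current model (AI-generated output)...
    data = [random.gauss(mu, sigma) for _ in range(n_samples)]
    # ...and refit the model on its own output.
    mu = statistics.fmean(data)
    sigma = statistics.stdev(data)
    if generation % 200 == 0:
        print(f"generation {generation:4d}: sigma = {sigma:.4f}")

print(f"final sigma = {sigma:.6f}")
```

Nothing in the loop ever adds fresh variety, so each refit can only lose information about the tails of the original distribution; over many generations the model converges toward a narrow caricature of the data it started with.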

Implications for Software Development

OK, so let's take this thinking and apply it to software development.

(Disclaimer: I'm a coder, I work with teams who write code, and developers everywhere are using AI to help with their work.)

Before AI, developers relied on help from the wider developer community in order to figure out how to integrate new tools and features.

E.g. a client wants PayPal integrated into their website.

The developer might read the official documentation from PayPal, but they are probably using a set of tools and libraries which have their own way of working.

Developers would traditionally visit a site like StackOverflow or read some blog tutorial from another developer who has managed to solve common issues related to this requirement.

These days, I find I'm asking fewer questions on developer forums (and posting fewer answers), because it is easier to use ChatGPT.

I'm not sure if this is a false correlation or not, but I'm also finding that THERE ARE FEWER TUTORIALS OUT THERE ON HOW TO INTEGRATE NEW VERSIONS OF SOFTWARE TOOLS, e.g. a new SDK offered by PayPal.

I'm also finding that ChatGPT will often give outdated information based on older versions of libraries.
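One cheap defence against this is to check which library version a generated snippet appears to target before pasting it in. The sketch below is my own suggestion, not an official tool; the function names and the "assumed version" parameter are hypothetical, and in practice you would read the targeted version off the snippet's imports or changelog notes yourself.

```python
from importlib import metadata

def majors_match(installed: str, assumed: str) -> bool:
    """Compare only the major version numbers, e.g. '2.3.1' vs '2.0'."""
    return installed.split(".")[0] == assumed.split(".")[0]

def check_snippet_target(package: str, snippet_assumes: str) -> bool:
    """Warn when the installed package and the version an AI-generated
    snippet seems to be written for disagree on major version."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        print(f"{package} is not installed at all")
        return False
    if not majors_match(installed, snippet_assumes):
        print(f"warning: snippet targets {package} {snippet_assumes}, "
              f"but {installed} is installed")
        return False
    return True
```

A check this crude won't catch deprecated functions within a major version, but it flags the most common failure: a snippet written for an SDK one major release behind the one in your project.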

Where is the issue?

A lot of developers using ChatGPT to help write code will just copy and paste AI-generated snippets, especially as pressure mounts from business teams eager to speed up workflows and increase profits: "Just hurry up, use ChatGPT, we need this by this evening."

As a result of this pressure, developers might revert to using older versions of code without recent security updates (because that is all ChatGPT is producing).

As a result of having less time, developers might not write the traditional blog posts or StackOverflow questions, which results in (as with the cats) less new content for AI to learn from.

Conclusion

AI being trained on AI-generated images of cats might not be a problem, but AI being trained on AI-generated code snippets, with no new training data out there, will lead to a world where software projects lag severely behind up-to-date code libraries, and products become increasingly outdated.