
Introduction

Generative AI is extremely easy to use, and a growing portion of the content published today is written largely by ChatGPT and similar tools.

If you’ve used any generative AI chatbot, you’ll agree: it can do most of the heavy lifting quickly and fairly accurately. However, this creates several problems, for the next generation of writers, for the next updates of major LLMs, and beyond.

Chatbots produce average-quality content with false information

Will anyone know how to write without the assistance of chatbots? A deeper problem is that ChatGPT and all other LLMs do not check their sources or the accuracy of their outputs. As AI-generated content invades the web, so does the inaccurate information it contains. The final domino effect is that the next generation of LLMs is trained on a mix of human data and synthetic data full of low-quality content and inaccurate facts, commonly referred to as “hallucinations”.

LLMs need to scrape the Internet to train new models and updates

Every major AI company, including OpenAI (ChatGPT), Anthropic, and even Google, needs to scrape the Internet to train its new models. This is becoming extremely problematic for a few reasons. The data available on the Internet is increasingly flooded with AI-generated content. Furthermore, most of the platforms where humans interact are now blocking any type of scraping, or charging staggering amounts of money for access to their APIs.
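To give a concrete sense of how that blocking works, here is a minimal Python sketch using the standard library’s robots.txt parser. The crawler names (GPTBot for OpenAI, CCBot for Common Crawl) are real published user agents that many sites now disallow; the domain is just a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Crawlers that gather AI training data identify themselves with user
# agents such as "GPTBot" (OpenAI) or "CCBot" (Common Crawl). A
# well-behaved scraper checks a site's robots.txt before fetching.
robots = RobotFileParser("https://example.com/robots.txt")  # placeholder domain
robots.read()

for agent in ("GPTBot", "CCBot", "Googlebot"):
    allowed = robots.can_fetch(agent, "https://example.com/some-article")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Many large platforms now list AI crawlers explicitly in their robots.txt, which is exactly what makes fresh human-written training data harder to collect.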

Are AI chatbots stuck in 2021? 

ChatGPT is a victim of its own success. Now that the tool can write exceptionally well, its outputs have flooded the Internet. To update itself on new content, it may have to ingest its own synthetic content in the process, making its model less likely to mimic the way humans write. This is why many news outlets have reported that GPT-4’s outputs are getting more and more predictable… and worse.

Bottom Line

Due to their nature, Large Language Models need huge amounts of quality data for their training. The quality and accuracy of these LLMs depends on the data they were trained on in the first place. By feeding the next updates and generations of AI chatbots a mix of human and synthetic content, we expect the quality of LLM outputs to decline until a solution is found. AI detectors, like Winston AI, are here to help organizations distinguish human-made content from AI-generated writing.

FAQ

What is generative AI and how is it invading the web?

Generative AI refers to artificial intelligence systems like ChatGPT or Midjourney (for images) that can generate new content. The problem is that as generative AI becomes more widely used, more and more content online is generated by AI rather than humans. This can “poison” the web with low-quality, inaccurate, or biased information, since AI does not fact-check or evaluate the truthfulness of the content it creates.

How can you tell if content was written by an AI?

You can use an AI content detector like Winston AI. There are also a few signs that content might be AI-generated:
– It has a generic, formulaic style without much originality. 
– It lacks the nuance, complexity, and variability characteristic of human writing.  
– It seems to summarize information from other sources rather than providing original analysis.
– The writing level is too consistent, without the natural variation of human authors (a crude way to measure this is sketched below).
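As a toy illustration of that last point, the sketch below measures the variability of sentence lengths, sometimes called “burstiness”. Human writing tends to mix short and long sentences, while machine text is often more uniform. This is only a crude heuristic for demonstration, not how a real detector like Winston AI works.

```python
import re
import statistics

def burstiness(text: str) -> float:
    """Standard deviation of sentence lengths, in words.

    A higher value means more variation between sentences; a very low
    value can be one (weak) signal of machine-generated text.
    """
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return statistics.stdev(lengths)

sample = ("Short sentence. Then a considerably longer sentence that wanders "
          "through several clauses before finally coming to rest. Tiny. "
          "And another mid-length line to round things out.")
print(f"burstiness: {burstiness(sample):.2f}")
```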

Why is AI-generated content a problem for training new AI models?

AI models are trained on large datasets of text, images, and so on. If much of the data used for training is actually low-quality synthetic content from other AIs, it pollutes the training data. This can propagate biases, false information, and the generic style of AI writing. New AI models will mimic these flaws rather than learning to generate high-quality, human-like content.

How can the AI community combat the poisoning of training data?

Some solutions include:
– Using an AI detector like Winston AI to filter out synthetic content (see the sketch after this list).
– Relying more on high-quality datasets verified to be human-created.
– Training models to identify and avoid regurgitating false information.
– Regularly testing models to spot any biases or inaccuracies creeping in.
– Using techniques like reinforcement learning to encourage more creative, variable output.
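As a rough sketch of the first two ideas, the snippet below filters a training corpus using a detector that scores how likely each document is to be AI-generated. The detect_ai_probability function here is a hypothetical placeholder standing in for any detector, not a real Winston AI API.

```python
from typing import Callable, Iterable

def filter_training_corpus(
    docs: Iterable[str],
    detect_ai_probability: Callable[[str], float],  # hypothetical detector
    threshold: float = 0.5,
) -> list[str]:
    """Keep only documents the detector scores as likely human-written.

    `detect_ai_probability` stands in for any AI-content detector that
    returns a probability in [0, 1]; it is a placeholder, not a real API.
    """
    return [doc for doc in docs if detect_ai_probability(doc) < threshold]

# Toy usage with a dummy detector that scores everything as human-written.
clean = filter_training_corpus(["doc one", "doc two"], lambda d: 0.1)
print(len(clean))  # 2
```

In practice the threshold and the detector’s reliability determine the trade-off between losing genuine human text and letting synthetic text through.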

Will AI quality get worse over time due to the synthetic training data?

It’s a real risk if steps aren’t taken to verify training data and correct for flaws creeping in. But with responsible data practices and innovations that make AI more robust, there are also opportunities to improve the quality, accuracy, and trustworthiness of AI systems over time. The AI community is taking this problem seriously.

Thierry Lavergne

Co-Founder and Chief Technology Officer of Winston AI. With a career spanning over 15 years in software development, I specialize in Artificial Intelligence and deep learning. At Winston AI, I lead the technological vision, focusing on developing innovative AI detection solutions. My prior experience includes building software solutions for businesses of all sizes, and I am passionate about pushing the boundaries of AI technology. I love to write about everything related to AI and technology.