
In our ongoing mission to advance AI-generated content detection, we’re excited to introduce Model 4.0, codenamed “Curia”. This release marks a significant leap forward in our commitment to transparency, precision, and continuous improvement in identifying both human-written and AI-generated texts.


Introduction

The pace at which AI-generated content is evolving is unprecedented. With rapid advancements in generative models, the challenge of accurately detecting and differentiating between human-written and AI-generated texts has grown just as fast. In this landscape, robust and transparent detection mechanisms are essential.

Today, we proudly unveil Model 4.0 (“Curia”), built on the foundation of our previous successes and designed with enhanced accuracy and transparency. In this post, we outline our methodology, present detailed performance metrics, and reinforce our commitment to openness in AI content detection. Notably, while v4.0 shows slightly lower AI detection accuracy than its immediate predecessor, it delivers more balanced classification performance and a significantly improved R² on regression tasks.


Commitment to Transparency in AI Detection

Full Disclosure

At the heart of our development process is a commitment to complete transparency. We openly share our accuracy rates, testing methodologies, and the intricacies of our datasets to set a new industry standard. With every release, our goal is to provide clear, data-backed insights into our model’s performance.

Dataset Overview

Key dataset details include:

  • Total Samples: 10,000
  • Language: English
  • Generation Date: 2025-02-05 11:23:26

This diverse and meticulously vetted dataset forms the backbone of our rigorous evaluation process.


Materials and Methodology

Data Collection

Our dataset comprises a wide range of human-written texts gathered from reputable sources, ensuring a rich and varied linguistic base. Each sample was selected to cover diverse writing styles and contexts, which is essential for robust detection.

AI-Generated Content and LLM Testing

For generating AI texts, we employed advanced generative models to create samples that closely mimic real-world AI outputs. Importantly, Model 4.0 (“Curia”) was both trained on and tested using outputs from a variety of leading large language models (LLMs), including:

  • Claude 1
  • Claude 2
  • Claude 3 Opus
  • Claude 3.5 Sonnet
  • GPT-3.5 Turbo
  • GPT-4
  • GPT-4o
  • GPT-4o mini
  • Mistral Nemo
  • Gemini 1.5 Flash
  • Gemini 1.5 Pro
  • Llama 3.2B

This comprehensive approach ensures that our detection capabilities are robust and applicable across a diverse spectrum of AI-generated content.

Data Validation

To maintain the integrity of our evaluation, we rigorously validated the dataset through:

  • Exclusion of Training Data: Ensuring that none of the testing samples were part of the training phase (a sketch of one such check follows this list).
  • Quality Assurance: Combining manual and automated checks to verify the authenticity and consistency of each sample.
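
As an illustration of the first check, exact-duplicate exclusion can be enforced with a simple fingerprinting pass. The helper below is a minimal, hypothetical sketch (the function names are illustrative, not our production pipeline): it hashes normalized text so that any evaluation sample also present in the training corpus is flagged before scoring.

```python
import hashlib

def text_fingerprint(text: str) -> str:
    """Hash whitespace- and case-normalized text so trivial variants still match."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def assert_no_overlap(train_texts: list[str], test_texts: list[str]) -> None:
    """Raise if any evaluation sample also appears in the training corpus."""
    train_hashes = {text_fingerprint(t) for t in train_texts}
    leaked = sum(1 for t in test_texts if text_fingerprint(t) in train_hashes)
    if leaked:
        raise ValueError(f"{leaked} evaluation samples overlap with training data")
```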

Evaluation Metrics

We evaluated Model 4.0 (“Curia”) using a comprehensive set of metrics that assess both classification and regression performance.

Classification Metrics

These metrics help us determine how well the model categorizes texts into discrete classes (e.g., AI-generated vs. human-written). The key classification metrics include:

  • Accuracy
  • Precision
  • Recall
  • F1 Score
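
For readers who want these definitions in executable form, the snippet below computes all four metrics with scikit-learn on a toy set of binary labels (1 = AI-generated, 0 = human-written); the labels are purely illustrative and unrelated to our evaluation data.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy labels: 1 = AI-generated, 0 = human-written (illustrative only).
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.4f}")
print(f"Precision: {precision_score(y_true, y_pred):.4f}")  # of texts flagged as AI, the share that truly are
print(f"Recall:    {recall_score(y_true, y_pred):.4f}")     # of truly AI texts, the share that were flagged
print(f"F1 Score:  {f1_score(y_true, y_pred):.4f}")         # harmonic mean of precision and recall
```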

Regression Metrics

In addition to classification, our evaluation includes regression analysis. In our application, regression is used to estimate the quantity of AI text present within a given document: the model predicts a continuous numerical score reflecting the proportion of AI-generated content, rather than merely classifying a text as AI- or human-generated.

To measure the performance of these continuous predictions, we use the following regression metrics:

  • Accuracy (within a defined error margin of 0.1)
  • Mean Absolute Error (MAE)
  • Mean Squared Error (MSE)
  • Root Mean Squared Error (RMSE)
  • R-squared (R²)

The 0.1 error margin defines the acceptable deviation: a prediction counts as accurate when it falls within 0.1 of the true proportion of AI-generated content.
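
To make these definitions concrete, here is a minimal sketch computing all five regression metrics on illustrative proportions of AI text (values in [0, 1], not our evaluation data); the within-margin accuracy simply counts predictions that land within 0.1 of the true value.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative true vs. predicted proportions of AI text.
y_true = np.array([0.00, 0.25, 0.50, 0.80, 1.00])
y_pred = np.array([0.02, 0.30, 0.47, 0.85, 0.97])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)
within_margin = np.mean(np.abs(y_true - y_pred) <= 0.1)  # accuracy within the 0.1 margin

print(f"Accuracy (±0.1): {within_margin:.4f}")
print(f"MAE: {mae:.4f}  MSE: {mse:.4f}  RMSE: {rmse:.4f}  R²: {r2:.4f}")
```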


Results and Analysis

Overall Performance

Model 4.0 (“Curia”) demonstrates exceptional performance across both classification and regression tasks:

Metric                             Value
Classification Overall Accuracy    99.95%
R-squared (R²)                     99.08%

Detailed Metrics

Regression Metrics

Metric                            Value
R-squared (R²)                    0.9908
Mean Absolute Error (MAE)         0.0120
Mean Squared Error (MSE)          0.0006
Root Mean Squared Error (RMSE)    0.0241

Classification Metrics

Metric                      Value
Overall Precision           0.9993
Overall Recall              0.9998
Overall F1 Score            0.9995
AI Detection Accuracy       0.999263
Human Detection Accuracy    0.9997

Enhanced Prediction Mapping

In response to customer feedback, we have refined our prediction mapping system. Per-sentence predictions, presented through our new color-coding scheme, now align much more closely with the global score. This improvement resolves previous discrepancies and ensures that per-sentence predictions accurately reflect the overall assessment of the quantity of AI-generated text, a key concern raised by some customers in the past.
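
As a rough illustration of how per-sentence and global views can be kept consistent (the function names and thresholds below are hypothetical, not our implementation), per-sentence scores can be bucketed into display colors while the global score is derived as a length-weighted average of those same scores, so the two views agree by construction.

```python
def sentence_color(score: float) -> str:
    """Map a per-sentence AI score in [0, 1] to a display color (hypothetical thresholds)."""
    if score >= 0.75:
        return "red"     # likely AI-generated
    if score >= 0.40:
        return "yellow"  # uncertain
    return "green"       # likely human-written

def global_score(sentences: list[str], scores: list[float]) -> float:
    """Length-weighted average of per-sentence scores, so longer sentences
    contribute proportionally more to the document-level assessment."""
    weights = [len(s) for s in sentences]
    return sum(w * sc for w, sc in zip(weights, scores)) / sum(weights)
```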


Version Comparison

Our journey of continuous improvement can be clearly seen when comparing Model 4.0 (“Curia”) with its predecessors. Below is a summary table highlighting the classification performance of our recent versions:

Version        AI Accuracy    Human Accuracy    Overall Score
2.0            99.6%          98.4%             99.0%
3.0 “Luka”     99.98%         99.5%             99.74%
4.0 “Curia”    99.92%         99.97%            99.95%

While v4.0 (“Curia”) shows slightly lower AI accuracy than v3.0 “Luka” (99.92% vs. 99.98%), it compensates with significantly higher human text detection accuracy (99.97% vs. 99.5%) and a more balanced overall score (99.95% vs. 99.74%). Moreover, Curia introduces a major leap in regression performance with an R² of 0.9908, enabling it to accurately quantify the amount of AI text within a given document. This balanced performance across multiple metrics marks a key advancement over previous iterations.


Conclusion

Model 4.0 (“Curia”) represents our most advanced effort to date in AI content detection. With its high classification accuracy, robust regression performance in quantifying AI text, and refined prediction mapping, Curia sets a new benchmark for the industry. We remain dedicated to continuous improvement and transparency in our technological endeavors.

Future Outlook

Looking ahead, our focus will be on:

  • Further Enhancements: Continuously refining detection capabilities.
  • Expanding Datasets: Integrating even more diverse and challenging texts.
  • Community Engagement: Incorporating community feedback and maintaining transparency to drive future innovations.

FAQ

Q: What is Model 4.0 (“Curia”)?
A: Curia is our latest AI detection model, designed to accurately distinguish between AI-generated and human-written texts with unprecedented precision.

Q: How was the dataset for testing curated?
A: The dataset, comprising 10,000 samples, includes both human-written and AI-generated texts. It has been carefully vetted and excludes any training data used during model development.

Q: Which LLMs were involved in training and testing?
A: Our model has been trained on and tested using outputs from a wide range of LLMs, including Claude 1, Claude 2, Claude 3 Opus, Claude 3.5 Sonnet, GPT-3.5 Turbo, GPT-4, GPT-4o, GPT-4o mini, Mistral Nemo, Gemini 1.5 Flash, Gemini 1.5 Pro, and Llama 3.2B.

Q: What do the regression metrics indicate, and what is regression in this context?
A: Regression is a statistical method used to predict continuous numerical values. In our application, regression is specifically employed to estimate the quantity of AI text within a given document. The regression metrics—Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R²)—measure how accurately our model predicts this quantity. The improved R² value of 0.9908 indicates that our predictions closely match the actual proportion of AI-generated content.

Q: How does Curia compare to previous models?
A: Compared to earlier versions, Curia exhibits slightly lower AI accuracy than v3.0 “Luka” but achieves more balanced classification performance, with significantly higher human text detection accuracy and overall score. Additionally, its enhanced regression capabilities for quantifying AI content make it a robust and reliable tool for content detection.

Q: What future developments can we expect?
A: We are committed to continuous innovation. Future updates will focus on further fine-tuning detection capabilities, expanding our datasets, and incorporating user feedback to drive improvements.

Thierry Lavergne

Co-Founder and Chief Technology Officer of Winston AI. With a career spanning over 15 years in software development, I specialize in Artificial Intelligence and deep learning. At Winston AI, I lead the technological vision, focusing on developing innovative AI detection solutions. My prior experience includes building software solutions for businesses of all sizes, and I am passionate about pushing the boundaries of AI technology. I love to write about everything related to AI and technology.