Testing AI Detection Tools – Our Methodology

Jonatan Parski CEO


As a content creation agency, we took an immediate interest when AI-detection tools came onto the scene. However, we saw mixed results when using these tools. Rather than risk falsely accusing our writers of delivering AI content, we decided to look more closely at the tools to determine their accuracy.

While there are many best-of lists for these tools, very few of them use independent or transparent testing to evaluate the true efficacy of each. As such, we set out to create our own universal and functional methodology to test AI detectors. We brainstormed the best ways to test these tools and came up with the following methodology.

Why Is It Essential To Test AI Detectors Before Use?

If you’re considering using an AI detector for your business, you want to ensure that you’re using the best one for your needs, right? That’s the main reason why it’s important to test AI detectors before you start using them. 

At the end of the day, you don’t want to falsely accuse someone of delivering AI content because the detector you used was unreliable. With so many different tools all saying they’re the best and most accurate, it can become overwhelming to know how to test these tools and discover the truth. 

Testing AI detectors is essential because there are so many different training models that developers can use when building their AI detection tools. Each different model has its own pros and cons. Not only that, but they also use different training datasets, and how they evaluate this information before testing can also differ greatly.

AI detection developers use different approaches when creating their tools. The approach used defines the model and training of the AI detection tool and how it works. It’s important to understand these differences as they can affect the efficacy, value, and even cost of the final tool. The most popular approaches are the feature-based, zero-shot, and fine-tuning approaches.

Despite the abundance of choices when it comes to AI detectors, knowing which ones are the best to use can be difficult. Aside from the above-mentioned differences in training models and approaches, a number of other factors affect whether or not you choose a particular AI detector. These factors include:

  • Performance: How well does the tool perform? Is it fast? Are the results given in real time? Performance is crucial to how users experience the tool.
  • Accuracy: How accurate is the tool? This is probably one of the most important aspects to consider when it comes to AI detection tools. 
  • Reliability: Is the tool reliable not only in the results it provides, but in performance as well? Having a tool that you can rely on is essential when using a tool for business purposes.
  • Robustness: Can the tool be used for a wide range of content types? Does it offer additional features that add to its usefulness? Getting a tool that offers multiple applications and features gives users more options.
  • Financial considerations: Is the tool priced competitively? Does the subscription fee include any beneficial extras, such as additional features? All businesses have budgets to be mindful of, and the tools you pay for need to justify that expense.

What Needs To Be Considered When Testing AI Detection Tools?

No test is complete if you don’t know what needs to be evaluated. Understanding the aspects that play a role in the effectiveness of a particular tool is essential during testing. Here are a few of the aspects we considered during our test which can be replicated in future evaluations.

Accuracy and Precision

Evaluate how accurately the AI detector identifies and classifies different instances of content. Measure precision, recall, and F1 score to understand the trade-offs between false positives and false negatives. This provides a more accurate overview of the tool’s true potential.

Training Data Quality

Examine the quality and representativeness of the training data. A diverse and comprehensive dataset is crucial for the AI model to generalize well to various scenarios. This can be difficult to evaluate as not all tools are upfront or transparent regarding the training dataset used.

Adversarial Testing

Assess the AI tool’s robustness against adversarial attacks. Test how well it performs when presented with intentionally modified input designed to deceive the model. In practice, this means AI-generated content that has been edited by a human or run through an AI paraphrasing tool.

Generalization Across Domains

Check whether the AI model generalizes well across different environments, conditions, or contexts. A model trained on specific data should still perform effectively in real-world scenarios.

Real-time Performance

Evaluate the detection tool’s speed and efficiency, especially in real-time applications. High latency and slow processing can be critical problems for certain use cases.
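
One way to put numbers on speed is to time each submission. The following is a minimal sketch, not the procedure used in this test: `measure_latency` and `fake_detector` are hypothetical names, and the placeholder detector stands in for whatever call or upload step a real tool requires.

```python
import statistics
import time
from typing import Callable, List


def measure_latency(detect: Callable[[str], object], texts: List[str]) -> dict:
    """Time one detector call per text and summarize the results in seconds."""
    timings = []
    for text in texts:
        start = time.perf_counter()
        detect(text)  # the verdict is ignored here; only elapsed time matters
        timings.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(timings),
        "max_s": max(timings),
        "runs": len(timings),
    }


def fake_detector(text: str) -> bool:
    # Hypothetical placeholder: a real test would call the tool being evaluated.
    return len(text) % 2 == 0


summary = measure_latency(fake_detector, ["first sample", "second sample", "third"])
print(summary)
```

Reporting the slowest run alongside the mean helps surface the occasional long wait that an average would hide.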

False Positives and Negatives

Understand the consequences of false positives and false negatives in the specific application. In some cases, minimizing false positives might be a priority for developers, while in others, reducing false negatives may be more critical.


Interpretability

Assess how easily the AI detection tool’s outputs can be interpreted and understood. Transparent models are essential for applications where decision-making processes need to be explained.

Handling Edge Cases

Test how well the AI detection tool performs in edge cases or scenarios that may not have been adequately represented in the training data. This helps identify potential limitations and biases. Examples include highly complex topics or content with a high level of technical information.

Monitoring and Feedback Mechanisms

Implement monitoring systems to continuously assess the performance of the AI detection tool in real-world conditions. Does the tool incorporate feedback loops to improve the model over time? This can be another difficult aspect to evaluate as some tools aren’t transparent regarding how frequently the tool is updated, or what happens to user feedback.


Scalability

Evaluate the scalability of the AI detection tool to handle increasing amounts of data. Ensure that the model’s performance remains consistent as the workload grows. Being able to process bulk orders is essential in certain use cases.

Security and Privacy

Examine the security measures in place to protect the AI model from attacks or misuse. Assess the tool’s compliance with privacy regulations and its handling of sensitive information. Some tools take greater steps to uphold security and privacy regulations, whereas others, especially some free tools, are more lax. 

Integration with Existing Systems

Test the ease of integration with other software and hardware components. This may be a requirement for certain use cases and might not be as essential for one-time users. 

User Interface and Experience

Assess the user interface and experience of the AI detection tool. A user-friendly interface can enhance the tool’s usability and adoption. Users are often willing to overlook minor shortcomings in a new tool if it’s easy to use.

Documentation and Support

Check the availability and comprehensiveness of documentation. Ensure there is adequate support for users, including troubleshooting guides and customer support. Keep in mind that you will most likely only experience a tool’s customer support fully in the paid version. Free tools might have very limited, or no, support available.

Our Test Methodology in Detail

Now that we have a better understanding of how these models work and what to look for when evaluating them, let’s take a look at how we went about testing these tools to measure their efficacy.

Datasets Used in Testing

For this test, we set up three different types of content. These sets differed in their complexity, ranging from a general topic, to more technical, to advanced. The goal of having these different types of content was to measure how well the AI tools could distinguish AI content from human-written content, even in more technical topics and writing styles. Each group of articles was then further subdivided into different sections.

AI-Generated Articles

First, we created a set of completely AI-generated articles. These articles were created with ChatGPT, using various prompts to cover the same topic. Once generated, they weren’t edited or altered in any way, not even to verify or check facts. These articles had to be kept 100% AI-generated for the test results to be accurate.

AI-Generated and Human Edited

The next batch of articles consisted of AI-generated but human-edited articles. For this batch, we didn’t use AI paraphrasing tools. Neither did we put the text through an AI detector and then make changes based on the results. We specifically wanted to see if the AI detector would be able to accurately identify that AI was used even though a real person edited the content.

Human Written

Finally, our last batch consisted of human-written articles. For this, we used articles written by some of our very best writers. This ensured we had articles with different writing styles and tones included in our testing data.

It’s important to note that the human batch included twice as many articles as each of the AI batches. The reason comes down to simple mathematics. Articles in the first two batches, AI and AI edited by a human, should be flagged as AI, so each correct result is a true positive. The human-written articles should not be flagged, so each correct result is a true negative. To keep the results fair, we included double the number of human-written articles so that the test set contained as many expected negatives as expected positives.
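
The arithmetic behind this balancing can be sketched as follows. The batch size of 10 is purely illustrative and is not the number of articles used in the test, and both function names are hypothetical.

```python
def human_articles_needed(ai_only: int, ai_human_edited: int) -> int:
    """Both AI batches are expected positives, so match their combined size."""
    return ai_only + ai_human_edited


def accuracy_if_everything_flagged_ai(positives: int, negatives: int) -> float:
    """Accuracy of a useless detector that labels every article as AI."""
    return positives / (positives + negatives)


batch_size = 10  # illustrative only
positives = batch_size + batch_size  # AI batch + AI-edited batch
negatives = human_articles_needed(batch_size, batch_size)

# Three equal batches: flagging everything as AI is "right" two times in three.
unbalanced = accuracy_if_everything_flagged_ai(positives, batch_size)
# Doubled human batch: the same lazy strategy drops to a coin flip.
balanced = accuracy_if_everything_flagged_ai(positives, negatives)

print(negatives, round(unbalanced, 3), balanced)
```

With three equal batches, a tool that flags everything as AI would look about 67% accurate. With the doubled human batch it scores 50%, which is why the balanced set gives a fairer picture.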

Deep Analysis and Full Test

First, we needed to create a spreadsheet where we could systematically record our findings. This sheet formed the basis of our test, as it included the formulas to calculate accuracy, precision, recall, and the F1 score. These figures are what reveal the true accuracy of each individual tool. They also gave us metrics that could be used to measure the tools against competitors in order to determine which ones are the best.

Once our spreadsheet was set up, we were ready to start the actual testing. We opened each AI detector in our list. We started with the top 15 based on the top results in search engines. These 15 don’t include all of the AI tools currently in existence, but we wanted to start off with those that are currently considered the best, and continue to expand our test to lesser-known tools in the future. This would enable us to give an up-to-date perspective on these tools and how they’re improving.

We would put our articles through the AI detector and record the results in our spreadsheet. Since the dataset we used was our own, we knew exactly what the results should be. This enabled us to easily identify and record any false results. 
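
For readers who prefer a script to a spreadsheet, the same record-keeping could look roughly like this. It is a sketch under assumptions: the column names, the batch labels, the article IDs, and "Tool A" are all hypothetical, and the predictions shown are made up for illustration.

```python
import csv
import io

FIELDS = ["tool", "article_id", "batch", "expected", "predicted", "correct"]


def make_row(tool: str, article_id: str, batch: str, predicted: str) -> dict:
    """Build one log row; 'batch' is 'ai', 'ai_edited', or 'human'."""
    # Both AI batches are expected to be flagged as AI; only the human batch is not.
    expected = "human" if batch == "human" else "ai"
    return {
        "tool": tool,
        "article_id": article_id,
        "batch": batch,
        "expected": expected,
        "predicted": predicted,
        "correct": expected == predicted,
    }


rows = [
    make_row("Tool A", "general-01", "ai", "ai"),
    make_row("Tool A", "general-02", "ai_edited", "human"),  # a miss (false negative)
    make_row("Tool A", "general-03", "human", "human"),
]

buffer = io.StringIO()  # swap for open("results.csv", "w", newline="") to write a file
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Because the expected label is derived from the batch, any row where `correct` is false is immediately visible as a false result.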

While testing and recording the actual results of each article we submitted, we also took note of other aspects of the tool being tested. This included the overall performance and experience of the tool, the robustness of the features, how easy it was to scale, and we also compared the free and paid versions where applicable. For tools that specified their training data sets and models, we took note of those as well. We also made note of tools that were difficult to use because they took a long time, or because they had numerous ads to navigate through which affected the user experience. 

By testing each tool with numerous articles, we didn’t only get a good idea of their true accuracy, but we also got to experience how the tool works. This enabled us to get a deeper understanding of each of the tools being tested, where they stood out, and even areas that could be improved.

Measuring Outcomes and Results

In order to gain concrete evidence on whether these tools are effective at determining if content was created using AI, we needed to use specific quantitative metrics. Let’s take a closer look at each of these metrics and what they demonstrated in the results. 

Each article put through a tool produced one of four possible outcomes:

  • True Positive (TP): This is the number of accurately identified instances of AI. The article was created using AI, and the tool correctly identified it as such.
  • True Negative (TN): This is the number of accurately identified instances of human-created content. The article was written by a human, and the tool correctly identified it as such.
  • False Positive (FP): This is the number of incorrectly identified instances of AI. The article was human written, but the tool identified it as AI.
  • False Negative (FN): This is the number of instances incorrectly identified as human-written. The article was created using AI, but the tool identified it as being written by a human.
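
The four outcomes above can be expressed as a small lookup and tallied per tool. This is a minimal sketch; the `outcome` function name and the sample results are illustrative assumptions, not data from the test.

```python
from collections import Counter


def outcome(is_ai: bool, flagged_ai: bool) -> str:
    """Map the known origin of an article and the tool's verdict to an outcome."""
    if is_ai and flagged_ai:
        return "TP"  # AI content, correctly flagged
    if not is_ai and not flagged_ai:
        return "TN"  # human content, correctly passed
    if not is_ai and flagged_ai:
        return "FP"  # human content, wrongly flagged as AI
    return "FN"      # AI content, wrongly passed as human


# (is_ai, flagged_ai) pairs; made-up values for illustration only
results = [(True, True), (True, False), (False, False), (False, True), (True, True)]
counts = Counter(outcome(is_ai, flagged) for is_ai, flagged in results)
print(dict(counts))
```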

Once all of the articles were classified, we could use the results to calculate the following metrics, which give us a deeper insight into how accurate and reliable each tool really is.

  • Accuracy: This is the percentage of predictions that were identified correctly. On its own this metric can be misleading, but when evaluated alongside the other metrics the real results become clear. This is why you should be skeptical of very high accuracy results promised by these tools. The formula used to calculate accuracy is: (TP + TN) / (TP + TN + FP + FN).
  • Precision: This calculates the ratio of true positives to the total number of articles the tool flagged as AI. The higher the precision, the fewer false positives were identified. The formula used to calculate precision is: TP / (TP + FP).
  • Recall: This measures the ratio between true positives and actual positives, meaning all articles that really were created using AI. A higher recall suggests fewer false negatives. The formula used to calculate recall is: TP / (TP + FN).
  • F1 Score: The F1 score is a more balanced assessment of the overall accuracy of the AI detector. It combines precision and recall into a single metric that can be used to rank all detectors. This metric is essential in a test like ours where we evaluate and compare many different AI detectors. The formula for calculating the F1 score is: 2 × (Precision × Recall) / (Precision + Recall).
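
The four formulas above translate directly into code. The sketch below uses illustrative counts rather than results from the test. Returning 0.0 when a denominator is zero is an assumption made here for convenience, not something the formulas themselves specify.

```python
def detector_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from the four outcome counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * (precision * recall) / (precision + recall)
    else:
        f1 = 0.0  # assumption: treat the undefined case as the worst score
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative counts for 20 AI articles and 20 human articles
metrics = detector_metrics(tp=18, tn=16, fp=4, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this example the tool is 85% accurate overall, but its precision of roughly 0.82 shows that about one in five AI flags landed on a human-written article, which is the kind of detail accuracy alone hides.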

With these metrics, we were able to get a much better overview of the true accuracy and reliability of each AI detector in our test. These metrics also made it possible for us to create a leaderboard with the best tools based on measurable results.
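
A leaderboard of this kind can be produced by sorting tools on their F1 score. The sketch below is one possible approach, not the exact ranking procedure used for the test; the tool names and counts are invented for illustration.

```python
from typing import Dict, List, Tuple


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; returns 0.0 when it would otherwise be undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)


def leaderboard(counts: Dict[str, Tuple[int, int, int]]) -> List[Tuple[str, float]]:
    """Rank tools from highest to lowest F1. Values are (TP, FP, FN) tuples."""
    scored = [(name, f1_from_counts(*c)) for name, c in counts.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical tools and made-up counts
illustrative = {
    "Tool A": (18, 4, 2),
    "Tool B": (12, 1, 8),
    "Tool C": (20, 10, 0),
}

for rank, (name, score) in enumerate(leaderboard(illustrative), start=1):
    print(rank, name, round(score, 3))
```

Note how the hypothetical Tool C catches every AI article yet still ranks below Tool A, because its ten false positives drag down its precision.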

Final Thoughts on AI Detection Tools

Understanding the effect that AI detection tools have on the content creation industry is essential for companies like ours. Not only is AI affecting our present, but these tools will continue to grow and improve, thereby affecting our future as well. We have first-hand experience of the effect that AI content creation has on the industry. As such, we took it upon ourselves to remain at the forefront of AI detection and of combating the inappropriate use of these generators.

Captain Words is dedicated to continuing to produce top-quality content at a fair price, but we’re also committed to ensuring our content is original and created by talented human writers. Take a look at our services to see what we can do for you.


Leri Koen

After spending several years in the fields of Education, Child Development, and Hospitality, Leri decided to embrace her passion for content. Today, she is helping businesses grow digitally through her skills as a content specialist.
