Why AI Detectors Keep Flagging Students Who Didn't Cheat

Orion Newby submitted an essay for a class at Adelphi University in November 2024. His professor ran it through Turnitin and got back what the lawsuit complaint describes as a 100% AI score. Newby has autism, gets grammar tutoring through Adelphi's Bridges program, and submitted independent checks from Grammarly and ZeroGPT that both called the essay human-written. The university's own Turnitin originality report showed only 4% overlap with existing sources. None of that mattered. The integrity office upheld the violation anyway, and Newby is now suing.

He's not alone. A University of Michigan student with OCD and anxiety disorder is suing over the same pattern. Two University at Buffalo students had their final papers flagged in May 2025, and about 20% of their classmates got flagged in the same batch. Louise Stivers, a UC Davis senior, found out a classmate named William Quarterman had already failed an exam because GPTZero flagged his answers, days before she went through the same process herself.

I write software for a living. If I shipped a fraud detection model with a documented false positive rate north of 4% and let it single-handedly end someone's job, I'd be fired and probably sued. Schools do this to teenagers every semester and call it academic integrity.

What these tools actually measure

AI detectors don't read for meaning. They run your text through a language model and score two things: perplexity and burstiness.

Perplexity measures how predictable each word is given the words before it. A detector feeds your text to a scoring language model and measures how surprised that model is by your word choices; some tools pair that with a trained classifier built on something like RoBERTa or DeBERTa. Large language models tend to pick statistically likely next tokens, so AI-generated text tends to score low. Human writing usually scores higher because people make weirder choices: idioms, personal phrasing, the occasional bad sentence.
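To make the definition concrete, here is a minimal sketch of perplexity using a toy bigram model with add-one smoothing. This is a stand-in I picked for illustration, not what any commercial detector runs; real tools score text with a neural language model. The corpus, the function names, and the two test sentences are all made up.

```python
import math
from collections import Counter

def train_bigram_counts(corpus_tokens):
    """Count unigrams and bigrams in a reference corpus (the toy 'language model')."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    return unigrams, bigrams

def perplexity(tokens, unigrams, bigrams):
    """exp of the average negative log-probability of each token given the previous one."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens")
    vocab_size = len(unigrams) + 1  # +1 reserves probability mass for unseen words
    log_prob_sum = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        # add-one (Laplace) smoothing so unseen bigrams get a small nonzero probability
        p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        log_prob_sum += math.log(p)
    return math.exp(-log_prob_sum / (len(tokens) - 1))

# Tiny made-up corpus, for illustration only
corpus = "the cat sat on the mat . the dog sat on the rug .".split()
unigrams, bigrams = train_bigram_counts(corpus)

predictable = "the cat sat on the rug".split()  # follows patterns the model has seen
surprising = "rug the on cat mat dog".split()   # same vocabulary, unfamiliar order

print(f"predictable: {perplexity(predictable, unigrams, bigrams):.1f}")
print(f"surprising:  {perplexity(surprising, unigrams, bigrams):.1f}")
```

The predictable sentence scores lower because every word pair in it resembles something the model already counted. Note what the number depends on: the reference model, not the author.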

Burstiness measures how much that perplexity score swings across a document. Human writers are inconsistent. A plain sentence, then a tangent, then something clunky, then a sharp one. AI models hold a steadier line because the same generation process runs at every position. Low burstiness plus low perplexity reads as a strong AI signal to most commercial detectors.
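Vendors don't publish their exact formula, so treat this as one plausible way to put a number on burstiness, not the definition: the coefficient of variation of per-sentence perplexity. The function name and both score lists below are invented for illustration.

```python
import statistics

def burstiness(sentence_perplexities):
    """Coefficient of variation: population standard deviation divided by the mean."""
    if len(sentence_perplexities) < 2:
        raise ValueError("need at least two sentences")
    mean = statistics.fmean(sentence_perplexities)
    return statistics.pstdev(sentence_perplexities) / mean

# Hypothetical per-sentence perplexity scores, made up for illustration
steady = [21.0, 22.5, 20.8, 21.7, 22.1]  # every sentence about equally predictable
swingy = [12.0, 48.0, 19.5, 75.0, 9.0]   # plain, then odd, then plain again

print(f"steady document: {burstiness(steady):.2f}")
print(f"swingy document: {burstiness(swingy):.2f}")
```

Under this sketch the steady document gets the lower score, which is the pattern a detector would read as machine-like. A careful human writer with a consistent style lands in the same place.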

Here's the part vendors don't put on the homepage: that signal also describes a lot of human writing. Simple sentence structure. Repetitive phrasing. Formal, consistent tone. That's exactly how ESL writers are taught to write, and it's exactly what happens after a writing-support tool cleans up a draft. A Stanford-linked study found detectors misclassified TOEFL essays from Chinese students at a mean rate of 61.3%, against 5.1% for US student essays in the same test. Newby's case fits the same shape: structured, tutor-assisted writing that reads as "too clean" to a classifier trained mostly on a different population.

The math nobody mentions in the disciplinary hearing

Even a detector with genuinely good numbers falls apart once you account for how rare AI cheating actually is in a given class.

# base rate problem: most flags are false alarms when the
# thing you're looking for is uncommon, even with a "good" detector

prevalence = 0.05          # share of essays actually AI-written
false_positive_rate = 0.01 # vendor's marketed FPR
true_positive_rate = 0.95  # vendor's marketed catch rate

true_positives = prevalence * true_positive_rate
false_positives = (1 - prevalence) * false_positive_rate

precision = true_positives / (true_positives + false_positives)
print(f"{precision:.0%} of flagged essays actually used AI")
# 83% of flagged essays actually used AI

Run the numbers and a tool with a marketed 1% false positive rate still gets it wrong on roughly 1 in 6 flags, from base rates alone. That's the best case. Turnitin's own published sentence-level false positive rate is 4%. A 650-word essay runs roughly 30 to 65 sentences depending on sentence length, so it should expect between one and three wrongly flagged sentences as a baseline, even when a human wrote every word. Apply a 4% rate to the roughly 2.2 million students Turnitin processes and it flags 88,000 real students a year. ZeroGPT, which plenty of teachers use because it's free, sits at a 16.2% false positive rate in independent 2026 benchmarks. One in six human essays gets flagged, full stop.
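If you want to check the arithmetic yourself, here is a rough sketch that reruns the same precision calculation across the three false positive rates quoted above. Two assumptions are mine, not any vendor's: I hold prevalence at 5% and catch rate at 95% for every row, and I treat each quoted rate as a per-essay rate, which is a simplification for the sentence-level 4% figure. The sentence-count helper is hypothetical too.

```python
def precision(prevalence, false_positive_rate, true_positive_rate):
    """Share of flagged essays that really were AI-written."""
    true_positives = prevalence * true_positive_rate
    false_positives = (1 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)

PREVALENCE = 0.05  # assumption: 5% of essays are AI-written
TPR = 0.95         # assumption: same catch rate for every row

for label, fpr in [("1%", 0.01), ("4%", 0.04), ("16.2%", 0.162)]:
    wrong = 1 - precision(PREVALENCE, fpr, TPR)
    print(f"FPR {label}: {wrong:.0%} of flags land on a human writer")

def expected_false_sentence_flags(words, words_per_sentence, sentence_fpr):
    """Expected number of human-written sentences flagged in one essay."""
    return words / words_per_sentence * sentence_fpr

# How many flagged sentences a 650-word human essay should expect at a
# 4% sentence-level rate depends on how long the sentences are.
for words_per_sentence in (10, 15, 20):
    flags = expected_false_sentence_flags(650, words_per_sentence, 0.04)
    print(f"{words_per_sentence} words/sentence: about {flags:.1f} false flags")

print(f"{2_200_000 * 0.04:,.0f} students flagged at 4% of 2.2 million")
```

The share of wrong flags climbs fast as the false positive rate rises, because almost all of the essays being scored are human to begin with.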

The bias compounds

False positive rates aren't evenly distributed. The Stanford-linked research on non-native English writers already covers one group. Add neurodivergent students, who tend toward repetitive phrasing or highly structured organization that classifiers associate with machine output, and you've stacked two protected categories on top of a tool nobody validated for either. Newby's autism. The Michigan student's OCD and anxiety. These aren't edge cases the vendors forgot to test for. They're the exact writing patterns that perplexity and burstiness scoring reads as machine output, by design.

Ramirez, a Cal State Monterey Bay professor who studies AI in K-12 settings, runs her own papers through detectors as a sanity check before disciplining a student. They flag her at roughly 98% almost every time. She's not using AI. She's just a clean, confident writer, which is apparently indistinguishable from a language model if you only look at perplexity.

What's actually working

UCLA and UC San Diego both turned off their AI detectors between 2024 and 2025, concluding the false positive rate created more academic integrity risk than it prevented. Vanderbilt did the same. These aren't small or under-resourced schools cutting corners. They ran the tool, looked at the wrongful-accusation rate, and pulled it.

Turnitin's own chief product officer has said publicly that a score should never be the sole basis for action against a student, that it should open a conversation instead of close one. GPTZero carries the same disclaimer. Neither company says that loudly in a sales call, but it's in writing, and it should be the standard every school actually follows instead of the one buried in a terms-of-service page.

The schools getting this right treat detector output the way a competent engineer treats a flaky test: a signal to investigate, never a verdict. They ask for drafts, revision history, the messy in-progress version a student actually has if they wrote the thing themselves. If you're a CS student, this is the one practical move worth making regardless of what your school does: keep your version history. Google Docs has it built in. If the assignment is code, your git log already is the proof. A clean save with no edit trail is the actual red flag, not your sentence structure.

My actual take

The ethical failure here was never "should students use AI." It's that schools took an unvalidated statistical classifier, never independently audited, trained on an unpublished and skewed sample, and handed it the authority to end someone's semester with no due process attached. A 4% to 16% false positive rate would get a fraud model pulled from production at any company I've worked with. In a classroom it gets a kid pulled into a disciplinary hearing with no advisor and no transcript of the evidence against them.

If you're building anything that scores human behavior and feeds a punitive decision, whether that's plagiarism, fraud, or content moderation, publish your false positive rate, publish what population you validated it on, and never let your own score be the only input to a consequence. Schools adopted these tools faster than anyone audited them. That's the actual scandal, not the existence of ChatGPT.
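"Publish what population you validated it on" can start as something this small: a false positive rate broken out per group, computed only over samples you know a human wrote. A minimal sketch follows. The record format, the group names, and every row of data are synthetic placeholders I invented, not measurements of any real detector.

```python
from collections import defaultdict

def false_positive_rate_by_group(records):
    """records: iterable of (group, flagged, actually_ai) tuples.

    Returns {group: share of human-written samples that were flagged}.
    """
    flagged_humans = defaultdict(int)
    total_humans = defaultdict(int)
    for group, flagged, actually_ai in records:
        if actually_ai:
            continue  # false positive rate only counts human-written samples
        total_humans[group] += 1
        if flagged:
            flagged_humans[group] += 1
    return {g: flagged_humans[g] / total_humans[g] for g in total_humans}

# Synthetic validation set: 10 human-written samples per group, plus one
# AI-written sample each that the audit should ignore.
records = (
    [("group_a", True, False)] * 1 + [("group_a", False, False)] * 9
    + [("group_b", True, False)] * 4 + [("group_b", False, False)] * 6
    + [("group_a", True, True), ("group_b", True, True)]
)

rates = false_positive_rate_by_group(records)
for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.0%} of human-written samples flagged")
```

A single headline rate would average these two groups together and hide the gap. The per-group table is the number a school should be asking for before it buys anything.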

Arbind Singh
