cs.CL cs.AI cs.LG

AI-text detection gets harder under human–AI co-editing — the "OpAI-Bench" progressive-editing benchmark

cs.CL ・Sondos Mahmoud Bsharat, Jiacheng Liu, Xiaohan Zhao, et al. (12) ・Jun 2026

As AI writing assistants spread, documents are increasingly the product of progressive human–AI co-editing rather than purely human-written or purely AI-generated text. OpAI-Bench studies human-to-AI transformation at document, sentence, token, and span granularities, and finds that mixed-authorship "intermediate" versions are often harder to detect than fully human or heavily AI-edited endpoints (a non-monotonic pattern).

Paper overview (our summary)

Field (arXiv category): cs.CL (+2)
Authors: Sondos Mahmoud Bsharat, Jiacheng Liu, Xiaohan Zhao, et al. (12)
Submitted: 2026-06-04
arXiv ID: 2606.06481v1

Key points

Premise: documents are products of progressive human–AI co-editing, not purely human or AI
Benchmark (OpAI-Bench) studies human-to-AI transformation at document/sentence/token/span granularities
Nine sequential revisions, five AI edit operations, four domains, with provenance preserved
Detectability depends on edit operation, domain, and revision history — not just AI proportion
Mixed-authorship intermediate versions are often harder to detect than either endpoint (non-monotonic)

This work (OpAI-Bench) reframes AI-text detection in the realistic context of co-editing.

As AI writing assistants become integrated into drafting and revision workflows, many documents are no longer purely human-written or AI-generated, but result from progressive human–AI co-editing. Yet existing AI-text detection benchmarks largely focus on final outputs and offer limited insight into how AI-authorship signals emerge, accumulate, or disappear throughout revision.

Starting from human-written documents, the authors construct nine sequentially revised versions for each sample under predefined AI coverage levels and five representative AI edit operations, covering four domains while preserving complete authorship provenance at the document, sentence, token, and span granularities. The benchmark supports comprehensive evaluation with 8 document-level detectors, 7 sentence-level detectors, and 2 fine-grained token/span-level detectors.
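To make the idea of provenance at several granularities concrete, here is a minimal sketch of how one revised version could be represented. The class and field names (`Token`, `Sentence`, `Revision`, `is_ai`, `operation`) are our own illustrative assumptions and are not the benchmark's actual data format. The sketch stores authorship per token and derives the sentence-, span-, and document-level views from it.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical schema for illustration only; these names are assumptions,
# not the actual OpAI-Bench data format.


@dataclass
class Token:
    text: str
    is_ai: bool  # True if this token was written or rewritten by the AI


@dataclass
class Sentence:
    tokens: List[Token]

    def ai_fraction(self) -> float:
        """Share of tokens in this sentence attributed to the AI."""
        if not self.tokens:
            return 0.0
        return sum(t.is_ai for t in self.tokens) / len(self.tokens)


@dataclass
class Revision:
    index: int        # position in the revision sequence
    operation: str    # label of the edit operation applied (hypothetical)
    sentences: List[Sentence]

    def _flags(self) -> List[bool]:
        """Flatten token-level authorship flags across all sentences."""
        return [t.is_ai for s in self.sentences for t in s.tokens]

    def ai_coverage(self) -> float:
        """Document-level share of AI-attributed tokens."""
        flags = self._flags()
        return sum(flags) / len(flags) if flags else 0.0

    def ai_spans(self) -> List[Tuple[int, int]]:
        """Contiguous runs of AI tokens as (start, end) offsets, end exclusive."""
        spans, start = [], None
        for i, flag in enumerate(self._flags()):
            if flag and start is None:
                start = i
            elif not flag and start is not None:
                spans.append((start, i))
                start = None
        if start is not None:
            spans.append((start, len(self._flags())))
        return spans
```

Keeping the token level as the single source of truth means the coarser labels can never disagree with it, which is one plausible way to keep provenance consistent across granularities.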

Experiments reveal that detectability is governed not only by the proportion of AI-edited content, but also by the edit operation, the domain, and the cumulative revision history. Notably, mixed-authorship intermediate versions are often harder to detect than both fully human and heavily AI-edited endpoints — exposing non-monotonic detection patterns missed by prior benchmarks. OpAI-Bench provides a controlled testbed for analyzing whether, when, and how AI-assisted writing becomes detectable under realistic progressive editing.
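The non-monotonic pattern can be checked mechanically once a detector has a score for each revision stage. The sketch below assumes one score per stage (for example, accuracy), ordered from the fully human version to the most heavily AI-edited one. The function names are ours, and the numbers in the usage lines are made-up placeholders, not results from the paper.

```python
from typing import List, Sequence


def is_monotonic(values: Sequence[float]) -> bool:
    """True if the values never decrease or never increase."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


def hardest_stages(scores: Sequence[float]) -> List[int]:
    """Indices of the stages with the lowest detection score."""
    lowest = min(scores)
    return [i for i, s in enumerate(scores) if s == lowest]


def has_interior_dip(scores: Sequence[float]) -> bool:
    """True if some intermediate stage scores below both endpoints."""
    if len(scores) < 3:
        return False
    return min(scores[1:-1]) < min(scores[0], scores[-1])


# Placeholder scores for illustration only (NOT values from the paper).
made_up_scores = [0.95, 0.80, 0.62, 0.70, 0.91]
print(is_monotonic(made_up_scores))      # False
print(has_interior_dip(made_up_scores))  # True
print(hardest_stages(made_up_scores))    # [2]
```

A benchmark that evaluates only the two endpoints would see the first and last values and miss the dip in between, which is the gap the progressive-editing setup is meant to expose.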

Why it matters

This paper is relevant to AI-text detection, provenance, and content authenticity. Its finding that detection difficulty does not rise or fall steadily with the amount of AI editing, and often peaks at mixed-authorship intermediate versions, is useful for anyone deploying AI detection in education, publishing, or platforms.

FAQ

Why are intermediate versions harder to detect?

When human and AI edits mix, the distinctive signals of purely human or AI text weaken, leaving detectors fewer cues. The paper demonstrates this non-monotonic effect empirically.

What is it useful for?

It clarifies the limits and scope of AI-text detectors — helping education, publishing, and platforms avoid over-trusting detection under realistic co-editing.

Sources (primary)

Source: arXiv (descriptive metadata is CC0 public domain). Summaries are our own; see arXiv for the original text and PDF.

arXiv abstract page (original, official)
PDF (arXiv)
arXiv ID: 2606.06481
