AI-text detection gets harder under human–AI co-editing — the "OpAI-Bench" progressive-editing benchmark
As AI writing assistants spread, documents are increasingly the product of progressive human–AI co-editing rather than purely human or purely AI writing. OpAI-Bench studies human-to-AI transformation at document, sentence, token, and span granularities, and finds that mixed-authorship "intermediate" versions are often harder to detect than fully human or heavily AI-edited endpoints (a non-monotonic pattern).
Paper overview (our summary)
- Field (arXiv category): cs.CL (+2)
- Authors: Sondos Mahmoud Bsharat, Jiacheng Liu, Xiaohan Zhao, et al. (12)
- Submitted: 2026-06-04
- arXiv ID: 2606.06481v1
Key points
- Premise: documents are products of progressive human–AI co-editing, not purely human or AI
- Benchmark (OpAI-Bench) studies human-to-AI transformation at document/sentence/token/span granularities
- Nine sequential revisions, five AI edit operations, four domains, with provenance preserved
- Detectability depends on edit operation, domain, and revision history — not just AI proportion
- Mixed-authorship intermediate versions are often harder to detect than either endpoint (a non-monotonic pattern)
This work (OpAI-Bench) reframes AI-text detection in the realistic context of co-editing.
As AI writing assistants become integrated into drafting and revision workflows, many documents are no longer purely human-written or AI-generated, but result from progressive human–AI co-editing. Yet existing AI-text detection benchmarks largely focus on final outputs and offer limited insight into how AI-authorship signals emerge, accumulate, or disappear throughout revision.
Starting from human-written documents, the authors construct nine sequentially revised versions for each sample under predefined AI coverage levels and five representative AI edit operations, covering four domains while preserving complete authorship provenance at the document, sentence, token, and span granularities. The benchmark supports comprehensive evaluation with 8 document-level detectors, 7 sentence-level detectors, and 2 fine-grained token/span-level detectors.
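To make "provenance at the document, sentence, token, and span granularities" concrete, here is a minimal Python sketch. It assumes token-level author labels are the primary record and derives the coarser views from them. The summary does not say how OpAI-Bench actually stores provenance, so the names `Token`, `author_spans`, `sentence_ai_fraction`, and `document_ai_coverage`, the `"human"`/`"ai"` label set, and the example tokens are all hypothetical.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Dict, List, Tuple


@dataclass
class Token:
    text: str
    author: str        # hypothetical label set: "human" or "ai"
    sentence_id: int   # index of the sentence this token belongs to


def author_spans(tokens: List[Token]) -> List[Tuple[int, int, str]]:
    """Span view: maximal same-author runs as (start, end_exclusive, author)."""
    spans = []
    start = 0
    for author, run in groupby(tokens, key=lambda t: t.author):
        length = len(list(run))
        spans.append((start, start + length, author))
        start += length
    return spans


def sentence_ai_fraction(tokens: List[Token]) -> Dict[int, float]:
    """Sentence view: fraction of AI-authored tokens per sentence."""
    totals: Dict[int, int] = {}
    ai_counts: Dict[int, int] = {}
    for t in tokens:
        totals[t.sentence_id] = totals.get(t.sentence_id, 0) + 1
        ai_counts[t.sentence_id] = ai_counts.get(t.sentence_id, 0) + (t.author == "ai")
    return {s: ai_counts[s] / totals[s] for s in totals}


def document_ai_coverage(tokens: List[Token]) -> float:
    """Document view: overall fraction of AI-authored tokens."""
    return sum(t.author == "ai" for t in tokens) / len(tokens)


# Made-up example: two sentences, with the end of the first rewritten by AI.
example = [
    Token("The", "human", 0), Token("cat", "human", 0),
    Token("sat", "ai", 0), Token("down", "ai", 0),
    Token("It", "human", 1), Token("slept", "human", 1),
]
example_spans = author_spans(example)
example_sentences = sentence_ai_fraction(example)
example_coverage = document_ai_coverage(example)
```

Deriving every coarser view from one token-level record keeps the four granularities consistent by construction, which is one plausible reason to treat token labels as the source of truth. A real benchmark may store them differently.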
Experiments reveal that detectability is governed not only by the proportion of AI-edited content, but also by the edit operation, the domain, and the cumulative revision history. Notably, mixed-authorship intermediate versions are often harder to detect than both fully human and heavily AI-edited endpoints — exposing non-monotonic detection patterns missed by prior benchmarks. OpAI-Bench provides a controlled testbed for analyzing whether, when, and how AI-assisted writing becomes detectable under realistic progressive editing.
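The non-monotonic pattern can be stated operationally: ordered by revision, a detector's score is lowest at some intermediate version rather than at either end. The sketch below checks for this on a list of per-version scores. The function names are hypothetical, and the `toy_scores` values are invented placeholders for illustration only, not results from the paper.

```python
from typing import Sequence


def is_monotonic(scores: Sequence[float]) -> bool:
    """True if scores never decrease or never increase across versions."""
    pairs = list(zip(scores, scores[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


def hardest_version(scores: Sequence[float]) -> int:
    """Index of the version with the lowest detection score (first one on ties)."""
    return min(range(len(scores)), key=lambda i: scores[i])


def has_interior_dip(scores: Sequence[float]) -> bool:
    """True if the lowest score is at an intermediate version and is
    strictly below the scores at both endpoints."""
    i = hardest_version(scores)
    return 0 < i < len(scores) - 1 and scores[i] < scores[0] and scores[i] < scores[-1]


# Invented placeholder scores (e.g. accuracy per version), NOT from the paper.
toy_scores = [0.9, 0.7, 0.5, 0.6, 0.8]
toy_monotonic = is_monotonic(toy_scores)
toy_hardest = hardest_version(toy_scores)
toy_dip = has_interior_dip(toy_scores)
```

With these placeholder values the minimum falls at index 2, an intermediate version, so `has_interior_dip` returns `True` and `is_monotonic` returns `False`. A benchmark that scores only final outputs would never see this dip.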
Why it matters
This work bears on AI-text detection, provenance, and content authenticity. The finding that detection difficulty is non-monotonic under co-editing, with mixed-authorship intermediate versions often the hardest to detect, is useful for anyone deploying AI detection in education, publishing, or on platforms.
Sources (primary)
Source: arXiv (descriptive metadata is CC0 public domain). Summaries are our own; see arXiv for the original text and PDF.
- arXiv abstract page (original, official)
- PDF (arXiv)
- arXiv ID: 2606.06481