Most AI benchmarks test for clean conditions: a clear prompt, a well-structured input, a measurable output. That’s not how real work arrives. Real work arrives as a folder of contradictory files, a chain of edited drafts with margin comments that cancel each other out, and a CRM export from a sales process that has been running for four months without a clear owner.
We ran Claude Fable 5 — accessed through Noumi — against two batches of exactly that kind of material. Not polished inputs. Not structured prompts designed to elicit good performance. Actual project residue: the kind of thing that piles up between Monday’s kickoff call and Thursday’s deadline. The results were specific enough to write about.
What “Messy Input” Actually Means
There’s a version of messy that AI handles well: a long document with inconsistent formatting, or a paragraph-heavy brief that needs structure. Models have been good at that for a while. The harder version — the one that actually separates capable models from ones that produce plausible-sounding output — is material that is internally contradictory, spread across multiple sources that were never meant to be read together, and encoded with context that only makes sense if you understand the work it came from.
That’s the test we ran. Not “can it clean up messy prose” but “can it work with material where the useful signal is partially buried, partially contradicted, and nobody has pre-labeled what matters.”
The two bundles were real project residue, lightly stripped of identifying information.
Bundle one was brand and creative direction material: an annotated executive revision of a campaign brief, an older campaign that had partially been greenlit and partially rejected, a set of brand guardrails with a specific prohibition the marketing team had argued about, a collection of creator feedback from a pilot program, and three separate notes-app snippets that someone had labeled “possible angle” without finishing the thought.
Bundle two was B2B sales material from an ongoing product trial: direct client quotes pulled from meeting recordings, CRM notes that partially contradicted those quotes, a single data point from the trial period (12 files uploaded, 2.1 converted into sales-reusable material in the first week), and four meeting transcripts in varying states of completeness.
Neither bundle was designed to be easy. Both were designed to be realistic.
Why Clean Drafts Don’t Actually Move Work Forward
The failure mode we were testing for is familiar to anyone who has used AI on messy project material before. The model reads everything, produces a well-organized synthesis, and hands it back to you. The synthesis is accurate. It is also not especially useful, because the value wasn’t in reorganizing the information — it was in making a judgment call about which parts of the conflicting material should drive the next decision.
“Good draft” is not the same as “moves the work forward.” A good draft can exist without taking a side on the internal contradiction between the executive revision and the original brief. It can summarize the CRM notes without flagging that three of them suggest the sales process stalled for a reason that isn’t in any of the formal summaries. It can be correct and comprehensive and still leave the next person with exactly the same problem the material started with.
This is the pattern that knowledge workers describe when they say AI is “good for drafting but not for actual thinking.” The drafting is real. The thinking often isn’t: the part where the output is shaped by who will use it, how they’ll use it, and which decision it needs to support.
Two Tests, Two Conclusions
The Brand Bundle: Finding the Specific Claim Buried in the Contradictions
The brief for the brand bundle was explicit: don’t produce a finished campaign, produce something that can anchor a working session. Something the team can argue about and revise, with key judgments traceable back to the material.
The output that came back did not give us “Unleash your creative potential,” the kind of campaign language that is technically compliant with every brief and immediately forgettable. It gave us a specific proposition: “Your inspiration shouldn’t have to reintroduce itself every time.” The attached script opening read: “Starting a new project feels like registering yourself all over again.”
The reason this line works is traceable. It didn’t appear from nowhere — it’s a synthesis of a specific pattern in the creator feedback (repeated references to re-explaining their aesthetic preferences to new tools and collaborators) and a phrase in the older rejected campaign that the executive had circled and marked “almost.” The model surfaced the connection, named the creative insight, and tied it to source material. That’s a working session anchor, not a finished output.
What it got wrong is equally worth noting: the closing line it drafted (“Next time, start from everything you’ve already built”) reads like the last slide of a pitch deck. It’s too neat, too resolved. A creative director would rewrite it. A good editor would cut it. That’s fine, and expected. The value was in the middle, not in having a complete final product delivered automatically.
The Sales Bundle: Finding What the Data Was Actually Saying
The sales brief was different: the trial data already existed, the problem was that nobody had connected the dots in a way sales could act on.
The line that came back as the core finding was: “The materials are in the system, but the sales actions haven’t emerged from them.” That’s not a paraphrase of the input data. The input data said that 12 files were uploaded and 2.1 became reusable in week one. The model connected that number to the client quote (“AI summary is fine, but it doesn’t reach sales action”) and to the CRM note about a sales rep returning to a generic product demo despite having client-specific material available. The finding was an interpretation, not a retrieval.
What followed from that finding was structured and immediately usable: a sales follow-up email (too direct for external use as written, but accurate for internal reference), a revised demo framing, a risk flag for the specific objection pattern that appeared in two of the four transcripts, and a source index that mapped each claim back to the original material.
The source index was the most practically valuable piece. Every assertion was traceable: “sales reuse didn’t happen” was pinned to the client quote, the trial number, and the CRM note. Someone reviewing the output could disagree with the interpretation and know exactly where to go to argue about it.
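One way to picture a source index like this is as a small data structure that pairs each claim with the material behind it. The sketch below is an illustration only, not the format the model actually produced: the class names, the `kind` labels, and the second unsupported claim are all hypothetical, and the excerpts are paraphrased from the description above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class SourceRef:
    kind: str     # hypothetical label, e.g. "client_quote" or "crm_note"
    excerpt: str  # short quote or pointer back to the original material


@dataclass
class Claim:
    text: str
    sources: List[SourceRef] = field(default_factory=list)


def untraceable(claims: List[Claim]) -> List[str]:
    """Return the text of every claim that has no source attached."""
    return [c.text for c in claims if not c.sources]


claims = [
    Claim(
        "Sales reuse didn't happen",
        [
            SourceRef("client_quote", "AI summary is fine, but it doesn't reach sales action"),
            SourceRef("trial_metric", "12 files uploaded, 2.1 reusable in week one"),
            SourceRef("crm_note", "rep returned to a generic product demo"),
        ],
    ),
    # Hypothetical claim with no sources, included only to show the check.
    Claim("Hypothetical unsupported claim"),
]

print(untraceable(claims))
```

A reviewer who disagrees with an interpretation can look up its `sources` directly, and anything returned by `untraceable` is a claim that cannot yet be argued about on the evidence.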
“But Won’t Any Good Model Do This Now?”
The capability jump in frontier models over the past year has been real, and the objection that “any good model can handle this” deserves a direct response.
It’s partly right. The raw language capability required to produce the outputs above isn’t unique to any single model. The question is what conditions need to exist for that capability to produce useful work consistently, across a full project rather than a single prompt.
First, the context architecture matters more than the model ceiling. The brand bundle result depended on the model having access to five different source files simultaneously and treating them as a connected project rather than a queue of documents to process one at a time. A model with identical language capability but memory that resets every session would require significant re-orientation on every interaction. The session where the campaign line appeared wasn’t the first session with that material.
Second, task framing at the model level doesn’t substitute for workspace-level accumulation. A well-crafted system prompt can orient a model toward a specific task. It can’t carry the interpretive history of three weeks of work on the same project. The sales bundle output drew on context from earlier in the trial period — context that no single prompt could have contained.
Third, the output quality scales with accumulated project context in a way that one-off prompt engineering doesn’t. The models that produce the most useful outputs on messy real-world material aren’t doing so because of a better prompt. They’re doing so because persistent context across the project means the material is already indexed and partially interpreted before the next task begins.
How to Evaluate Whether an AI Actually Handles Messy Input
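The right test isn’t whether a model can produce a clean output from clean input. The right test is whether it can work with material where the useful signal is partially buried, partially contradicted, and not pre-labeled.
Four dimensions distinguish capable from marginal performance on this test: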
Source Traceability
Does the output show its work? For any judgment call — a creative direction choice, a risk flag, a sales insight — can you trace it back to specific source material? Output that reads well but can’t be source-traced can’t be interrogated or revised with confidence.
Contradiction Handling
When the input material contradicts itself (executive revision vs. original brief; client quote vs. CRM note), does the model pick a side and say so, or does it average the contradictions into something that technically represents both positions and is actually useful for neither?
Audience-Specificity
The most common failure mode after producing a “good summary” is producing output that’s addressed to no one in particular. Did the model ask — implicitly or explicitly — who this output is for and what they’ll do with it? The sales email and the internal risk brief are different documents even when they draw on the same source material.
Friction Direction
Does the work get easier as more project context accumulates? For tasks that require autonomous multi-step execution across a project with growing source material, the friction should decrease over time, not stay constant. If week-four sessions take the same setup time as week-one sessions, the context isn’t accumulating in a form that’s actually being used.
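Friction direction can be checked with very little machinery if setup time is logged per session. The sketch below fits a least-squares line to setup minutes against session order and reads the sign of the slope. The function name and the numbers are hypothetical and were not measured in the tests described here.

```python
from typing import Sequence


def friction_slope(setup_minutes: Sequence[float]) -> float:
    """Least-squares slope of setup time against session index.

    Negative means setup is getting cheaper as context accumulates;
    zero or positive means the context is not being reused.
    """
    n = len(setup_minutes)
    if n < 2:
        raise ValueError("need at least two sessions to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(setup_minutes) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(setup_minutes))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


# Hypothetical setup times in minutes for weeks one through four.
accumulating = [30, 22, 15, 9]
flat = [30, 30, 30, 30]

print(friction_slope(accumulating))  # -7.0
print(friction_slope(flat))          # 0.0
```

A slope near zero across several weeks is the warning sign described above: week-four sessions costing the same setup as week-one sessions.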
The practical test: take a real piece of messy project material (something with internal contradictions, multiple source files, and an output that needs to be used by a specific person for a specific purpose) and run it. If what comes back is accurate but vague, or comprehensive but untraceable, that is the baseline a capable setup has to beat.
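The four dimensions can be kept as a simple review checklist when running that test. The sketch below is one possible shape for it, assuming a pass/fail judgment per dimension made by a human reviewer. The class and field names are hypothetical, and the article does not prescribe a scoring scheme.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MessyInputReview:
    # Each flag is a reviewer's pass/fail call on one dimension.
    source_traceable: bool
    contradictions_resolved: bool
    audience_specific: bool
    friction_decreasing: bool

    def failures(self) -> List[str]:
        """Return the names of the dimensions the output failed."""
        labels = {
            "source_traceable": "Source Traceability",
            "contradictions_resolved": "Contradiction Handling",
            "audience_specific": "Audience-Specificity",
            "friction_decreasing": "Friction Direction",
        }
        return [label for name, label in labels.items() if not getattr(self, name)]


# Hypothetical review of an output that is accurate but vague and untraceable.
review = MessyInputReview(
    source_traceable=False,
    contradictions_resolved=True,
    audience_specific=False,
    friction_decreasing=True,
)
print(review.failures())
```

Recording which dimensions failed, rather than a single overall score, keeps the review pointed at what to fix next.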
Getting Started
The test that matters for AI on real project work isn’t whether it can write — it’s whether it can work. Working means engaging with material that is incomplete, contradictory, and not pre-organized for AI consumption, and producing output where the judgment calls are visible, traceable, and improvable by the people who will actually use the result.
That’s what the two bundles above were testing for. The brand output gave us something worth arguing about in a working session. The sales output gave a team something they could act on the next morning without spending an hour re-reading twelve source files. Both results depended on accumulated project context that couldn’t have been replicated from a single session prompt.
Noumi is built around that premise: a workspace where project context accumulates across sessions so the work that matters most can start from what you’ve already built, not from a blank window.