For the last few years, the default architecture for “AI over documents” has been almost automatic: chunk the corpus, create embeddings, store vectors, retrieve top-k matches, then ask the model to answer from the retrieved context.
That pattern is useful. It is also becoming too narrow for some agentic workflows.
A new paper, “Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction,” makes a deliberately uncomfortable point: if the AI system is an agent, and the corpus is available as raw files, maybe the best retrieval interface is not always an embedding model or a vector index. Sometimes it is simply the ability to search the corpus directly.
No embedding model. No vector index. Just direct access to the documents, plus an agent that knows how to search, inspect, refine, and verify.
Direct Corpus Interaction, or DCI, is a simple idea with large implications. Instead of giving the agent one retrieval API that returns a ranked list of passages, DCI gives the agent access to the raw corpus through ordinary tools: grep or ripgrep, file reads, directory navigation, shell commands, and small scripts.
That sounds almost primitive compared with modern semantic search stacks. But the primitive interface is precisely the point. A command-line search interface is composable. The agent can look for an exact phrase, inspect the surrounding context, combine two weak clues, count occurrences, search again with a refined hypothesis, and then verify whether the evidence actually supports the answer.
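To make that loop concrete, here is a minimal sketch in plain Python of the search, refine, and inspect steps. It is an illustration under stated assumptions, not the paper's implementation: the helper names (`search`, `context`, `build_demo_corpus`) and the tiny demo corpus are invented for this example, and a real agent would more likely call grep or ripgrep directly.

```python
import re
import tempfile
from pathlib import Path


def search(root, pattern, ignore_case=True):
    """Grep-like search: one (relative path, line number, line) tuple per matching line."""
    rx = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if rx.search(line):
                hits.append((path.relative_to(root).as_posix(), lineno, line.strip()))
    return hits


def context(root, rel_path, lineno, radius=1):
    """Return the lines around a hit so the agent can inspect local context."""
    lines = (Path(root) / rel_path).read_text(encoding="utf-8").splitlines()
    start = max(0, lineno - 1 - radius)
    return lines[start:lineno + radius]


def build_demo_corpus(root):
    """Write a tiny hypothetical corpus; file names and wording are invented."""
    root = Path(root)
    (root / "policies").mkdir()
    (root / "policies" / "retention.txt").write_text(
        "Audit records are retained for seven years.\n"
        "Exceptions require sign-off from the quality lead.\n"
        "Meeting notes are retained for two years.\n",
        encoding="utf-8",
    )
    (root / "notes.txt").write_text(
        "Discussed retention of audit records.\n"
        "No decision was taken.\n",
        encoding="utf-8",
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        build_demo_corpus(tmp)
        # Step 1: broad search on a single clue.
        broad = search(tmp, r"audit records")
        # Step 2: refine by requiring a second clue on the same line.
        refined = [h for h in broad if re.search(r"retained", h[2], re.IGNORECASE)]
        # Step 3: inspect the neighbouring lines before trusting the hit.
        for rel_path, lineno, _ in refined:
            print(rel_path, lineno, context(tmp, rel_path, lineno))
```

The point of the sketch is composability: each step consumes the output of the previous one, and the agent decides what to do next instead of accepting a single ranked list.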
Traditional retrieval compresses this interaction into one narrow step: query in, ranked list out. If a useful clue is filtered out early, the downstream model may never see it. Better reasoning after the fact cannot recover evidence that was never exposed.
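A toy example shows how early filtering can hide evidence. The sketch below is deliberately simplified and rests on assumptions: the similarity score is plain word-overlap (Jaccard), used here only as a stand-in for a real retriever, and the three documents and the form identifier “QA-7” are invented.

```python
def jaccard(a, b):
    """Word-overlap similarity between two strings (a stand-in for a real retriever)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)


def top_k(query, docs, k):
    """Rank documents by similarity to the query and keep only the first k paths."""
    ranked = sorted(docs, key=lambda path: (-jaccard(query, docs[path]), path))
    return ranked[:k]


def exact_search(docs, needle):
    """Direct search: every document containing the literal string."""
    return sorted(path for path, text in docs.items() if needle in text)


# Hypothetical corpus: the answer sits in a document that shares no words with the query.
DOCS = {
    "faq.txt": "how long are meeting records retained",
    "intro.txt": "audit records are long documents",
    "annex.txt": "QA-7 retention period seven years",
}

if __name__ == "__main__":
    query = "how long are audit records retained"
    # The annex never reaches the model when only the top 2 candidates are kept.
    print(top_k(query, DOCS, k=2))
    # An agent that has learned the form identifier elsewhere can search for it directly.
    print(exact_search(DOCS, "QA-7"))
```

With k set to 2, the annex is cut before the model sees it, and no amount of downstream reasoning brings it back. A follow-up exact search does.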
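Vector retrieval is strong when the problem is semantic similarity: “find passages like this question.” But real investigations often require more than similarity: matching an exact phrase or identifier, counting occurrences, combining two weak clues, and checking whether the surrounding context actually supports a claim.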
A top-k retriever can support some of this, but it is not a natural fit. It wants to return candidates. An agent often wants to run an investigation.
The practical lesson is more subtle than “throw away the vector database.” The paper does not prove that vector databases are useless. Dense retrieval, sparse retrieval, and reranking still make sense for many large, static, consumer-facing or latency-sensitive systems.
The important claim is that retrieval quality depends on the interface the agent receives, not only on the retriever model behind it. If the agent can reason, search, and revise its plan, then a richer interface can expose evidence that a fixed similarity step hides.
The authors call this idea retrieval interface resolution. A conventional retriever usually exposes documents or chunks. DCI lets the agent operate at higher resolution: file paths, exact matches, local spans, counts, constraints, and follow-up searches.
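One way to picture higher resolution is to look at what the agent can ask for beyond “give me relevant chunks.” The sketch below is an assumption-laden illustration, not the authors' tooling: the in-memory corpus, the file names, and the three helper functions are all hypothetical.

```python
import re

# Hypothetical corpus held in memory: path -> text.
CORPUS = {
    "sop/cleaning.txt": "Rinse with WFI. Record the WFI batch number. Dry for 30 minutes.",
    "sop/sampling.txt": "Sample after drying. Record the operator ID.",
    "specs/pump.txt": "Pump P-101 requires WFI flush before use.",
}


def count_by_file(corpus, pattern):
    """Counts: how many times the pattern occurs in each file that contains it."""
    counts = {}
    for path, text in corpus.items():
        n = len(re.findall(pattern, text))
        if n:
            counts[path] = n
    return counts


def exact_spans(corpus, pattern):
    """Local spans: (path, start offset, end offset, matched text) for every match."""
    return [
        (path, m.start(), m.end(), m.group(0))
        for path, text in sorted(corpus.items())
        for m in re.finditer(pattern, text)
    ]


def files_matching_all(corpus, patterns):
    """Constraints: files in which every one of the patterns occurs."""
    return sorted(
        path for path, text in corpus.items()
        if all(re.search(p, text) for p in patterns)
    )


if __name__ == "__main__":
    print(count_by_file(CORPUS, r"\bWFI\b"))
    print(exact_spans(CORPUS, r"P-\d+"))
    print(files_matching_all(CORPUS, [r"\bWFI\b", r"\bRecord\b"]))
```

None of these queries has a natural equivalent in a ranked list of chunks: a count, a character offset, and a conjunction of constraints are different kinds of answer.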
Many companies are building internal AI assistants over policies, SOPs, risk files, technical documentation, meeting notes, audit records, product specifications, and code repositories. These corpora are not always clean, static, or perfectly chunkable. They change daily. They contain tables, filenames, cross-references, abbreviations, and exact wording that matters.
In such environments, DCI is attractive because it reduces infrastructure assumptions. There is no mandatory embedding pipeline. No offline index build before the first useful question. No stale vector store quietly drifting away from the current state of the documents.
For regulated or quality-sensitive teams, that is not a small operational detail. The ability to inspect the exact source text, trace the search path, and verify local context is often more valuable than a polished semantic answer that cannot easily explain how it found its evidence.
DCI points toward a different design pattern. Instead of treating retrieval as a hidden preprocessing layer, treat corpus access as a tool layer the agent can use deliberately.
DCI is especially relevant when the agent operates over a bounded but rich corpus and when exact evidence matters.
It is less attractive when the corpus is enormous, remote, or unstructured in hostile ways, or when response latency must be extremely predictable. In those cases, indexing and retrieval pipelines still earn their keep.
There is also a governance lesson here. When AI systems are used in serious business processes, we should not only ask whether the final answer is plausible. We should ask whether the system had a good way to find, preserve, and inspect evidence.
A black-box retrieval step can make evidence selection hard to challenge. DCI does not automatically solve that problem, but it makes the investigation more visible. Search commands, opened files, and local context can be logged and reviewed. That is useful for validation, debugging, and auditability.
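As a rough sketch of what a reviewable search trail could look like, the example below wraps two corpus tools so that every call is recorded. The class name `AuditedTools`, its methods, and the log fields are assumptions made for illustration; a production system would need its own schema, storage, and access controls.

```python
import json
import time


class AuditedTools:
    """Wrap corpus tools so each call leaves an entry in a reviewable trail."""

    def __init__(self, corpus):
        self.corpus = corpus  # hypothetical in-memory corpus: path -> text
        self.trail = []       # ordered list of logged tool calls

    def _log(self, tool, args, summary):
        self.trail.append({
            "step": len(self.trail) + 1,
            "tool": tool,
            "args": args,
            "summary": summary,
            "timestamp": time.time(),
        })

    def grep(self, needle):
        """Case-insensitive literal search; returns (path, line number) pairs."""
        hits = [
            (path, lineno)
            for path, text in sorted(self.corpus.items())
            for lineno, line in enumerate(text.splitlines(), start=1)
            if needle.lower() in line.lower()
        ]
        self._log("grep", {"needle": needle}, {"hits": len(hits)})
        return hits

    def read(self, path, start, end):
        """Return lines start..end (1-indexed, inclusive) of one file."""
        lines = self.corpus[path].splitlines()[start - 1:end]
        self._log("read", {"path": path, "start": start, "end": end},
                  {"lines": len(lines)})
        return lines

    def export(self):
        """Serialise the trail as JSON Lines for later review."""
        return "\n".join(json.dumps(entry, sort_keys=True) for entry in self.trail)


if __name__ == "__main__":
    tools = AuditedTools({"a.txt": "alpha\nbeta deviation\ngamma"})
    for path, lineno in tools.grep("Deviation"):
        tools.read(path, max(1, lineno - 1), lineno + 1)
    print(tools.export())
```

A reviewer reading the exported trail can see which searches were run, which files were opened, and in what order, without having to trust the final answer on its own.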
The most interesting part of DCI is not that it uses grep. It is that it treats the agent as an active researcher rather than a passive consumer of top-k snippets.
That fits the direction AI systems are moving. As agents become better at planning and tool use, the limiting factor is often not the language model alone. It is the interface we give it. A narrow interface produces narrow evidence. A richer interface allows better investigation.
For business leaders, the takeaway is practical: do not start every knowledge-assistant project by buying a vector database and declaring the architecture solved. First ask what kind of evidence the agent must find, how exact the search needs to be, how often the corpus changes, and how the search path will be verified.
Direct Corpus Interaction is a useful reminder that retrieval is not just a model choice. It is an interface design problem.
Embedding search remains valuable. But for agentic systems working over local, evolving, evidence-heavy corpora, direct search over raw documents may be simpler, more transparent, and sometimes stronger. The future of retrieval will probably not be “vectors or grep.” It will be agents with the right mix of tools, using the right interface for the job.
And occasionally, the right interface will look suspiciously like a terminal.