Direct Corpus Interaction: Why Agentic Search May Not Need a Vector Database First

For the last few years, the default architecture for “AI over documents” has been almost automatic: chunk the corpus, create embeddings, store vectors, retrieve top-k matches, then ask the model to answer from the retrieved context.

That pattern is useful. It is also becoming too narrow for some agentic workflows.

A new paper, “Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction,” makes a deliberately uncomfortable point: if the AI system is an agent, and the corpus is available as raw files, maybe the best retrieval interface is not always an embedding model or a vector index. Sometimes it is simply the ability to search the corpus directly.

No embedding model. No vector index. Just direct access to the documents, plus an agent that knows how to search, inspect, refine, and verify.

Figure: concept diagram of an AI agent using terminal search over raw documents instead of a vector index. Direct Corpus Interaction changes the retrieval interface: the agent works with the corpus directly, using search and file inspection as part of the reasoning loop.

What Direct Corpus Interaction changes

Direct Corpus Interaction, or DCI, is a simple idea with large implications. Instead of giving the agent one retrieval API that returns a ranked list of passages, DCI gives the agent access to the raw corpus through ordinary tools: grep or ripgrep, file reads, directory navigation, shell commands, and small scripts.

That sounds almost primitive compared with modern semantic search stacks. But the primitive interface is precisely the point. A command-line search interface is composable. The agent can look for an exact phrase, inspect the surrounding context, combine two weak clues, count occurrences, search again with a refined hypothesis, and then verify whether the evidence actually supports the answer.
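
A minimal Python sketch of those building blocks, assuming the corpus is a folder of UTF-8 `.txt` files. The function names `search_corpus` and `count_matches` are illustrative and do not come from the paper; an agent with shell access would more likely call grep or ripgrep directly.

```python
import re
from pathlib import Path


def search_corpus(root, pattern, context=1):
    """Yield (path, line_number, surrounding_lines) for each regex match under root."""
    regex = re.compile(pattern)
    # Assumption: only .txt files are searched; adjust the glob for other formats.
    for path in sorted(Path(root).rglob("*.txt")):
        lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
        for i, line in enumerate(lines):
            if regex.search(line):
                start = max(0, i - context)
                end = min(len(lines), i + context + 1)
                # Line numbers are 1-indexed, as grep -n would report them.
                yield path, i + 1, lines[start:end]


def count_matches(root, pattern):
    """Count matching lines per file, similar in spirit to `grep -c`."""
    counts = {}
    for path, _, _ in search_corpus(root, pattern, context=0):
        counts[path] = counts.get(path, 0) + 1
    return counts
```

An agent could call `count_matches` first to see which files are worth opening, then `search_corpus` with a larger `context` to read around the promising hits.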

Traditional retrieval compresses this interaction into one narrow step: query in, ranked list out. If a useful clue is filtered out early, the downstream model may never see it. Better reasoning after the fact cannot recover evidence that was never exposed.

Why vector search can become a bottleneck

Vector retrieval is strong when the problem is semantic similarity: “find passages like this question.” But real investigations often require more than similarity.

  • Exact constraints: product names, regulatory clauses, error strings, dates, IDs, file names, and uncommon phrases.
  • Conjunctions of weak clues: several terms that are not individually decisive but become useful together.
  • Multi-hop discovery: find one entity, inspect it, discover another entity, then search again.
  • Local verification: read the lines before and after a match to understand whether the evidence is actually relevant.
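
Two of these patterns, conjunctions of weak clues and multi-hop discovery, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's method: the helper names `files_matching_all` and `two_hop` are hypothetical, files are assumed to be readable as UTF-8 text, and matching is a plain case-insensitive substring test.

```python
import re
from pathlib import Path


def _read(path):
    """Read a file as text, replacing bytes that are not valid UTF-8."""
    return path.read_text(encoding="utf-8", errors="replace")


def files_matching_all(root, terms):
    """Return files whose text contains every term (case-insensitive)."""
    wanted = [term.lower() for term in terms]
    matches = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            text = _read(path).lower()
            if all(term in text for term in wanted):
                matches.append(path)
    return matches


def two_hop(root, clue_terms, entity_pattern):
    """Hop 1: narrow files by weak clues. Hop 2: search again for each entity found."""
    entity_regex = re.compile(entity_pattern)
    found = {}
    for path in files_matching_all(root, clue_terms):
        for match in entity_regex.finditer(_read(path)):
            entity = match.group(0)
            if entity not in found:
                # The second search uses the discovered entity as the new query.
                found[entity] = files_matching_all(root, [entity])
    return found
```

For example, `two_hop(root, ["pump", "failure"], r"TCK-\d+")` would first find files mentioning both words, pull out any ticket-like IDs from them, and then list every file that mentions each ID. The ticket pattern is made up for the example.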

A top-k retriever can support some of this, but it is not a natural fit. It wants to return candidates. An agent often wants to run an investigation.

The paper’s claim is not that embeddings are obsolete

The practical lesson is more subtle. The paper does not prove that vector databases are useless. Dense retrieval, sparse retrieval, and reranking still make sense for many large, static, consumer-facing or latency-sensitive systems.

The important claim is that retrieval quality depends on the interface the agent receives, not only on the retriever model behind it. If the agent can reason, search, and revise its plan, then a richer interface can expose evidence that a fixed similarity step hides.

The authors call this idea retrieval interface resolution. A conventional retriever usually exposes documents or chunks. DCI lets the agent operate at higher resolution: file paths, exact matches, local spans, counts, constraints, and follow-up searches.
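
One way to picture the difference in resolution is to compare the shape of what each interface returns. The two record types below are hypothetical and are not taken from the paper: `ChunkHit` stands for a typical top-k result, while `SpanHit` carries the file path, line number, and character offsets of an exact match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkHit:
    """Low resolution: a ranked chunk with a similarity score."""
    chunk_id: str
    score: float
    text: str


@dataclass(frozen=True)
class SpanHit:
    """Higher resolution: where exactly a phrase occurs inside a file."""
    path: str
    line_number: int  # 1-indexed
    start: int        # character offset within the line, inclusive
    end: int          # character offset within the line, exclusive
    line: str


def span_hits(path, text, phrase):
    """Return one SpanHit per non-overlapping exact occurrence of phrase in text."""
    if not phrase:
        return []
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        start = line.find(phrase)
        while start != -1:
            end = start + len(phrase)
            hits.append(SpanHit(path, number, start, end, line))
            start = line.find(phrase, end)
    return hits
```

With span-level results, counts and follow-up searches come almost for free: `len(span_hits(...))` is an occurrence count, and each hit says exactly which line to read next.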

Why this matters for enterprise knowledge systems

Many companies are building internal AI assistants over policies, SOPs, risk files, technical documentation, meeting notes, audit records, product specifications, and code repositories. These corpora are not always clean, static, or perfectly chunkable. They change daily. They contain tables, filenames, cross-references, abbreviations, and exact wording that matters.

In such environments, DCI is attractive because it reduces infrastructure assumptions. There is no mandatory embedding pipeline. No offline index build before the first useful question. No stale vector store quietly drifting away from the current state of the documents.

For regulated or quality-sensitive teams, that is not a small operational detail. The ability to inspect the exact source text, trace the search path, and verify local context is often more valuable than a polished semantic answer that cannot easily explain how it found its evidence.

The architecture implication: retrieval becomes a tool layer

DCI points toward a different design pattern. Instead of treating retrieval as a hidden preprocessing layer, treat corpus access as a tool layer the agent can use deliberately.

  1. Start with the corpus as it is. Files, folders, exports, logs, markdown, PDFs converted to text, and structured data can all be searchable assets.
  2. Expose precise search primitives. Exact match, regex, metadata filters, file reads, and small scripts are not old-fashioned. They are controllable retrieval instruments.
  3. Let the agent iterate. The value comes from search, inspect, refine, and verify loops, not from one perfect initial query.
  4. Keep evidence observable. The system should preserve what was searched, what was opened, and which text supported the conclusion.
  5. Add semantic retrieval where it helps. Vector search can still be one tool among others, not the only gateway to knowledge.
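
The five steps above can be sketched as a small tool layer. This is a minimal illustration, assuming a local folder of UTF-8 text files; the class `CorpusTools` and its method names are hypothetical, and a production version would also need to validate paths and restrict what the agent may read.

```python
import re
from pathlib import Path


class CorpusTools:
    """Hypothetical tool layer: each method is one primitive an agent could call."""

    def __init__(self, root):
        self.root = Path(root)
        self.trace = []  # ordered record of every tool call, kept for later review

    def grep(self, pattern):
        """Return (relative_path, line_number, line) for each regex match."""
        regex = re.compile(pattern)
        hits = []
        for path in sorted(self.root.rglob("*")):
            if not path.is_file():
                continue
            text = path.read_text(encoding="utf-8", errors="replace")
            for number, line in enumerate(text.splitlines(), start=1):
                if regex.search(line):
                    relative = path.relative_to(self.root).as_posix()
                    hits.append((relative, number, line))
        self.trace.append({"tool": "grep", "pattern": pattern, "hits": len(hits)})
        return hits

    def read_lines(self, relative_path, start, end):
        """Return lines start..end (1-indexed, inclusive) of one file."""
        # Assumption: relative_path is trusted; real systems should reject
        # paths that escape the corpus root.
        path = self.root / relative_path
        lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
        self.trace.append(
            {"tool": "read_lines", "path": relative_path, "start": start, "end": end}
        )
        return lines[max(0, start - 1):end]

    def as_tools(self):
        """Map tool names to callables, the form many agent frameworks expect."""
        return {"grep": self.grep, "read_lines": self.read_lines}
```

The dictionary returned by `as_tools` is where a semantic search function could be registered alongside `grep` and `read_lines`, which matches step 5: vector search as one tool among others.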

Where DCI is likely to work well

DCI is especially relevant when the agent operates over a bounded but rich corpus and when exact evidence matters.

  • Software and engineering repositories: code, tickets, logs, requirements, design notes, and incident reports.
  • Quality and regulatory documentation: SOPs, risk management files, clinical evaluation records, audit trails, and technical files.
  • Internal research libraries: papers, notes, experiments, market analysis, and decision records.
  • Customer support knowledge bases: exact product versions, error text, and procedural wording.

It is less attractive when the corpus is enormous, held on remote systems, stored in formats that plain-text tools cannot parse cleanly, or when response latency must be extremely predictable. In those cases, indexing and retrieval pipelines still earn their keep.

The governance angle

There is also a governance lesson here. When AI systems are used in serious business processes, we should not only ask whether the final answer is plausible. We should ask whether the system had a good way to find, preserve, and inspect evidence.

A black-box retrieval step can make evidence selection hard to challenge. DCI does not automatically solve that problem, but it makes the investigation more visible. Search commands, opened files, and local context can be logged and reviewed. That is useful for validation, debugging, and auditability.
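
As a sketch of what such logging could look like, the decorator below appends one JSON line per tool call to a log file. The name `audited` and the record fields are assumptions made for illustration; a real deployment would pick its own schema and storage.

```python
import functools
import json
from datetime import datetime, timezone


def audited(log_path):
    """Wrap a tool so every call appends one JSON line to log_path."""
    def decorator(tool):
        @functools.wraps(tool)
        def wrapper(*args, **kwargs):
            result = tool(*args, **kwargs)
            record = {
                "time": datetime.now(timezone.utc).isoformat(),
                "tool": tool.__name__,
                # repr() keeps the log JSON-safe for arbitrary argument types.
                "args": [repr(arg) for arg in args],
                "kwargs": {key: repr(value) for key, value in kwargs.items()},
                "result_count": len(result) if hasattr(result, "__len__") else None,
            }
            with open(log_path, "a", encoding="utf-8") as log:
                log.write(json.dumps(record) + "\n")
            return result
        return wrapper
    return decorator
```

Any search or file-read function can be wrapped with `@audited("audit.jsonl")`; a reviewer can then replay the sequence of calls without access to the agent's internal state.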

My view

The most interesting part of DCI is not that it uses grep. It is that it treats the agent as an active researcher rather than a passive consumer of top-k snippets.

That fits the direction AI systems are moving. As agents become better at planning and tool use, the limiting factor is often not the language model alone. It is the interface we give it. A narrow interface produces narrow evidence. A richer interface allows better investigation.

For business leaders, the takeaway is practical: do not start every knowledge-assistant project by buying a vector database and declaring the architecture solved. First ask what kind of evidence the agent must find, how exact the search needs to be, how often the corpus changes, and how the search path will be verified.

Conclusion

Direct Corpus Interaction is a useful reminder that retrieval is not just a model choice. It is an interface design problem.

Embedding search remains valuable. But for agentic systems working over local, evolving, evidence-heavy corpora, direct search over raw documents may be simpler, more transparent, and sometimes stronger. The future of retrieval will probably not be “vectors or grep.” It will be agents with the right mix of tools, using the right interface for the job.

And occasionally, the right interface will look suspiciously like a terminal.

Miloš Cigoj, Founder, Excellence Consulting  ·  Operational Excellence & AI Strategy
