The Retrieval & Knowledge Engineering Catalog
A Catalog of Hybrid Search, Embeddings, and Knowledge Substrates
Draft v0.1
May 2026
About This Catalog
This is the tenth volume in a catalog of the working vocabulary of agentic AI. The nine prior volumes covered patterns (the timing of agent runs), skills (model instructions in packaged form), tools (the function-calling primitives), events and triggers (what activates the agent), fabric (the substrate beneath orchestration), memory (state, context, and recall), human-in-the-loop (approval, observation, and interaction), evaluation and guardrails (the governance layer), and multi-agent coordination (the agent-to-agent communication layer). This tenth volume covers the discipline that makes agents useful in the first place: how an agent finds the right information in a corpus it doesn’t own and ranks it well enough that the answer above the fold is the answer that matters.
Retrieval is the oldest discipline in this catalog and the youngest. Information retrieval as an academic field predates LLMs by decades --- TF-IDF, BM25, learning-to-rank, the entire vocabulary of relevance --- and the foundations remain intact. What changed across 2023–2025 is everything above the foundations: dense embeddings became cheap enough to put in front of every retrieval; cross-encoder rerankers became cheap enough to put after; query understanding became LLM-driven; chunking became a design decision worth deliberation rather than a default of fixed-size windows with fifty tokens of overlap; and the agent layer made retrieval into a tool the agent reasons about rather than a fixed pipeline the system runs. The classical retrieval discipline has a new substrate to integrate with; this volume covers both.
The opening claim of this catalog is that retrieval and memory are different disciplines (Chapter 1), even though they share the vector store as a substrate. Volume 6 covered memory --- what the agent writes, summarizes, and recalls about its own history and context. This volume covers retrieval --- what the agent fetches from a corpus it consumes but doesn’t own. The engineering disciplines are different. Memory is read-write, with frequent writes from the agent itself and with eviction policies and freshness concerns dominating; retrieval is read-mostly, with infrequent index updates and with ranking quality and corpus governance dominating. A team that conflates the two ends up with eviction policies on knowledge bases and ranking debates on conversation summaries, both of which are the wrong design pressures.
Scope
Coverage:
- Hybrid search platforms: OpenSearch, Elasticsearch, Vespa --- the engines that combine BM25 with dense vectors.
- Embedding models: proprietary APIs (OpenAI, Cohere, Voyage) and open-weights families (BGE, E5, Jina, Nomic).
- Rerankers: cross-encoder rerankers (Cohere Rerank, BGE Reranker, Voyage Rerank) and late-interaction models (ColBERT, ColPali).
- Document processing: Unstructured.io, LlamaParse, and the newer entrants (Marker, Docling, Reducto).
- RAG orchestration frameworks: LlamaIndex retrievers, LangChain retrievers.
- Query understanding patterns: HyDE, query decomposition, multi-query, query rewriting --- implementable across any framework.
- Knowledge graphs and graph-augmented retrieval: Neo4j with vector indexes, Microsoft GraphRAG, LightRAG.
- Discovery and benchmarks: MTEB, BEIR, awesome lists.
Out of scope:
- Vector databases when treated as primary memory infrastructure --- covered in Volume 6.
- Classical IR research (the academic foundations of TF-IDF, BM25, learning-to-rank) when treated as research rather than engineering substrate.
- Enterprise search platforms (Coveo, Algolia, Lucidworks) when used outside the LLM-augmented retrieval context. They’re relevant but their own literature covers them.
- Training-time concerns of embedding models (contrastive pretraining, fine-tuning strategies, distillation) when treated as training infrastructure rather than retrieval substrate.
- Web search engines and their APIs (Google, Bing, Brave, Tavily, Exa) when treated as undifferentiated services rather than retrieval engineering. The latter is covered briefly where it connects to agent tool use.
How to read this catalog
Part 1 (“The Narratives”) is conceptual orientation: the distinction between retrieval and memory, the retrieval stack and its layers, why hybrid search wins, why chunking is design rather than housekeeping, and how single-shot RAG evolved into agentic retrieval. Five diagrams sit in Part 1; everything in Part 2 is text and code.
Part 2 (“The Substrates”) is reference material organized by section. Each section opens with a short essay on what its entries have in common and how they relate to alternatives. Representative substrates appear in the Fowler-style template established by the prior nine volumes.
Part 1 — The Narratives
Five short essays frame the design space for retrieval and knowledge engineering. The reference entries in Part 2 assume the vocabulary established here.
Chapter 1. Retrieval and Memory Are Different Disciplines
The vector store is a primitive shared by memory (Volume 6) and retrieval (this volume), and the surface similarity --- both store vectors, both expose k-NN search --- produces a recurring conflation. A team building “retrieval” wakes up worrying about session expiration and eviction policies, which are memory concerns; a team building “memory” wakes up worrying about chunking strategies and reranking, which are retrieval concerns. The disciplines are different even when the data structure is the same.
Memory is what the agent writes about itself. Conversation summaries, observed facts, declared preferences, the working state of an in-progress task. The writes come from the agent’s own runs, often as a side effect of completing some work; the reads come from subsequent runs of the same agent (or related agents) needing to recall what happened before. The engineering pressures are around write semantics (when to write, what to write, how to compact), freshness (memory of a stale conversation hurts more than helps), eviction (the memory budget is finite; old content must age out), and personalization (different agents and different users have different memory scopes). Volume 6’s entries reflect these pressures: Mem0, Zep, Letta, Redis-as-state, the typed-state primitives of LangGraph.
Retrieval is what the agent fetches from a corpus it consumes but doesn’t own. Product documentation, knowledge base articles, code repositories, customer records, regulatory filings, internal wiki pages, the open web. The writes happen out-of-band --- the corpus is updated by humans, by ingestion pipelines, by syncs from systems of record --- and the agent reads it without contributing to it. The engineering pressures are around relevance (the top-K returned must contain the actually-relevant documents), recall (the relevant documents must exist in the index in the first place), ranking quality (the order of returned results determines whether the answer appears above the fold), and corpus governance (which documents are indexed, who has permission to retrieve them, how stale documents are detected and refreshed). This volume’s entries reflect these pressures: hybrid search engines, embedding models, rerankers, document processors, query understanding patterns.
Two practical implications follow from the distinction. First, the tooling overlap is partial. A vector database (Qdrant, Weaviate, Pinecone) shows up in both volumes because the data structure is shared, but the dominant adjacent tooling differs: memory deployments need conversation-summarization and consolidation features; retrieval deployments need document-processing and reranking features. Choosing a substrate purely on “it supports vectors” underweights this; the right question is which adjacent capabilities the deployment actually needs. Second, the operational concerns differ. Memory deployments worry about write throughput and eviction; retrieval deployments worry about index freshness and ranking quality. The monitoring dashboards, the alerts that matter, the SLAs that get negotiated --- all of them differ between the two disciplines even when the underlying database is the same product.
With the distinction established, the rest of this volume treats retrieval as its own discipline with its own substrates, its own engineering decisions, and its own quality bar. Volume 6’s memory entries are referenced where they connect; the connection is real but the disciplines are not the same.
Chapter 2. The Retrieval Stack
Production retrieval systems are built from eight discernible layers, each its own engineering discipline with its own products, its own tradeoffs, and its own failure modes. Teams typically own three to five layers well and treat the rest as black boxes. The biggest quality wins almost always come from a layer the team currently treats as a black box.
The indexing side is offline. Raw documents (PDFs, Word files, HTML, scanned images, code, structured data) enter document processing, which parses them into structured representations --- extracting text, preserving table structure, recovering layout, handling OCR for scanned content. The output feeds chunking, which splits documents into the units that will become embedded vectors and indexed records. Chunks feed embedding, which converts them to dense vector representations using a chosen model. The resulting vectors, alongside the original text and metadata, feed indexing into the production store --- typically a hybrid store supporting both vector and inverted-index (BM25) retrieval, with metadata filtering for governance.
The querying side is online. A raw query (what the user or agent typed) enters query understanding, which rewrites, decomposes, expands, or otherwise transforms it before retrieval. The transformed query feeds retrieve top-K, where the hybrid store returns candidate documents --- typically 50 to 200 --- using a combination of BM25 lexical matching and dense vector similarity, with metadata filters applied. The candidate set feeds rerank top-N, where a cross-encoder or late-interaction model produces high-quality ranking over the smaller candidate pool. The reranked top-N feeds context assembly, which deduplicates, formats, and trims to fit within the LLM’s context window. The assembled context goes to the LLM, alongside the original query, for generation.
Each layer has its own canonical products. Document processing has Unstructured.io, LlamaParse, Marker, Docling, Reducto. Chunking has the strategies covered in Chapter 4 (fixed-size, semantic, hierarchical, contextual). Embedding has the model families covered in Section B (proprietary APIs from OpenAI, Cohere, Voyage; open-weights from BGE, E5, Jina, Nomic). Indexing has the hybrid engines covered in Section A (OpenSearch, Elasticsearch, Vespa) plus the vector-first stores from Volume 6 (Qdrant, Weaviate, Pinecone, Milvus). Query understanding has the patterns covered in Section F (HyDE, decomposition, multi-query). Retrieval has the algorithms (BM25, k-NN, hybrid fusion via RRF or weighted sum). Reranking has the cross-encoders and late-interaction models in Section C (Cohere Rerank, BGE Reranker, ColBERT). Context assembly has the orchestration frameworks in Section E (LlamaIndex, LangChain retrievers).
The most useful diagnostic move when retrieval quality is poor is to identify which layer is responsible. “The agent gave a bad answer” could be a generation problem (the LLM had the right context and produced a wrong answer), a context assembly problem (the right document was retrieved but didn’t make it into the context window), a reranking problem (the right document was in the candidate set but ranked below the noise), a retrieval problem (the right document existed in the index but the search didn’t surface it), an indexing problem (the right document wasn’t indexed because the pipeline failed silently), an embedding problem (the right document was indexed but its embedding didn’t match the query embedding well), a chunking problem (the right information was split across chunks neither of which retrieved well), a document processing problem (the right information was in the source document but didn’t survive parsing), or a query understanding problem (the user’s question, as typed, wasn’t the right search input). Each of these failures requires a different fix in a different layer; treating the whole stack as a black box and tuning prompts produces consistent disappointment.
Chapter 3. Hybrid Search Wins
The empirical consensus across 2024 and 2025, on essentially every public benchmark and replicated in countless production deployments, is that hybrid search --- lexical (BM25 or a learned sparse variant like SPLADE) combined with dense vector retrieval, with cross-encoder reranking on the candidate set --- outperforms either pure dense or pure sparse retrieval alone. Teams that ship dense-only or BM25-only are leaving measurable quality on the table, often by significant margins on the queries where their chosen approach has known weaknesses.
Pure sparse (BM25) excels at exact term matching: product codes, names, rare technical terms, and any query where the user’s exact tokens appear in the relevant document. The IDF (inverse document frequency) weighting that makes BM25 work also produces its canonical failure mode: vocabulary mismatch. A query for “car” doesn’t match a document about “automobile”; a query for “heart attack” doesn’t match “myocardial infarction”; a query in casual phrasing doesn’t match a document in formal phrasing. The mismatch isn’t fixable within the BM25 framework --- the algorithm operates on tokens, not meanings. For decades, query expansion (adding synonyms to the query) and document expansion (adding alternative phrasings to documents) were the workarounds. Dense embeddings, the modern workaround, do the same job better by operating on learned representations where semantic similarity is built in.
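The vocabulary-mismatch failure can be reproduced in a few lines. The following is a minimal sketch of Okapi BM25 scoring, assuming whitespace tokenization, the common defaults k1 = 1.2 and b = 0.75, and the non-negative IDF variant used by Lucene-family engines; the function name `bm25_scores` and the two toy documents are illustrative, not taken from any engine.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # no shared token, so no contribution at all
            # Non-negative IDF: rare terms weigh more than common ones.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1 and is normalized by length via b.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "the car would not start this morning".split(),
    "the automobile would not start this morning".split(),
]
scores = bm25_scores(["car"], docs)
print(scores)  # the second score is exactly 0.0
```

The second document is about the same event but shares no token with the query, so it scores zero no matter how k1 and b are tuned. That is the gap dense retrieval is meant to close.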
Pure dense (vector) retrieval excels at semantic similarity: paraphrase robustness, conceptual matching, cross-lingual capability when the embedding model supports it, and handling long natural-language queries where BM25’s bag-of-words assumption fails. The canonical dense failures are the mirror image of BM25’s. Rare terms produce poor embeddings because the model didn’t see them often during training; product codes, version numbers, and identifiers don’t embed meaningfully because they’re not the kind of tokens semantic embeddings are designed for; exact-term matches lose to semantically-similar near-matches in the top-K results, which is often the wrong outcome when the user wants exactly what they asked for.
The hybrid pattern uses both retrievers and combines the results. BM25 returns its top-K₁ (typically 50–100) candidates; the dense retriever returns its top-K₂ (typically 50–100); the two ranked lists are fused via Reciprocal Rank Fusion (RRF) or a weighted score combination. RRF is the most-cited fusion method because it requires no score calibration between the two retrievers: each document’s fused score is the sum of 1/(k + rank) terms across the rankers where it appears, with k typically 60. The fused candidate set, typically 100–200 unique documents after deduplication, then feeds a cross-encoder reranker.
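The fusion rule is small enough to write out in full. This is a minimal sketch of RRF as described above, assuming each retriever returns an ordered list of document IDs with rank 1 first; the function name and the sample IDs are hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs; return (doc_id, score) pairs, best first."""
    fused = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit of each list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

bm25_top = ["sku-123", "doc-a", "doc-b"]   # lexical retriever output
dense_top = ["doc-a", "doc-c", "sku-123"]  # dense retriever output
fused = reciprocal_rank_fusion([bm25_top, dense_top])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

In this toy input `doc-a` comes out first because it sits near the top of both lists, while documents seen by only one retriever fall to the bottom. Only ranks are used, which is why no score calibration between the retrievers is needed.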
Reranking is the single highest-leverage improvement in retrieval pipelines. A cross-encoder takes the query and a candidate document together and produces a relevance score; this is fundamentally more expressive than the bi-encoder (independent embeddings of query and document) used in dense retrieval, and far more expressive than the term-frequency scoring in BM25. The cost is computational: scoring N documents requires N forward passes through the cross-encoder model, versus one query embedding plus N vector comparisons for dense retrieval. The two-stage pattern --- cheap retrieval to get the candidate set, expensive reranking to order it --- reconciles the cost and the quality. Cohere Rerank, BGE Reranker, and Voyage Rerank are the production-ready cross-encoder options as of mid-2026; the quality lift from adding a reranker to a previously-unreranked pipeline is consistently substantial across deployments.
Late-interaction models (ColBERT and its successors) sit between bi-encoders and cross-encoders. They produce per-token embeddings for both query and document, then compute relevance via MaxSim aggregation (for each query token, find the best-matching document token; sum these matches). The model is more expressive than bi-encoders (it can attend to query-document interactions at the token level) and faster than cross-encoders (the document representations can be precomputed and indexed). ColBERTv2 and the ColPali variant (which handles document images directly, skipping text extraction for layout-heavy content) are the current production-relevant late-interaction models; they’re less universally adopted than cross-encoders but valuable for the cases where their efficiency-quality tradeoff fits.
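MaxSim aggregation can be sketched directly from that description. The example below assumes per-token embeddings are already computed and uses tiny hand-written 2-dimensional vectors in place of real model output; `maxsim_score` is an illustrative name, not a ColBERT API.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    """For each query token take its best-matching document token, then sum."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy per-token embeddings (assumed already normalized by the encoder).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
doc_b = [[0.9, 0.1], [0.7, 0.3]]

print(maxsim_score(query, doc_a))  # 0.9 + 0.8
print(maxsim_score(query, doc_b))  # 0.9 + 0.3
```

`doc_a` wins because each query token finds a strong match somewhere in the document. The document-side vectors do not depend on the query, which is what allows them to be precomputed and indexed.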
The practical recommendation: ship hybrid retrieval plus cross-encoder reranking as the default for any serious RAG deployment. Pure dense retrieval is the default that production systems should grow out of as soon as quality matters more than implementation simplicity. Pure BM25 is the default that legacy enterprise search systems should grow out of, moving to the hybrid pattern by adding dense retrieval and a reranker. The case for either pure approach is operational rather than technical: dense-only is cheaper to operate at scale because the candidate-set retrieval is uniform; BM25-only is cheaper to operate at scale because the infrastructure is mature and embedding models are an additional dependency. Neither operational argument changes the empirical fact that hybrid plus reranking produces better results.
Chapter 4. Chunking Is Design, Not Housekeeping
The default chunking strategy in most RAG tutorials is to split the document into fixed-size windows with fixed overlap (the canonical example: 512 tokens with 50-token overlap). The default is everywhere; production deployments outgrow it as soon as quality becomes a priority. Chunking strategy often matters more than embedding model choice; an upgrade from fixed-size to semantic or hierarchical chunking produces measurable retrieval quality improvement on most corpora, while an embedding model upgrade frequently produces no improvement visible above the noise. Chunking is design, not housekeeping.
Fixed-size chunking splits the document every N tokens regardless of content boundaries, with overlap to mitigate the splits. The strengths are operational: trivial to implement, predictable token budgets, uniform chunk sizes that simplify downstream processing. The weaknesses are quality: splits happen mid-sentence, mid-paragraph, mid-table; logical units (a single explanation, a complete code function, a coherent argument) get split across chunks that retrieve poorly because neither chunk contains the complete idea. On unstructured prose the damage is manageable; on structured content (technical documentation, legal contracts, scientific papers with explicit section structure) the damage is significant.
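For reference, the whole strategy fits in a dozen lines. This is a minimal sketch that assumes the document is already tokenized into a list; the function name is illustrative and the defaults mirror the 512/50 example above.

```python
def fixed_size_chunks(tokens, size=512, overlap=50):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks

tokens = list(range(1000))  # stand-in for real token IDs
chunks = fixed_size_chunks(tokens)
print([len(c) for c in chunks])
```

Nothing in the loop looks at content, so a window boundary can land inside a sentence, a table row, or a function body.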
Semantic chunking splits at meaningful boundaries: paragraph breaks, section headings, topic shifts detected by an NLP heuristic or an LLM. The chunks are variable-sized, reflecting the natural structure of the document rather than a fixed window. The strengths are preserved logical units --- a section about “refund policy” ends up as one chunk that retrieves well for refund queries, rather than fragmented across multiple chunks none of which contains the full policy. The weaknesses are implementation complexity (the splitter is more sophisticated than “every 512 tokens”) and variable chunk sizes (some chunks are short, some long; downstream context-window budgeting needs more thought). For most production deployments with structured corpora, the implementation complexity is worth the quality gain.
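A structure-aware splitter does not need to be elaborate to beat fixed windows on sectioned documents. The sketch below is one simple heuristic under stated assumptions: paragraphs are separated by blank lines, headings are Markdown-style lines starting with `#`, and size is budgeted in words rather than model tokens. It does not detect topic shifts; an NLP or LLM boundary detector would replace the `is_heading` test.

```python
import re

def paragraph_chunks(text, max_words=200):
    """Group paragraphs into chunks, starting a new chunk at each heading."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        words = len(para.split())
        is_heading = para.startswith("#")  # assumption: Markdown-style headings
        # Close the current chunk at a section boundary or when the budget is hit.
        if current and (is_heading or count + words > max_words):
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks

sample = (
    "# Refund policy\n\nRefunds are issued within 30 days.\n\n"
    "Contact support to start one.\n\n# Shipping\n\nOrders ship in two days."
)
for chunk in paragraph_chunks(sample):
    print(repr(chunk))
```

The refund section stays together as one chunk. A single paragraph longer than `max_words` is kept whole in this sketch, which is the variable-size behavior that downstream context budgeting has to account for.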
Hierarchical chunking maintains parent-child relationships: small leaf chunks (e.g. paragraphs) are embedded and indexed for retrieval precision; larger parent chunks (e.g. full sections or articles) are returned to the LLM for context richness. The retrieval happens on the leaves --- small focused embeddings produce high-precision matches --- but the context provided to the LLM is the parent, which contains the leaf plus surrounding context the LLM may need to interpret it. The cost is storage overhead (both leaves and parents are stored) and retrieval logic complexity (the system must map from retrieved leaves to their parents). The benefit is that you get the precision of small-chunk retrieval and the context richness of large-chunk generation simultaneously --- the false tradeoff between the two collapses. LlamaIndex’s hierarchical node parser and LangChain’s parent document retriever both implement this pattern directly.
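The leaf-to-parent mapping is the only new piece of logic the pattern requires. The sketch below is framework-independent and does not reproduce the LlamaIndex or LangChain APIs; it assumes sections arrive as a dict of ID to text, that leaves are blank-line-separated paragraphs, and that some retriever (not shown) returns leaf IDs in ranked order.

```python
def build_hierarchy(sections):
    """Return (parents, leaves): full sections plus paragraph-level leaves."""
    parents, leaves = {}, []
    for parent_id, text in sections.items():
        parents[parent_id] = text
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        for i, para in enumerate(paragraphs):
            # Each leaf is what gets embedded; it remembers its parent section.
            leaves.append(
                {"leaf_id": f"{parent_id}#{i}", "parent_id": parent_id, "text": para}
            )
    return parents, leaves

def expand_to_parents(hit_leaf_ids, leaves, parents):
    """Map ranked leaf hits to their parent texts, deduplicated, order kept."""
    leaf_to_parent = {leaf["leaf_id"]: leaf["parent_id"] for leaf in leaves}
    seen, context = set(), []
    for leaf_id in hit_leaf_ids:
        parent_id = leaf_to_parent[leaf_id]
        if parent_id not in seen:
            seen.add(parent_id)
            context.append(parents[parent_id])
    return context

sections = {
    "refunds": "Refunds take 30 days.\n\nContact support to start one.",
    "shipping": "Orders ship in two days.",
}
parents, leaves = build_hierarchy(sections)
# Pretend the retriever matched two leaves from the same section, then one more.
context = expand_to_parents(["refunds#1", "refunds#0", "shipping#0"], leaves, parents)
print(context)
```

Two leaf hits from the same section collapse into one parent, so the context window is not spent on duplicates.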
Contextual chunking, introduced by Anthropic in September 2024 under the name Contextual Retrieval, prepends document-level context to each chunk before embedding. Each chunk’s text becomes: “This chunk is from the document titled X, in the section about Y. The chunk’s content is: [original chunk text].” The context is generated by an LLM per chunk (a cheap operation amortized over many retrievals) and the augmented text is then embedded. Anthropic’s reported results on their benchmark: roughly a 35% reduction in top-20 retrieval failure rate from contextual embeddings alone, 49% when contextual embeddings are combined with contextual BM25, and 67% when reranking is added on top. The cost is the per-chunk LLM call at indexing time and the additional tokens per chunk; the benefit is dramatic quality improvement on the cases where chunks lose their document context after splitting (which is most chunks in most corpora). Anthropic’s prompt-caching pricing makes the indexing cost manageable; the technique is broadly applicable and underadopted relative to its impact.
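The indexing-time transformation can be sketched as a single function. The version below assumes the template quoted above; the `describe` parameter is a hypothetical hook standing in for the per-chunk LLM call, and no real model API is shown. When no hook is supplied, the sketch falls back to a context line built from title and section metadata.

```python
def contextualize(chunk_text, doc_title, section, describe=None):
    """Return the text to embed: a context sentence followed by the chunk."""
    if describe is not None:
        # Hypothetical hook: in practice an LLM call that writes the context.
        context = describe(chunk_text, doc_title, section)
    else:
        # Metadata-only fallback, following the template in the text above.
        context = (
            f"This chunk is from the document titled {doc_title}, "
            f"in the section about {section}."
        )
    return f"{context} The chunk's content is: {chunk_text}"

augmented = contextualize(
    "Requests must be filed within 30 days.",
    doc_title="Customer Handbook",
    section="refund policy",
)
print(augmented)
```

Only the augmented string is embedded and indexed; the original chunk text can still be stored alongside it so that the LLM receives the unmodified passage at generation time.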
Practical guidance for chunking design. First, profile the corpus before choosing a strategy: are the documents prose, structured, or layout-heavy? Are they short or long? Do they have natural sectional structure? The right chunking strategy follows from the answer. Second, contextual chunking is the highest-leverage technique introduced since the original RAG papers; if the corpus has any document-level context worth preserving (which is most corpora), this technique pays for itself. Third, hierarchical chunking dissolves the precision-vs-context tradeoff and should be the default for any production deployment with non-trivial documents. Fourth, evaluate the chunking strategy with retrieval-level metrics (recall@K, NDCG against a labeled set) rather than end-to-end answer quality --- the chunking effects are easier to see at the retrieval layer where they originate, and end-to-end metrics conflate chunking quality with everything else.
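The two retrieval-level metrics named in the fourth point are short enough to keep in a test suite. This is a minimal sketch assuming a labeled set that maps each relevant document ID to a gain; it uses linear gain in the DCG sum (some evaluations use 2^gain - 1 instead), and the function names are illustrative.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def ndcg_at_k(retrieved, gains, k):
    """NDCG@k with linear gain; `gains` maps doc_id -> graded relevance."""
    def dcg(values):
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))
    actual = dcg([gains.get(doc_id, 0.0) for doc_id in retrieved[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

retrieved = ["d3", "d1", "d9", "d2"]  # ranked output of the retriever
gains = {"d1": 1.0, "d2": 1.0}        # labeled relevant documents
print(recall_at_k(retrieved, gains, k=2))
print(recall_at_k(retrieved, gains, k=4))
print(ndcg_at_k(retrieved, gains, k=4))
```

Running both chunking strategies through the same labeled queries and comparing these numbers isolates the chunking effect from reranking and generation.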
Chapter 5. From RAG to Agentic Retrieval
Through 2023 the dominant retrieval pattern in LLM applications was single-shot RAG: the user query goes to the retriever, the retriever returns top-K documents, the LLM generates an answer using the retrieved context. This is the baseline that nearly every retrieval-augmented system starts from. The pattern has obvious limitations --- if the first retrieval is poor, the system has no recovery path; if the question is multi-part, a single retrieval can’t address all parts; if the answer requires synthesizing across multiple retrievals, the single-shot pattern can’t support it. Across 2024 and into 2025, two waves of evolution addressed these limitations: iterative or corrective RAG (retrieve, evaluate, re-retrieve if needed) and agentic retrieval (the agent reasons about retrieval as one of its available tools).
Iterative or corrective RAG adds a check stage between retrieval and generation. After the retrieval, an evaluator (often the same LLM, sometimes a separate critic) assesses whether the retrieved context is sufficient and relevant. If yes, generation proceeds. If no, the system retries --- with a reformulated query, with different retrieval parameters, or by falling back to a different corpus or tool (web search, for instance, as the fallback when the local corpus comes up empty). The pattern, sometimes called Self-RAG or Corrective RAG (CRAG) in the research literature, adds resilience: the system recovers from bad first-pass retrievals rather than dropping the user into a hallucinated answer.
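The control flow of the check stage can be sketched independently of any model or index. In the example below every callable (`retrieve`, `is_sufficient`, `rewrite`, `fallback`) is a hypothetical stand-in supplied by the caller; the stubs at the bottom exist only to make the sketch runnable and are not how Self-RAG or CRAG implement their evaluators.

```python
def corrective_retrieve(query, retrieve, is_sufficient, rewrite, fallback,
                        max_attempts=3):
    """Retrieve, check, and retry with a rewritten query before falling back."""
    current = query
    for attempt in range(max_attempts):
        docs = retrieve(current)
        # The evaluator always judges against the user's original question.
        if is_sufficient(query, docs):
            return docs, "corpus"
        if attempt < max_attempts - 1:
            current = rewrite(query, docs)
    # Local corpus exhausted: hand over to a different source, e.g. web search.
    return fallback(query), "fallback"

# Stub components for illustration only.
index = {"myocardial infarction": ["doc-mi"]}
docs, source = corrective_retrieve(
    "heart attack",
    retrieve=lambda q: index.get(q, []),
    is_sufficient=lambda q, found: bool(found),
    rewrite=lambda q, found: "myocardial infarction",
    fallback=lambda q: ["web-result"],
)
print(docs, source)
```

The `max_attempts` bound matters in practice: without it a persistently failing query would loop, and each extra round adds latency and token cost.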
Agentic retrieval is the more fundamental shift. Instead of treating retrieval as a fixed pipeline that runs before the LLM, the system exposes retrieval as a tool the agent can call. The agent receives the user’s question, reasons about what information it needs, formulates a query, calls the retrieval tool, examines the results, decides whether to query again with different terms, decides whether to call additional tools (web search, code execution, database lookup), and eventually synthesizes an answer using the accumulated information. The pattern is documented across Volumes 1–3: it’s the ReAct loop (Volume 1) using retrieval as one of its tools (Volume 3) with the skills (Volume 2) that tell the agent how to reason about retrieval.
The shift from single-shot to agentic retrieval has three observable consequences. First, query understanding gets dynamic: the agent can decompose complex questions into sub-queries, retrieve for each, and synthesize --- the decomposition emerges from the agent’s reasoning rather than being a separate pipeline stage. Second, retrieval failures get recoverable: the agent observes “none of these results look relevant” and tries again with different terms, or escalates to a different tool. Third, latency and token cost climb significantly: each retrieval round is a tool call that consumes time and tokens; an agent that retrieves five times per answer costs roughly five times what single-shot RAG costs. The cost is real and only worth paying when the quality lift justifies it; for simple factual queries, single-shot retrieval with a strong reranker often beats agentic retrieval on both cost and latency, with no quality difference visible.
The practical recommendation is to ship single-shot hybrid retrieval with reranking as the baseline, add corrective RAG for the failure-recovery case when retrieval failures are a significant fraction of user complaints, and move to agentic retrieval only when the workload genuinely benefits --- complex multi-part questions, research-style tasks that require iterative refinement, deployments where the agent has multiple information sources to choose among. The shift to agentic retrieval is real and important, but it’s a power tool with real costs, not a default to reach for. The same caveats from the multi-agent volume (Volume 9) apply: defaulting to the more elaborate pattern produces systems that are slower, more expensive, and harder to debug, with no quality lift to justify any of it for the cases that didn’t need the elaboration.
Where this volume connects to the rest of the catalog is at exactly this point. The retrieval engineering covered across the next eight sections --- hybrid search engines, embedding models, rerankers, document processing, RAG orchestration, query understanding patterns, knowledge graphs, and discovery and benchmarks --- is what makes the retrieval tool worth calling. An agent calling a poorly-engineered retrieval tool gets poor results regardless of how sophisticated its reasoning is; an agent calling a well-engineered retrieval tool can produce results that single-shot RAG with the same components could never match. The retrieval discipline is the substrate; the agent layer is what makes the substrate dynamic. Both matter; this volume covers the substrate.
Part 2 — The Substrates
Eight sections follow. Each opens with a short essay on what its entries have in common and how they relate to alternatives. Representative substrates are presented in the same Fowler-style template used by the prior nine catalogs.
Sections at a glance
- Section A --- Hybrid search platforms
- Section B --- Embedding models
- Section C --- Rerankers and late interaction
- Section D --- Document processing
- Section E --- RAG orchestration
- Section F --- Query understanding patterns
- Section G --- Knowledge graphs and GraphRAG
- Section H --- Discovery and benchmarks
Section A — Hybrid search platforms
OpenSearch, Elasticsearch, Vespa --- the engines that do both BM25 and vectors
Three platforms dominate the hybrid-search-engine category as of mid-2026. OpenSearch is the Apache 2.0 fork of Elasticsearch (forked in 2021 when Elastic changed licensing) with strong native support for k-NN search, neural search, and learning-to-rank, primarily stewarded by AWS but open to broader contribution. Elasticsearch remains the dominant enterprise search platform under Elastic’s SSPL and Elastic License terms (with AGPL added as a third licensing option in 2024), with ELSER (Elastic Learned Sparse EncodeR) as its learned-sparse alternative to dense vectors. Vespa, originally Yahoo’s production search infrastructure and now Apache 2.0, takes a more programmable approach --- phased ranking, tensor support, native ColBERT-style late-interaction --- with a different design instinct than the Elastic/OpenSearch family.
All three support the canonical hybrid pattern (BM25 plus dense vectors, with reranking or fusion), all three have production deployments at significant scale, all three support metadata filtering for governance. The choice between them is more about operational fit (existing infrastructure, team familiarity, licensing posture) than about retrieval quality --- each can be configured to produce competitive results.
OpenSearch
Source: github.com/opensearch-project/OpenSearch (Apache-2; Java)
Classification Open-source hybrid search engine with native k-NN, neural search, and LTR.
Intent
Provide a fully open-source search and analytics engine that supports BM25, dense vector retrieval (k-NN), learned sparse retrieval, learning-to-rank, and the orchestration to combine them into hybrid search pipelines.
Motivating Problem
When Elastic changed Elasticsearch’s license to SSPL in 2021, AWS forked the open codebase and launched OpenSearch under Apache 2.0. The technical heritage is shared; the governance is different. For organizations that require Apache-2 licensing (cloud vendors, regulated industries, public-sector deployments) OpenSearch is the open path. For organizations that don’t have that constraint, the choice between OpenSearch and Elasticsearch is largely about ecosystem and team familiarity --- the technical surfaces have converged again over 2023–2025.
How It Works
OpenSearch supports the full hybrid search stack natively. BM25 is the default scoring algorithm on text fields. k-NN search is provided via the k-NN plugin, supporting HNSW, IVF, and exact search algorithms with multiple distance metrics (L2, cosine, inner product). The neural-search plugin handles embedding generation at ingest and query time via integration with external model servers (SageMaker, custom endpoints) or built-in model serving. Learned sparse retrieval (using models like SPLADE) is supported.
Hybrid queries combine BM25 and k-NN via the hybrid query type, which retrieves candidates from each retriever and fuses via score normalization or RRF. Search pipelines (introduced in 2.x) let teams configure multi-stage processing --- retrieve, rerank, post-process --- declaratively in the index configuration rather than at query time.
Learning-to-rank is provided via the LTR plugin, which integrates trained ranking models (LambdaMART, XGBoost-based) into the scoring pipeline. The pattern is familiar to enterprise search teams: train a ranking model on labeled query-document pairs (clicks, conversions, manual labels), serve it as part of the retrieval stack, A/B test against the previous ranking model. OpenSearch’s LTR plugin is the open-source successor to the Elasticsearch LTR plugin originally developed by OpenSource Connections.
When to Use It
Production search and RAG deployments needing fully-open-source licensing. AWS-native deployments where the managed OpenSearch service fits the architecture. Teams already familiar with the Elasticsearch DSL who need an open-license alternative. Cases where learning-to-rank is in scope and the LTR plugin matters.
Alternatives --- Elasticsearch for the original implementation under Elastic’s licensing. Vespa for the programmable, tensor-heavy alternative. Vector-first stores from Volume 6 (Qdrant, Weaviate, Pinecone, Milvus) when the BM25 side isn’t a strong requirement.
Sources
- github.com/opensearch-project/OpenSearch
- opensearch.org/docs/latest/search-plugins/hybrid-search/
Example artifacts
Schema / config.
// Hybrid search query: BM25 + k-NN with RRF fusion via a search pipeline
POST /products/_search?search_pipeline=hybrid_rrf_pipeline
{
  "query": {
    "hybrid": {
      "queries": [
        {
          "match": {
            "description": "waterproof hiking boots for cold weather"
          }
        },
        {
          "neural": {
            "description_embedding": {
              "query_text": "waterproof hiking boots for cold weather",
              "model_id": "my-embedding-model-id",
              "k": 100
            }
          }
        }
      ]
    }
  },
  "size": 20,
  "_source": ["id", "name", "description", "price"]
}
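// Search pipeline definition (created once). rank_constant sits directly
// under "combination" in the score-ranker-processor; check the field layout
// against the documentation for the OpenSearch version in use.
PUT /_search/pipeline/hybrid_rrf_pipeline
{
  "phase_results_processors": [
    {
      "score-ranker-processor": {
        "combination": {
          "technique": "rrf",
          "rank_constant": 60
        }
      }
    }
  ]
}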
Elasticsearch
Source: github.com/elastic/elasticsearch (SSPL + Elastic License + AGPL; Java)
Classification Dominant enterprise hybrid search engine with ELSER learned-sparse and dense vectors.
Intent
Provide the original Elasticsearch implementation under Elastic’s licensing, with first-class support for the full hybrid retrieval stack: BM25, dense vectors, ELSER learned sparse retrieval, RRF fusion, and the broader Elastic Stack (Kibana, Logstash, Beats, observability) for operational visibility.
Motivating Problem
Elasticsearch remains the dominant enterprise search platform with the deepest ecosystem (integrations, certified deployments, training, consulting partners) and the most mature operational tooling. For organizations whose constraint is “what does our existing team already operate” rather than “what’s the most open license,” Elasticsearch is typically the answer. The 2024 addition of AGPL as a licensing option addressed some of the concerns that drove the OpenSearch fork, though Apache-2 strictness remains a meaningful differentiator for some deployments.
How It Works
Elasticsearch supports the same hybrid stack as OpenSearch (the two share heritage and have converged again in capabilities). The differentiator is ELSER --- Elastic Learned Sparse EncodeR --- an in-house learned sparse retrieval model trained by Elastic and deployed natively in Elasticsearch. ELSER produces sparse vector representations that combine with BM25 in a complementary way: ELSER captures semantic relationships through term expansion while remaining inverted-index-friendly, avoiding the operational complexity of dense vector indexes for the sparse-retrieval portion of the hybrid pattern.
Hybrid queries combine BM25, dense, and ELSER via the rrf retriever type, which automatically applies Reciprocal Rank Fusion across the constituent retrievers. The pattern that Elastic’s benchmarks consistently show working best: BM25 + dense + ELSER + reranker, with the four-component ensemble outperforming any subset. The cost is operational complexity (four pipelines to maintain) and the win is the highest-quality retrieval the platform can produce.
Elastic’s integration with reranking is first-class: built-in inference endpoints for Cohere Rerank, Voyage Rerank, and Elastic’s own rerank models, with the inference call wired into the retrieval pipeline rather than orchestrated in application code. The 2024–2025 “search AI” positioning makes RAG-style retrieval pipelines configurable through the platform rather than requiring custom orchestration code.
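As a rough sketch of how the four components compose, the Python function below builds a search request body that fuses BM25, dense k-NN, and ELSER with the rrf retriever and wraps the result in a reranking retriever. The field names (description, description_embedding, description_elser), the model ID, and the inference endpoint IDs are all hypothetical, and the request shape should be checked against the retriever reference for the Elasticsearch version actually deployed.

```python
import json


def build_hybrid_request(query_text: str, size: int = 10) -> dict:
    """Build a search body: RRF over BM25, dense k-NN and ELSER, then rerank."""
    # Lexical leg: plain BM25 match on a text field.
    bm25 = {"standard": {"query": {"match": {"description": query_text}}}}
    # Dense leg: k-NN over a dense vector field, embedding the query server-side.
    dense = {
        "knn": {
            "field": "description_embedding",        # hypothetical field
            "query_vector_builder": {
                "text_embedding": {
                    "model_id": "my-dense-model",    # hypothetical model ID
                    "model_text": query_text,
                }
            },
            "k": 100,
            "num_candidates": 500,
        }
    }
    # Learned-sparse leg: ELSER term expansion through an inference endpoint.
    elser = {
        "standard": {
            "query": {
                "sparse_vector": {
                    "field": "description_elser",         # hypothetical field
                    "inference_id": "my-elser-endpoint",  # hypothetical endpoint
                    "query": query_text,
                }
            }
        }
    }
    # Fuse the three legs by rank, then rerank the fused window.
    fused = {
        "rrf": {
            "retrievers": [bm25, dense, elser],
            "rank_window_size": 100,
            "rank_constant": 60,
        }
    }
    reranked = {
        "text_similarity_reranker": {
            "retriever": fused,
            "field": "description",
            "inference_id": "my-rerank-endpoint",  # hypothetical endpoint
            "inference_text": query_text,
            "rank_window_size": 100,
        }
    }
    return {"retriever": reranked, "size": size}


body = build_hybrid_request("waterproof hiking boots for cold weather")
print(json.dumps(body, indent=2))
```

Building the body in application code keeps the four components visible in one place, which helps when ablating a leg to measure what it contributes.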
When to Use It
Enterprise deployments with existing Elasticsearch infrastructure. Cases needing the deepest commercial support, training, and partner ecosystem. Deployments where ELSER’s sparse-retrieval performance matters and replacing it with dense-only would degrade quality. Teams that prefer managed services through Elastic Cloud.
Alternatives --- OpenSearch for the Apache-2 fork. Vespa for the programmable alternative. Vector-first stores (Qdrant, Weaviate, etc.) for vector-dominant deployments where the BM25 side is less important.
Sources
- github.com/elastic/elasticsearch
- www.elastic.co/guide/en/elasticsearch/reference/current/learned-sparse-encoder.html
Vespa
Source: github.com/vespa-engine/vespa (Apache-2; C++ and Java)
Classification Programmable hybrid search platform with phased ranking and native ColBERT support.
Intent
Provide a search platform with deeper programmability than Elasticsearch or OpenSearch: tensor-native data model, phased ranking with custom expressions at each phase, first-class support for late-interaction (ColBERT-style) retrieval, and the operational maturity of having run Yahoo’s personalized news ranking at production scale for over a decade.
Motivating Problem
For deployments where retrieval quality and ranking sophistication matter more than ecosystem familiarity --- personalized recommendations, large-scale ad ranking, e-commerce search where every basis point of relevance affects revenue --- Vespa’s programmable ranking pipeline is materially more capable than the configuration-driven approaches in Elasticsearch and OpenSearch. The trade-off is the learning curve: Vespa’s mental model (documents-as-tensors, ranking-as-expressions, phased evaluation) is further from the SQL-or-Lucene-DSL mental model that most search engineers carry.
How It Works
Vespa’s data model is tensor-native: each document field can be a scalar, a string, a vector, or a higher-rank tensor. Embeddings are first-class; late-interaction representations (ColBERT’s per-token embeddings) are first-class; learning-to-rank features are computed during ranking rather than pre-computed and indexed. The model is more general than the document-as-set-of-fields model in Elasticsearch and OpenSearch, with corresponding additional complexity.
Ranking happens in phases. Phase 1 (first-phase ranking) runs on every retrieved candidate --- typically thousands --- with a cheap expression (BM25, simple vector similarity, basic feature combinations). Phase 2 (second-phase ranking) runs on the top-N from phase 1 --- typically hundreds --- with a more expensive expression that can include learning-to-rank models, cross-encoder calls via Vespa’s ONNX integration, or complex feature combinations. Phase 3 (global-phase ranking) runs on the final top results with the most expensive scoring. The phased model explicitly handles the cost-vs-quality tradeoff at the platform level rather than requiring application orchestration.
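The cost structure of phased ranking can be modeled in plain Python. This is a simplified illustration of the idea, not Vespa's rank-profile syntax or its execution model; phased_rank, the scoring functions, and the sample documents are all invented for the example.

```python
from typing import Callable


def phased_rank(
    candidates: list[dict],
    first_phase: Callable[[dict], float],
    second_phase: Callable[[dict], float],
    rerank_count: int,
) -> list[dict]:
    """Score everything cheaply, then rescore only the top rerank_count."""
    # Phase 1: the cheap expression runs on every candidate.
    ranked = sorted(candidates, key=first_phase, reverse=True)
    # Phase 2: the expensive expression runs only on the head of the list.
    head = sorted(ranked[:rerank_count], key=second_phase, reverse=True)
    # Documents outside the rerank window keep their first-phase order.
    return head + ranked[rerank_count:]


calls = {"cheap": 0, "costly": 0}


def cheap_score(doc: dict) -> float:
    calls["cheap"] += 1
    return doc["cheap"]


def costly_score(doc: dict) -> float:
    calls["costly"] += 1
    return doc["costly"]


docs = [
    {"id": "a", "cheap": 0.9, "costly": 0.2},
    {"id": "b", "cheap": 0.8, "costly": 0.9},
    {"id": "c", "cheap": 0.7, "costly": 0.5},
    {"id": "d", "cheap": 0.1, "costly": 1.0},
]
result = phased_rank(docs, cheap_score, costly_score, rerank_count=3)
order = [d["id"] for d in result]
print(order, calls)
```

Document d would have won under the expensive scorer but never reaches it, which is the trade-off the rerank window size controls: a larger window costs more expensive evaluations and misses fewer such documents.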
Native ColBERT support means late-interaction retrieval is a first-class index type rather than an external integration. Vespa’s ColBERT integration handles the per-token embeddings, the MaxSim aggregation, and the phased ranking pipeline integration with the rest of the retrieval stack. This is materially harder to implement well in Elasticsearch or OpenSearch.
When to Use It
Personalization, recommendations, and ranking-heavy applications where retrieval quality is a primary business metric. Deployments using ColBERT-style late-interaction retrieval where native support matters. Teams comfortable with Vespa’s programming model who need the additional ranking sophistication. Large-scale deployments where the operational efficiency of Vespa’s C++ core matters.
Alternatives --- Elasticsearch or OpenSearch when ecosystem familiarity matters more than ranking sophistication. Custom ranking infrastructure on top of vector-first stores when the deployment can be more focused.
Sources
- github.com/vespa-engine/vespa
- docs.vespa.ai/en/ranking.html
Section B — Embedding models
Proprietary APIs and open-weights families --- the dense representation layer
Two camps dominate the embedding model space as of mid-2026. The proprietary API camp --- OpenAI’s text-embedding-3 family, Cohere’s Embed v3, Voyage AI’s voyage-3 family, Google’s gemini-embedding, plus newer entrants --- provides hosted APIs with pay-per-token pricing, high quality, and minimal operational burden. The open-weights camp --- BAAI’s BGE family, Microsoft’s E5, Jina Embeddings, Nomic Embed Text, mxbai-embed-large --- provides downloadable models that run on local infrastructure, with quality competitive enough that the MTEB leaderboard regularly has open models in the top ranks.
The choice between camps is operational. Proprietary APIs are simpler to integrate and ship continuously updated models; open weights provide cost control at scale, data-residency control, and the ability to fine-tune on domain-specific data. Most production deployments end up with one or the other, occasionally both for different parts of the pipeline. The MTEB benchmark (Massive Text Embedding Benchmark, hosted on Hugging Face) is the de facto reference for current model quality across both camps.
Proprietary embedding APIs
Source: OpenAI text-embedding-3, Cohere Embed v3, Voyage AI voyage-3, Google gemini-embedding
Classification Hosted embedding API services with pay-per-token pricing.
Intent
Provide high-quality embedding models as managed API services, eliminating the operational burden of self-hosting embedding model inference at the cost of vendor lock-in and per-token API charges.
Motivating Problem
For most teams the right embedding model isn’t a competitive moat, just a substrate. Proprietary embedding APIs handle the model serving, scaling, updates, and quality improvements; the application code just calls an endpoint. The trade-off is vendor coupling --- the embeddings in the index are tied to the chosen model and provider --- and recurring per-token cost at scale.
How It Works
OpenAI text-embedding-3-large (3072 dimensions, supports dimension reduction to 256–3072 via Matryoshka representation) and text-embedding-3-small (1536 dimensions) are the OpenAI defaults; text-embedding-ada-002 remains widely deployed but superseded by the -3 generation. Cohere Embed v3 (1024 dimensions; multilingual variants available) emphasizes instruction-following (input_type parameter distinguishes search_query from search_document, which is meaningful for asymmetric retrieval). Voyage AI’s voyage-3-large (1024 dimensions) and voyage-code-3 (specialized for code) consistently rank near the top of MTEB. Google’s gemini-embedding integrates with the Gemini API stack.
Asymmetric encoding matters and is sometimes under-appreciated. Search queries and search documents have different distributions --- a 3-word query and a 200-word document don’t occupy the embedding space in the same way. Models that distinguish “embed this as a query” from “embed this as a document” (the input_type parameter in Cohere’s and Voyage’s APIs, the query and passage prefixes of the E5 family) produce measurably better retrieval than models that don’t. When the embedding API exposes the distinction, use it; the default of “one embedding function for everything” leaves quality on the table.
Matryoshka representation learning, popularized by OpenAI’s text-embedding-3 family in 2024, lets a single embedding be truncated to lower dimensions while preserving most of its quality. A 3072-dimension embedding can be truncated to 1024 or 512 dimensions with predictable quality degradation; the trade-off lets teams adjust the storage/quality/cost balance per deployment. The capability is increasingly common across providers.
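For embeddings that are already stored at full size, the truncation can also be done client-side. The sketch below keeps the leading dimensions and re-normalizes to unit length so cosine and dot-product comparisons stay consistent; it assumes the model was trained with a Matryoshka objective (truncating an ordinary embedding this way degrades quality unpredictably), and the function name is illustrative.

```python
import math


def truncate_embedding(vector: list[float], dimensions: int) -> list[float]:
    """Keep the first `dimensions` components and rescale to unit L2 norm."""
    if not 0 < dimensions <= len(vector):
        raise ValueError("dimensions must be between 1 and len(vector)")
    head = vector[:dimensions]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero prefix cannot be normalized
    return [x / norm for x in head]


full = [3.0, 4.0, 12.0]           # toy 3-dimensional "embedding"
short = truncate_embedding(full, 2)
print(short)
```

Queries and documents must be truncated to the same dimensionality, so the chosen size is effectively an index-level setting rather than a per-request one.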
When to Use It
Production deployments where the operational simplicity of an API call outweighs the cost. Multi-tenant SaaS applications where the cost per tenant is small. Cases needing the latest model improvements without re-indexing infrastructure. Teams that need to ship retrieval quickly and treat the embedding model as an upgrade-able substrate.
Alternatives --- open-weights models for cost control at scale, data-residency requirements, or fine-tuning needs. Embedding-included managed search platforms (Elastic with ELSER, OpenSearch with managed embedding) when the integration matters more than the choice of model.
Sources
- platform.openai.com/docs/guides/embeddings
- docs.cohere.com/docs/embeddings
- docs.voyageai.com/embeddings/
Example artifacts
Code.
import cohere
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Asymmetric encoding: queries and documents use the same model but
# (for some providers) different prompt prefixes or input types.


def embed_query(query: str, dimensions: int = 1024) -> list[float]:
    """Embed a search query with Matryoshka dimension reduction."""
    resp = client.embeddings.create(
        model="text-embedding-3-large",
        input=query,
        dimensions=dimensions,  # Matryoshka: reduce from 3072 to 1024
    )
    return resp.data[0].embedding


def embed_documents(documents: list[str], dimensions: int = 1024) -> list[list[float]]:
    """Embed documents in one batched call to cut per-request overhead."""
    resp = client.embeddings.create(
        model="text-embedding-3-large",
        input=documents,
        dimensions=dimensions,
    )
    return [d.embedding for d in resp.data]


# Cohere with explicit asymmetric encoding
co = cohere.Client()  # reads CO_API_KEY from the environment

# Illustrative inputs so the snippet is self-contained
query = "waterproof hiking boots for cold weather"
documents = [
    "Insulated waterproof boot for winter hiking.",
    "Lightweight trail runner with a mesh upper.",
]

query_emb = co.embed(
    texts=[query],
    model="embed-english-v3.0",
    input_type="search_query",  # different from search_document!
).embeddings[0]

doc_embs = co.embed(
    texts=documents,
    model="embed-english-v3.0",
    input_type="search_document",
).embeddings
Open-weights embedding models
Source: BAAI BGE, Microsoft E5, Jina Embeddings, Nomic, mxbai (Hugging Face)
Classification Self-hostable embedding models with MTEB-leaderboard-competitive quality.
Intent
Provide downloadable embedding model weights that run on local infrastructure, eliminating per-token API charges and external dependencies at the cost of operational responsibility for serving the model.
Motivating Problem
At scale, embedding API charges add up. A corpus of ten million documents at 500 tokens each is five billion tokens, so one full embedding pass is billed at 5,000 times the provider’s per-million-token rate. That bill is paid during initial indexing and again on every re-embedding after a model upgrade, which becomes real money as corpus size and refresh frequency grow. Self-hosted embedding models eliminate the per-token cost (replaced by fixed inference compute), keep documents on-premises (relevant for data-residency requirements), and can be fine-tuned on domain-specific data (sometimes producing substantial domain-specific quality lifts).
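The arithmetic behind a full embedding pass is simple enough to script. The sketch below multiplies corpus size by a per-million-token rate; the rate passed in is a placeholder to be replaced with the provider's current price list, not a quoted price, and the function name is illustrative.

```python
def embedding_pass_cost(
    num_documents: int,
    tokens_per_document: int,
    usd_per_million_tokens: float,
) -> float:
    """Cost in USD of embedding every document once."""
    total_tokens = num_documents * tokens_per_document
    return total_tokens / 1_000_000 * usd_per_million_tokens


total_tokens = 10_000_000 * 500  # ten million documents, 500 tokens each
# 0.10 USD per million tokens is a placeholder rate for illustration only.
one_pass = embedding_pass_cost(10_000_000, 500, usd_per_million_tokens=0.10)
# Re-embedding after each model upgrade repeats the full pass.
three_passes = 3 * one_pass
print(f"{total_tokens:,} tokens; one pass ${one_pass:,.2f}; three passes ${three_passes:,.2f}")
```

The same function applied to the expected daily query and update volume gives the recurring side of the bill, which is the number to compare against the fixed cost of self-hosted inference.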
How It Works
BAAI’s BGE family (BAAI General Embedding) is the most-cited open embedding family, with BGE-M3 supporting multilingual embedding plus sparse and ColBERT-style multi-vector representations from the same model. Microsoft’s E5 family (E5, E5-mistral-7b-instruct, multilingual-e5) is comparable in quality, with instruction-following capability via prompt prefixes. Jina Embeddings v3 supports late-chunking and task-specific LoRAs (one model with task-specific adapters loaded per query type). Nomic Embed Text and mxbai-embed-large are competitive single-model options.
MTEB (Massive Text Embedding Benchmark) is the de facto reference. The leaderboard tracks dozens of models across multiple task types (retrieval, clustering, classification, semantic similarity). Open-weights models regularly occupy top positions on the retrieval task, often within a few percentage points of the best proprietary APIs. The leaderboard’s task diversity also surfaces model-specific strengths: a model that’s top-3 on retrieval but mid-pack on semantic similarity might be the right choice for a retrieval-heavy deployment.
Serving the model is its own engineering. Inference servers (vLLM, Text Embeddings Inference / TEI from Hugging Face, TGI for the generative variants) handle the throughput; GPU memory budgets determine batch sizes; quantization (int8, int4) trades a small quality loss for significant memory and latency reduction. For deployments below ~1B embeddings per day, the operational complexity often outweighs the API cost savings; above that threshold the calculus reverses.
When to Use It
Large-scale deployments where per-token API costs are material. Data-residency-constrained deployments where documents can’t leave the deployment’s network. Domain-specific deployments where fine-tuning produces measurable quality lifts. Research deployments where reproducibility requires fixed model weights.
Alternatives --- proprietary APIs for the operational simplicity. Managed embedding services (Hugging Face Inference Endpoints, Modal, Replicate) when the model is open but the operational burden of self-hosting isn’t.
Sources
- huggingface.co/spaces/mteb/leaderboard
- huggingface.co/BAAI/bge-m3
- huggingface.co/intfloat/multilingual-e5-large
Section C — Rerankers and late interaction
Cross-encoder rerankers and ColBERT-style models --- the second stage that earns its cost
Reranking is the single highest-leverage improvement available in most retrieval pipelines. A cross-encoder reranker takes a query and a candidate document together, attends across both with full transformer machinery, and produces a relevance score that captures interactions a bi-encoder embedding (where query and document are embedded independently) cannot. The cost is computational: scoring N candidates requires N cross-encoder forward passes, versus one query embedding plus N vector comparisons for dense retrieval. The two-stage pattern --- cheap retrieval for the candidate set, expensive reranking for the final order --- reconciles the cost.
Two flavors of reranking exist. Cross-encoder rerankers (Cohere Rerank, BGE Reranker, Voyage Rerank) are the canonical form: a transformer scores query-document pairs. Late-interaction models (ColBERT, ColPali, JaColBERT) sit between bi-encoders and cross-encoders: per-token embeddings with MaxSim aggregation, precomputable document representations, lower latency than cross-encoders, higher quality than bi-encoders. Both flavors are productized; both are worth understanding.
Cross-encoder rerankers
Source: Cohere Rerank, BGE Reranker, Voyage Rerank, Jina Reranker
Classification Second-stage rerankers using full cross-attention over query-document pairs.
Intent
Provide cross-encoder rerankers that score query-document pairs with full transformer attention, producing higher-quality ranking over the candidate set returned by first-stage retrieval.
Motivating Problem
Dense retrieval (bi-encoder embeddings, k-NN search) is fast because it precomputes document embeddings and uses approximate nearest-neighbor algorithms at query time. The price is representational: query and document each get one vector, and similarity reduces to a single inner product. Cross-encoders are slow because they require running a transformer over each query-document pair at query time. The price buys representational power --- the cross-encoder attends across query and document together, capturing interactions that single-vector similarity can’t. The two-stage pattern uses the fast retriever to narrow the field, then the slow reranker to order the survivors.
How It Works
Cohere Rerank (rerank-english-v3.0, rerank-multilingual-v3.0) is the most-deployed managed reranker. The API takes a query and a list of candidate documents and returns the documents reordered by relevance score. Typical usage: retrieve top-100 candidates via hybrid search, send them to Cohere Rerank, take top-10 or top-20 for the LLM context. Latency is roughly 100–300 ms for 100 candidates; cost is priced per 1,000 search units and scales with input length.
BGE Reranker (BAAI/bge-reranker-v2-m3 and successors) is the open-weights equivalent, runnable on local GPU infrastructure with sub-100ms latency depending on batch size and model size. Voyage Rerank (rerank-2, rerank-2-lite) and Jina Reranker provide additional commercial options. The MTEB reranking benchmark tracks current quality across the field.
The integration shape is straightforward: a wrapper around the retrieval pipeline that takes the first-stage output, calls the reranker, and returns the reordered list. LlamaIndex, LangChain, and Haystack all have first-class reranker integrations; custom code is a few dozen lines. The quality lift from adding a reranker to a previously-unreranked pipeline is consistently substantial across deployments --- typical reports range from 10% to 50% improvement in retrieval-quality metrics depending on baseline.
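Whether a reranker earns its latency is an empirical question for each corpus, and a small offline metric makes the comparison concrete. The sketch below computes nDCG@k for one query before and after reranking; the relevance labels and document IDs are invented for the example, and a real evaluation would average over a labeled query set.

```python
import math


def dcg(gains: list[float]) -> float:
    """Discounted cumulative gain: each gain is discounted by log2(position + 1)."""
    return sum(g / math.log2(pos + 1) for pos, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_ids: list[str], relevance: dict[str, float], k: int) -> float:
    """nDCG@k for one query; documents without a label count as gain 0."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


relevance = {"a": 3.0, "b": 2.0, "c": 1.0}   # hypothetical graded labels
first_stage = ["x", "c", "b", "a"]           # hypothetical order before reranking
reranked = ["a", "b", "c", "x"]              # hypothetical order after reranking

before = ndcg_at_k(first_stage, relevance, k=3)
after = ndcg_at_k(reranked, relevance, k=3)
print(f"nDCG@3 before={before:.3f} after={after:.3f}")
```

Reporting the metric at the cutoff that matches the context budget (the 5 to 20 documents actually passed to the LLM) keeps the measurement tied to what the generator sees.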
When to Use It
Any production RAG deployment where retrieval quality matters more than the 100–300 ms reranker latency. Cases where the first-stage retrieval returns 50–200 candidates and the LLM context needs 5–20 documents. Multi-domain or multilingual deployments where the reranker handles language variation that the first-stage retriever doesn’t.
Alternatives --- late-interaction models (ColBERT) for the lower-latency middle ground. Skip reranking entirely for latency-critical deployments where the first-stage retrieval is good enough. Custom-trained cross-encoders for domain-specialized cases where the off-the-shelf rerankers underperform.
Sources
- docs.cohere.com/docs/rerank-2
- huggingface.co/BAAI/bge-reranker-v2-m3
Example artifacts
Code.
import cohere

co = cohere.Client()  # reads CO_API_KEY from the environment


def hybrid_retrieve_and_rerank(query: str, top_k_first_stage: int = 100,
                               top_n_rerank: int = 20) -> list[dict]:
    # Stage 1: hybrid retrieval (BM25 + dense) returns ~100 candidates.
    # hybrid_retriever is a placeholder for the deployment's own first-stage
    # search function; it is not defined in this snippet.
    candidates = hybrid_retriever(query, top_k=top_k_first_stage)
    # candidates is a list of dicts: [{"id": ..., "text": ..., "score": ...}, ...]
    if not candidates:
        return []

    # Stage 2: Cohere Rerank reorders by cross-encoder relevance
    rerank_response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=[c["text"] for c in candidates],
        top_n=top_n_rerank,
        return_documents=False,
    )

    # Reorder candidates by Cohere's ranking; result.index points back into
    # the documents list that was sent. Copy each dict so the first-stage
    # results are not mutated.
    reranked = []
    for result in rerank_response.results:
        reranked.append({**candidates[result.index],
                         "rerank_score": result.relevance_score})
    return reranked


# Typical pipeline shape:
# Stage 1 returns 100 candidates from BM25 + dense + RRF fusion
# Stage 2 reorders, returns top 20 with rerank_score attached
# Stage 3 (context assembly) takes top 5-10 for the LLM context window
ColBERT and late-interaction models
Source: stanford-futuredata/ColBERT, vespa-engine ColBERT, ColPali
Classification Late-interaction retrieval and reranking with per-token embeddings.
Intent
Provide a middle ground between bi-encoders (single-vector embeddings, fast but limited representation) and cross-encoders (full cross-attention, expressive but slow): per-token embeddings with MaxSim aggregation, document representations precomputable and indexable, query-time computation lighter than cross-encoders.
Motivating Problem
Cross-encoders are too slow for first-stage retrieval and ideal for second-stage reranking; bi-encoders are fast enough for first-stage retrieval but representationally limited. Late-interaction models fill the middle: per-token embeddings (each token gets a vector instead of the whole document being one vector) with MaxSim aggregation (for each query token, find the best-matching document token; sum the matches) produce representations that capture finer-grained query-document alignment than bi-encoders, with retrieval cost that scales with average tokens per document but doesn’t require a transformer pass per query-document pair.
How It Works
ColBERT (originally Stanford’s work, now in ColBERTv2 with improvements to indexing efficiency) is the canonical late-interaction model. Each document is encoded into ~100–200 per-token vectors (one per content token after pruning); each query is encoded into ~32 per-token vectors. At retrieval time, the system computes, for each query token, the maximum cosine similarity over all document tokens (MaxSim); the document’s score is the sum of these max-similarities across query tokens. The aggregation captures cases where individual query tokens match strongly against document tokens regardless of position --- the kind of partial-match behavior that’s natural for retrieval.
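The MaxSim scoring rule described above fits in a few lines. The sketch below is a pure-Python illustration over toy 2-dimensional vectors, not the ColBERT codebase's implementation (which batches this on GPU and prunes candidates first); the function names and vectors are invented for the example.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def maxsim_score(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """Sum over query tokens of the best cosine match among document tokens."""
    if not doc_vecs:
        return 0.0
    return sum(max(cosine(q, d) for d in doc_vecs) for q in query_vecs)


query = [[1.0, 0.0], [0.0, 1.0]]              # two query-token vectors
doc_a = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]  # covers both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]              # covers only the first

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Document order plays no role: doc_a matches the second query token with its first vector and the first query token with its last, which is the position-independent matching the aggregation is designed to capture.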
Indexing cost is higher than single-vector embeddings: ~100x storage per document for the per-token vectors. Modern implementations (PLAID, Vespa’s native ColBERT support) use centroid-based pruning and compression to keep the index size manageable, typically within 5–10x of a single-vector index.
ColPali (introduced 2024) extends the late-interaction pattern to documents-as-images: instead of OCR-then-embed, the document image is encoded directly by a vision-language model with per-patch embeddings, which are scored by MaxSim against the query’s per-token embeddings. The result is dramatic for layout-heavy documents (tables, forms, scientific papers, scanned content) where OCR loses information that direct image encoding preserves. ColPali is more research-grade than production-ready as of mid-2026, but it’s the direction of late-interaction’s near-term evolution.
Vespa’s native ColBERT support and RAGatouille (a Python wrapper around ColBERT focused on RAG use cases) are the production-relevant integration paths. Standalone ColBERT serving is also possible but operationally heavier than the bi-encoder embedding model serving most teams already do.
When to Use It
Retrieval deployments where the latency-quality tradeoff of late interaction beats both bi-encoders and cross-encoders. Vespa-based deployments where the native ColBERT support is first-class. Document-image retrieval (ColPali) where OCR-then-text-retrieval loses too much information.
Alternatives --- cross-encoder reranking on top of bi-encoder retrieval for the more common deployment shape. Pure bi-encoder retrieval when the additional storage and serving complexity isn’t justified by quality gains.
Sources
- github.com/stanford-futuredata/ColBERT
- github.com/AnswerDotAI/RAGatouille
- arxiv.org/abs/2407.01449 (ColPali)
Section D — Document processing
Unstructured.io, LlamaParse, and the newer entrants for high-fidelity parsing
The retrieval stack starts with documents and ends with answers. If the document-processing layer at the start drops information --- by misparsing a table, by collapsing a complex layout into a single column of text, by failing to OCR scanned pages --- the information loss propagates through every subsequent layer and shows up at the end as poor retrieval quality. Document processing is the layer most teams treat as a black box and the layer most worth investing in for deployments where the source content is anything beyond clean prose.
Two products dominate. Unstructured.io is the open-source standard, with broad format support (PDFs, Word, HTML, images, email, presentations) and an element-based output model (Title, NarrativeText, Table, ListItem, etc.) that downstream chunking can use intelligently. LlamaParse is LlamaIndex’s commercial document parser, LLM-driven for layout understanding and particularly strong on complex tables, academic papers, and financial documents. Newer entrants --- Marker, Docling (IBM), Reducto --- compete on quality, speed, and price points for specific use cases.
Unstructured.io
Source: github.com/Unstructured-IO/unstructured (Apache-2; Python) and commercial API
Classification Open-source document processing with element-based extraction.
Intent
Provide a unified document-processing library that handles PDFs, Word documents, HTML, images, emails, presentations, and many other formats, producing a structured element stream (Title, NarrativeText, Table, ListItem, etc.) that downstream chunking and indexing can consume intelligently.
Motivating Problem
Production RAG corpora typically contain a mix of document formats: PDFs (some text-native, some scanned), Word documents, HTML web pages, plain text files, email exports, presentation decks, CSV exports. Each format has its own parsing libraries; each parsing library has its own conventions; integrating five different libraries per format produces brittle pipelines. Unstructured.io’s answer is a unified API: pass any document, receive a standardized element stream that the rest of the pipeline can process uniformly.
How It Works
The library exposes partition() functions per format (partition_pdf, partition_docx, partition_html, etc.) plus a generic partition() that auto-detects. Each returns a list of Element objects with types (Title, NarrativeText, ListItem, Table, Image, Footer, Header) and metadata (page number, coordinates, parent section, language detection).
PDF parsing has multiple strategies. The fast strategy uses heuristics (PDFMiner-based) and works well on text-native PDFs. The hi_res strategy uses YOLOX-based layout detection to identify tables and complex structure; this is slower but produces structured output for layout-heavy documents. The ocr_only strategy is for scanned PDFs without embedded text, using Tesseract under the hood. The strategy is configurable per call.
Table extraction is a strength: detected tables are returned as Table elements with both HTML representations (for fidelity) and text representations (for embedding). For corpora with significant table content (financial reports, scientific papers, product specifications), this is dramatically better than the alternative of “flatten tables into text and hope the embedding picks up the structure.”
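One way to use the two representations is to embed the text form and keep the HTML form for display or for the LLM context. The sketch below reads the category, text, and metadata.text_as_html attributes that the library's elements carry; to stay runnable without the library installed it uses stand-in objects, and the function name and output keys are invented for the example. Whether text_as_html is populated depends on the partitioning strategy used.

```python
from types import SimpleNamespace


def table_payloads(elements: list) -> list[dict]:
    """Collect table elements as {text to embed, HTML to display} pairs."""
    payloads = []
    for el in elements:
        if el.category != "Table":
            continue
        # text_as_html may be missing when table structure was not inferred.
        html = getattr(el.metadata, "text_as_html", None)
        payloads.append({
            "embed_text": el.text,
            "display_html": html or el.text,  # fall back to plain text
            "page_number": getattr(el.metadata, "page_number", None),
        })
    return payloads


# Stand-in elements that mimic the attributes used above.
elements = [
    SimpleNamespace(category="Title", text="Income Statement",
                    metadata=SimpleNamespace(page_number=4)),
    SimpleNamespace(category="Table", text="Revenue 100 123",
                    metadata=SimpleNamespace(
                        page_number=4,
                        text_as_html="<table><tr><td>Revenue</td><td>100</td><td>123</td></tr></table>")),
    SimpleNamespace(category="Table", text="Costs 60 70",
                    metadata=SimpleNamespace(page_number=5)),
]
payloads = table_payloads(elements)
print(payloads)
```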
The commercial Unstructured Serverless API provides hosted processing with higher-quality models (better layout detection, better table extraction) than the open-source library; the trade-off is per-page pricing. Most production deployments use the open-source library for high-volume routine documents and the commercial API for the long-tail of difficult content.
When to Use It
Any RAG deployment with multi-format document inputs. Cases needing element-based output to drive intelligent chunking (e.g. chunk-by-section rather than chunk-by-tokens). Production pipelines where format coverage and operational simplicity matter more than top-tier quality on the difficult edge cases.
Alternatives --- LlamaParse for higher-quality LLM-driven parsing of complex documents. Marker for fast PDF-to-markdown when only PDFs are in scope. Custom format-specific libraries when the document mix is narrow.
Sources
- github.com/Unstructured-IO/unstructured
- docs.unstructured.io
Example artifacts
Code.
from unstructured.partition.auto import partition

# Auto-detect format and process
elements = partition(
    filename="financial_report_2025.pdf",
    strategy="hi_res",  # use layout detection for tables and complex structure
    languages=["eng"],
)

# elements is now a list of typed Element objects:
for el in elements:
    print(f"{el.category}: {el.text[:80]}...")
# Output:
# Title: Annual Report 2025
# NarrativeText: This year saw continued growth in our core...
# Table: [HTML representation of the income statement]
# ListItem: Revenue increased by 23% year-over-year
# Title: Risk Factors
# NarrativeText: The following risk factors could materially affect...


# Element-aware chunking: keep titles with their following content
def chunk_by_section(elements):
    chunks = []
    current_chunk = {"title": None, "content": []}
    for el in elements:
        if el.category == "Title":
            # A new title closes the previous section, if it has content.
            if current_chunk["content"]:
                chunks.append(current_chunk)
            current_chunk = {"title": el.text, "content": []}
        else:
            current_chunk["content"].append(el.text)
    if current_chunk["content"]:
        chunks.append(current_chunk)
    return chunks
LlamaParse and the LLM-driven document parsers
Source: cloud.llamaindex.ai (LlamaParse, commercial), and alternatives: Marker, Docling, Reducto
Classification LLM-driven document parsing with high-fidelity layout and table understanding.
Intent
Provide LLM-driven document parsing that handles complex layouts --- multi-column documents, nested tables, mathematical content, scientific papers, financial filings --- with fidelity that heuristic-based parsers struggle to match.
Motivating Problem
For complex documents where layout matters (academic papers with multi-column text and figures, financial filings with nested tables and footnotes, technical specifications with mathematical content), heuristic parsers like the default Unstructured strategy lose information. The pages render correctly to humans but the parsed text loses the structure. LLM-driven parsers solve this by passing page images to a multimodal model that produces structured output directly --- the model sees the document the way a human reader does and writes out the content with structure preserved.
How It Works
LlamaParse uses an LLM-driven pipeline: each page is rendered to an image, the image is passed to a multimodal model with a parsing prompt, and the model produces markdown (or JSON) output with tables, formulas, and layout structure preserved. Pricing is per page and quality is high, but the pipeline is slow: seconds per page, against milliseconds per page for heuristic parsers.
Marker (open-source, VikParuchuri/marker on GitHub) is the open-weights analog: optimized for fast PDF-to-markdown conversion with layout-aware extraction, no LLM API dependency at inference time. Quality is comparable to LlamaParse on most documents, with the trade-off of self-hosted infrastructure. Marker has become a popular choice for teams that need LlamaParse-level quality without per-page API costs.
Docling (IBM, open-source as of 2024) is another open-weights option, with strong table extraction and integration into the IBM watsonx stack. Reducto is a commercial high-fidelity parser positioned for enterprise document processing (financial services, legal, healthcare) where the per-page cost is justified by document criticality.
The decision among these alternatives is operational: per-page commercial pricing vs self-hosted compute, quality on the team’s specific document mix, latency budgets, ecosystem fit (LlamaParse integrates natively with LlamaIndex; Docling with watsonx; Marker stands alone). Profile representative documents against multiple parsers before committing.
When to Use It
Document corpora where layout matters and heuristic parsing loses information: academic papers, financial filings, technical specifications, government documents, complex tables. Cases where the per-page cost of LLM-driven parsing is justified by the document’s value. Hybrid pipelines where simple documents use heuristic parsers and complex documents use LLM-driven parsers selectively.
Alternatives --- Unstructured.io for the multi-format unified-API case. Format-specific parsers when the document type is narrow. Manual extraction when the corpus is small enough that human-in-the-loop processing is cheaper than building automation.
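The hybrid pipeline described above, with heuristic parsing for simple documents and LLM-driven parsing for complex ones, can be sketched as a small router. This is a minimal sketch and not any parser's API: DocProfile, its fields, and the thresholds are hypothetical, and in practice the profile would come from a cheap first-pass inspection of each document.

```python
from dataclasses import dataclass


@dataclass
class DocProfile:
    """Cheap first-pass facts about a document (all fields are hypothetical)."""
    pages: int
    table_count: int
    has_multi_column_layout: bool
    has_math: bool


def choose_parser(profile: DocProfile, max_llm_pages: int = 200) -> str:
    """Return 'llm' for layout-heavy documents, 'heuristic' otherwise."""
    layout_heavy = (
        profile.has_multi_column_layout
        or profile.has_math
        or profile.table_count >= 3  # assumed threshold, tune on real documents
    )
    # Cap per-page spend: very long documents stay on the heuristic path.
    if layout_heavy and profile.pages <= max_llm_pages:
        return "llm"
    return "heuristic"


memo = DocProfile(pages=2, table_count=0, has_multi_column_layout=False, has_math=False)
filing = DocProfile(pages=80, table_count=25, has_multi_column_layout=True, has_math=False)
print(choose_parser(memo), choose_parser(filing))
```

The page cap is one possible way to bound per-page cost; a team could equally route by document value or by a sampled quality check.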
Sources
- cloud.llamaindex.ai
- github.com/VikParuchuri/marker
- github.com/DS4SD/docling
Section E — RAG orchestration
LlamaIndex and LangChain retrievers --- the frameworks that compose the stack
Two frameworks dominate the orchestration layer that connects retrieval components into runnable pipelines. LlamaIndex is the retrieval-first framework: its core abstractions (Index, Retriever, QueryEngine, Node) are organized around the retrieval workflow, and the framework’s opinionation reflects years of focused investment in the RAG shape. LangChain is the general-purpose framework whose retriever abstractions are part of a broader agent toolkit; the retrievers are interchangeable Runnables that compose with the rest of LangChain’s LCEL.
Choosing between them is partly about retrieval depth (LlamaIndex has more retrieval-specific abstractions; LangChain has more agent-and-tool abstractions) and partly about ecosystem fit (which framework the rest of the application uses). Most production deployments end up with one as primary and the other as occasional dependency; the abstractions translate cleanly enough that switching costs are bounded.
LlamaIndex
Source: github.com/run-llama/llama_index (Python; MIT)
Classification Retrieval-first framework with native abstractions for the full RAG stack.
Intent
Provide a Python framework whose core abstractions are organized around retrieval: documents become nodes; nodes feed indexes; indexes produce retrievers; retrievers compose into query engines; query engines compose into agents. The framework’s opinionation matches the shape of production RAG.
Motivating Problem
Production RAG pipelines have many moving parts: document parsing, chunking with different strategies for different content types, embedding with chosen models, indexing in chosen stores, hybrid retrieval with optional filters, reranking with cross-encoders, query transformation, context assembly with deduplication, and final generation with citations. Each part has its own products and its own decisions; composing them ad-hoc produces fragile pipelines. LlamaIndex’s answer is a framework whose abstractions match the pipeline’s shape directly, with first-class support for each layer and well-defined seams between them.
How It Works
Documents are loaded via SimpleDirectoryReader or one of many LlamaHub loaders. Nodes (the chunks) are produced by NodeParsers (SentenceSplitter, HierarchicalNodeParser, SemanticSplitterNodeParser, MarkdownNodeParser, CodeSplitter). Embeddings are produced by Embedding objects (OpenAIEmbedding, HuggingFaceEmbedding, CohereEmbedding, VoyageEmbedding). The result is indexed via Index objects (VectorStoreIndex, KeywordTableIndex, KnowledgeGraphIndex, ComposableGraph) backed by configured vector stores.
On the query side, the index produces a Retriever (which can be tuned with similarity_top_k, filters, retrieval mode --- hybrid, dense, sparse). The retriever feeds a QueryEngine that may apply postprocessing: NodePostprocessor objects implement reranking (CohereRerank, FlashRankRerank, SentenceTransformerRerank), deduplication, similarity filtering, time-decay weighting. The QueryEngine composes with response synthesizers (CompactAndRefine, TreeSummarize, Refine) that handle how retrieved content fits into the LLM’s context window.
The agent layer (FunctionAgent, ReActAgent, the newer Workflow-based agents) treats query engines as tools. A multi-tool agent might have one query engine per knowledge source plus utility tools; the agent reasons about which tool to call for which question. This is the agentic retrieval pattern from Chapter 5 implemented in framework idiom.
Hierarchical retrieval is a first-class pattern. HierarchicalNodeParser produces parent-child node trees; AutoMergingRetriever retrieves leaf nodes for precision and returns ancestors when enough leaves come from the same ancestor (the parent-child idea from Chapter 4 implemented). This is the closest production-ready packaging of the parent-child chunking strategy.
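The merging rule can be illustrated without the framework. The sketch below is a simplified, single-level version of the idea and not LlamaIndex's implementation: the function name, the plain-dict data structures, and the 0.5 threshold are assumptions for illustration.

```python
from collections import defaultdict


def auto_merge(retrieved_leaves, parent_of, children_of, threshold=0.5):
    """Replace sibling leaves with their parent when more than `threshold`
    of that parent's children were retrieved. One level of merging only."""
    by_parent = defaultdict(list)
    for leaf in retrieved_leaves:
        by_parent[parent_of[leaf]].append(leaf)

    merged, seen = [], set()
    for leaf in retrieved_leaves:  # preserve retrieval order
        parent = parent_of[leaf]
        ratio = len(by_parent[parent]) / len(children_of[parent])
        node = parent if ratio > threshold else leaf
        if node not in seen:
            seen.add(node)
            merged.append(node)
    return merged


children_of = {"P1": ["a", "b", "c", "d"], "P2": ["e", "f", "g", "h"]}
parent_of = {child: parent for parent, kids in children_of.items() for child in kids}
print(auto_merge(["a", "e", "b", "c"], parent_of, children_of))  # ['P1', 'e']
```

Three of P1's four children were retrieved, so they collapse into P1; only one of P2's children matched, so the leaf is kept for precision.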
When to Use It
Production RAG deployments where the retrieval pipeline is the dominant complexity. Cases needing the full stack (parsing through reranking through synthesis) under one framework. Hierarchical retrieval and other advanced patterns. Teams that prefer retrieval-first abstractions over general agent frameworks.
Alternatives --- LangChain when retrieval is part of a larger agent toolkit. Custom code when the framework abstractions don’t fit the deployment. Haystack 2.x for pipeline-DAG-style composition, though its production momentum has slowed relative to LlamaIndex and LangChain.
Sources
- github.com/run-llama/llama_index
- docs.llamaindex.ai
Example artifacts
Code.
# Production hierarchical RAG with reranking in LlamaIndex
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.embeddings.voyageai import VoyageEmbedding
from llama_index.llms.anthropic import Anthropic
from llama_index.vector_stores.qdrant import QdrantVectorStore

Settings.llm = Anthropic(model="claude-opus-4-7")
Settings.embed_model = VoyageEmbedding(model_name="voyage-3", voyage_api_key=...)

# 1. Load and parse with hierarchical chunking
docs = SimpleDirectoryReader("./corpus").load_data()
parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[2048, 512, 128])
nodes = parser.get_nodes_from_documents(docs)
leaf_nodes = get_leaf_nodes(nodes)

# 2. Index leaves in Qdrant
vector_store = QdrantVectorStore(client=..., collection_name="corpus")
index = VectorStoreIndex(leaf_nodes, vector_store=vector_store, storage_context=...)

# 3. AutoMerging retriever: retrieve leaves, merge into parents when warranted
base_retriever = index.as_retriever(similarity_top_k=20)
retriever = AutoMergingRetriever(base_retriever, storage_context=..., verbose=True)

# 4. Add cross-encoder reranking
rerank = SentenceTransformerRerank(model="BAAI/bge-reranker-v2-m3", top_n=5)

# 5. Compose query engine with reranking postprocessor
query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever,
    node_postprocessors=[rerank],
)
response = query_engine.query("What does the return policy say about damaged items?")
LangChain retrievers
Source: github.com/langchain-ai/langchain (Python and TypeScript; MIT)
Classification General-purpose framework with retriever abstractions composable into LCEL chains.
Intent
Provide a retriever abstraction (BaseRetriever) that composes uniformly with the rest of LangChain’s ecosystem: any vector store can expose a retriever, retrievers compose via LCEL into chains, the chains plug into agents, and the agents plug into the broader LangChain runtime.
Motivating Problem
For teams whose application is fundamentally agent-shaped rather than retrieval-shaped --- the agent calls retrieval as one tool among many; the application has other concerns (tool use, multi-step reasoning, multi-modal inputs) competing for the framework choice --- LangChain’s general-purpose orientation fits better than LlamaIndex’s retrieval-first orientation. The retriever abstraction is the same conceptually but lives within a framework where retrieval is one capability among many.
How It Works
BaseRetriever is the abstract interface. Concrete retrievers implement _get_relevant_documents (and optionally the async _aget_relevant_documents); callers go through the standard Runnable methods invoke and ainvoke, and the older public get_relevant_documents entry point is deprecated. Every vector store wrapped by LangChain (Qdrant, Weaviate, Pinecone, Milvus, pgvector, Chroma, and many more) exposes an as_retriever() method that returns a retriever bound to that store. Cross-cutting retrievers (MultiQueryRetriever, EnsembleRetriever, ContextualCompressionRetriever, ParentDocumentRetriever, SelfQueryRetriever) wrap base retrievers with the query-understanding patterns from Section F.
Composition uses LCEL (LangChain Expression Language): retrievers, prompt templates, chat models, and output parsers chain together with the pipe operator. A canonical RAG chain is: retriever | format_docs | prompt | model | output_parser. The same retrievers plug into agent tools when the application is agent-shaped, and into batch processing when it’s ETL-shaped.
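The pipe-operator composition is easier to see in a toy form. The sketch below is not LangChain code: Step is a hypothetical stand-in for a Runnable, the model is a stub that returns a fixed string, and the chain is simplified so that only the retrieved context (not the original question) reaches the prompt.

```python
class Step:
    """Toy stand-in for a composable pipeline stage (not LangChain's Runnable)."""

    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        # a | b -> a new Step that runs a, then feeds its output to b
        return Step(lambda x: other.fn(self.fn(x)))

    def invoke(self, x):
        return self.fn(x)


DOCS = ["Refunds are issued within 14 days.", "Shipping takes 3-5 business days."]

# Keyword match stands in for a real vector-store retriever.
retriever = Step(lambda q: [d for d in DOCS if any(w in d.lower() for w in q.lower().split())])
format_docs = Step(lambda docs: "\n".join(docs))
prompt = Step(lambda ctx: f"Answer using only this context:\n{ctx}")
model = Step(lambda p: "STUB ANSWER <- " + p.splitlines()[-1])  # placeholder for a chat model
output_parser = Step(str.strip)

chain = retriever | format_docs | prompt | model | output_parser
print(chain.invoke("refunds policy"))
```

Each stage only needs to agree with its neighbors on input and output types, which is what lets the same retriever drop into a chain, an agent tool, or a batch job.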
The ParentDocumentRetriever implements the parent-child pattern from Chapter 4: small chunks are stored in the vector store, larger parent documents are stored in a document store, retrieval returns the parents whose children matched. SelfQueryRetriever uses an LLM to translate natural-language queries into structured queries with metadata filters --- the canonical pattern for retrieval over corpora with rich metadata.
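The parent lookup at the heart of this pattern is small. The following is a minimal sketch under stated assumptions, not the ParentDocumentRetriever source: the function name, the plain-dict docstore, and the sample identifiers are all hypothetical.

```python
def retrieve_parents(matched_child_ids, child_to_parent, docstore, limit=3):
    """Map matched child chunks to parent documents, deduplicated in rank order."""
    parent_ids = []
    for child_id in matched_child_ids:  # children arrive best-first
        parent_id = child_to_parent[child_id]
        if parent_id not in parent_ids:  # several children may share one parent
            parent_ids.append(parent_id)
    return [docstore[pid] for pid in parent_ids[:limit]]


child_to_parent = {"c1": "returns-policy", "c2": "returns-policy", "c3": "shipping-faq"}
docstore = {
    "returns-policy": "Full text of the returns policy (placeholder)",
    "shipping-faq": "Full text of the shipping FAQ (placeholder)",
}
print(retrieve_parents(["c2", "c3", "c1"], child_to_parent, docstore))
```

Small chunks decide what matches; the parent decides what the model reads.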
Ecosystem fit is the dominant practical reason teams pick LangChain over LlamaIndex: if the application already uses LangChain for agents, tools, prompt management, or any other capability, the retrievers come along naturally; switching to a different retrieval framework would mean maintaining two frameworks for the same kind of work.
When to Use It
Applications where retrieval is one capability among many and the framework choice is driven by the broader application shape. LangChain agents that need retrieval as a tool. Multi-modal or multi-tool applications where retrieval composition with other capabilities matters. Teams already invested in LangChain.
Alternatives --- LlamaIndex when retrieval is the dominant concern. Direct vector-store SDKs when the framework adds no value. Haystack 2.x for pipeline-DAG composition.
Sources
- python.langchain.com/docs/concepts/retrievers/
- github.com/langchain-ai/langchain
Section F — Query understanding patterns
Rewriting, HyDE, decomposition, multi-query --- transformations between user input and retriever input
What the user typed is rarely the optimal input for the retriever. A casual question (“why is my account locked”) lacks the formal vocabulary the documentation uses (“account lockout policy after failed authentication attempts”); a compound question (“what are your refund and shipping policies for international orders”) bundles topics that retrieve better separately; a vague question carries less signal than a fleshed-out hypothetical answer would. The query understanding layer transforms the raw query into one or more retriever inputs designed to surface the right documents. The transformations are LLM-driven, cheap by LLM standards, and consistently improve retrieval quality at modest latency cost. They are also the layer most teams skip in their first production deployment, which makes them the cheapest improvement available to most teams.
Query transformation patterns (implementable across frameworks)
Source: Implementable in LlamaIndex, LangChain, Haystack, or directly against any retriever
Classification Five transformation patterns: rewrite, expand, HyDE, decompose, multi-query.
Intent
Improve retrieval quality by transforming user input before retrieval: cleaning noisy inputs (rewrite), broadening vocabulary (expansion), bridging vocabulary mismatch (HyDE), breaking compound questions (decomposition), or hedging against any single phrasing being wrong (multi-query).
Motivating Problem
The vocabulary mismatch between user queries and corpus documents is the most persistent failure mode in retrieval. Users speak in casual, fragmented, sometimes ungrammatical English; documents are written in formal, complete, often jargon-heavy prose. Dense embeddings help (semantic similarity bridges some of the gap) but not always enough; BM25 fails on it directly. The query transformation layer addresses the mismatch by adapting the query rather than relying on the retrievers to handle the raw input.
How It Works
Rewriting is the cheapest pattern. An LLM cleans the query: fix typos, expand acronyms, remove conversational filler (“um, I was wondering if maybe”), and produce a clean reformulation that the retriever sees instead of the raw input. The cost is one LLM call; the quality lift is consistent on noisy real-world inputs. This should be the default for any production system handling natural user input.
HyDE (Hypothetical Document Embeddings, Gao et al. 2022) inverts the embedding direction. The LLM generates a hypothetical answer to the query --- a fake document that, if it existed, would answer the question well --- and the system embeds the hypothetical answer and retrieves documents similar to it. The intuition: answer-to-document similarity is often higher than question-to-document similarity, because answers and documents share vocabulary in ways questions and documents don’t. HyDE is counterintuitive but consistently effective when the user-corpus vocabulary mismatch is severe.
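A toy run makes the HyDE intuition concrete. Everything here is an assumption for illustration: the bag-of-words embed function stands in for a real embedding model, hypothetical_answer stands in for the LLM call and returns a fixed string, and the two-document corpus is contrived so that the raw query shares more surface vocabulary with the wrong document.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hypothetical_answer(query):
    """Stub for the LLM call that drafts a plausible answer (fixed, hypothetical output)."""
    return "Accounts are locked after repeated failed authentication attempts under the lockout policy."


CORPUS = [
    "Account lockout policy: accounts are locked after five failed authentication attempts.",
    "Why choose our premium account? Is it worth it for my team?",
]


def retrieve(text, k=1):
    q = embed(text)
    return sorted(CORPUS, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


query = "why is my account locked"
print(retrieve(query))                       # raw query matches the marketing page
print(retrieve(hypothetical_answer(query)))  # HyDE matches the lockout policy
```

With real dense embeddings the gap is less stark than with word overlap, but the direction is the same: the drafted answer sits closer to the document that answers the question.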
Decomposition splits compound questions into atomic sub-queries. “What are your refund and shipping policies for international orders?” becomes “What is the refund policy for international orders?” and “What is the shipping policy for international orders?”, each retrieved separately, with results combined in synthesis. The pattern fixes a specific failure: a single retrieval against a compound query often surfaces documents about one topic and not the other, because the embedding represents the compound rather than either constituent.
Multi-query generates several paraphrases of the original query (typically three to five), retrieves for each, and fuses the result sets via RRF or score combination. The pattern is cheap insurance against any single phrasing being wrong; if one paraphrase happens to align with corpus vocabulary, the relevant documents surface even if the original phrasing missed them. LangChain’s MultiQueryRetriever and LlamaIndex’s QueryFusionRetriever both implement this directly.
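Reciprocal rank fusion itself is a few lines. The sketch below assumes each paraphrase has already produced a ranked list of document ids; the ids are made up, and k=60 is the constant commonly used for RRF rather than anything specified in this catalog.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several rankings: each document scores sum(1 / (k + rank)) over the lists."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))


runs = [
    ["d3", "d1", "d7"],  # results for paraphrase 1
    ["d1", "d3", "d9"],  # results for paraphrase 2
    ["d1", "d4", "d3"],  # results for paraphrase 3
]
print(reciprocal_rank_fusion(runs))  # ['d1', 'd3', 'd4', 'd7', 'd9']
```

Because RRF uses only ranks, it fuses result sets whose raw scores are not comparable, which is why it also works across dense and sparse retrievers.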
Most production systems combine the patterns: rewrite always (cheap insurance against noisy input), decompose for known compound queries (often gated by an intent classifier that detects compound structure), HyDE in domains with known vocabulary mismatch (medical, legal, technical specifications), multi-query for high-stakes questions where the retrieval cost is justified by the answer’s importance.
When to Use It
Any production retrieval system handling natural user input. Domains with significant vocabulary mismatch between users and corpus. Compound-question workflows. High-stakes retrieval where the cost of additional LLM calls is justified.
Alternatives --- raw queries when the user input is already optimal for retrieval (rare). Classical IR techniques (relevance feedback, query expansion via WordNet) when the LLM cost is prohibitive. Domain-specific query parsers when the queries follow a known grammar.
Sources
- arxiv.org/abs/2212.10496 (HyDE)
- python.langchain.com/docs/how_to/MultiQueryRetriever/
- docs.llamaindex.ai/en/stable/examples/query_transformations/
Example artifacts
Code.
# Combining query transformations in LlamaIndex
from llama_index.core.indices.query.query_transform.base import HyDEQueryTransform
from llama_index.core.query_engine import (
    SubQuestionQueryEngine,  # decomposition
    TransformQueryEngine,
)
from llama_index.core.retrievers import QueryFusionRetriever  # multi-query
from llama_index.core.tools import QueryEngineTool

# Assumed to be defined elsewhere: engine_for_refunds, engine_for_shipping,
# base_query_engine, vector_retriever, bm25_retriever, and the helpers
# llm_rewrite, is_compound, is_high_stakes.

# 1. Decomposition: split compound queries into atomic sub-queries,
#    retrieve each separately, synthesize across results
decomposing_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=[
        QueryEngineTool(query_engine=engine_for_refunds, metadata=...),
        QueryEngineTool(query_engine=engine_for_shipping, metadata=...),
    ],
)

# 2. HyDE: generate hypothetical answer, retrieve against it
hyde_transform = HyDEQueryTransform(include_original=True)
hyde_engine = TransformQueryEngine(base_query_engine, hyde_transform)

# 3. Multi-query: paraphrase ensemble with RRF fusion
fusion_retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=10,
    num_queries=4,  # generate 4 paraphrases
    mode="reciprocal_rerank",  # RRF across all queries x retrievers
    use_async=True,
)

# 4. Compose: rewrite always, HyDE when high-stakes, decompose on compound queries
def smart_query(raw: str):
    """Route a raw user query; returns the chosen engine's response object."""
    cleaned = llm_rewrite(raw)  # always rewrite
    if is_compound(cleaned):
        return decomposing_engine.query(cleaned)
    if is_high_stakes(cleaned):
        return hyde_engine.query(cleaned)
    return base_query_engine.query(cleaned)
Classical IR techniques in the LLM era
Source: Lucene/Elasticsearch query DSL, learning-to-rank toolchains, Coveo and other enterprise search platforms
Classification Pre-LLM query understanding techniques still useful where domain structure is known.
Intent
Cover the classical information retrieval techniques --- query parsing, synonym expansion, learned-to-rank reranking, faceted filtering, intent classification --- that predate LLM-driven retrieval and remain valuable in domains where structured queries and curated synonyms outperform LLM transformations.
Motivating Problem
Decades of IR research produced robust techniques for query understanding that don’t require LLMs: WordNet-based synonym expansion, curated thesauri for domain vocabulary, learning-to-rank models trained on click data, faceted filters that narrow searches by structured attributes, intent classifiers that route queries to specialized handlers. These techniques are cheap, deterministic, auditable, and often complementary to LLM-driven transformations. Enterprise search platforms (Coveo, Lucidworks, Algolia) carry forward this discipline; their relevance engines combine classical techniques with embeddings as the LLM era catches up to where commercial search has been for years.
How It Works
Thesaurus and synonym expansion: curated mappings (“car” → [“vehicle”, “automobile”, “sedan”]) added to queries or documents at index time or query time. Modern variants generate the thesaurus via LLM (semantic-analysis-driven thesauri) but apply it via classical infrastructure. The 4-phase LLM-powered semantic analysis pipeline pattern --- generate thesaurus candidates with LLM, validate against corpus statistics, A/B-test in production, curate the survivors --- is a working hybrid that gets the maintainability of classical thesauri with the coverage of LLM-driven expansion.
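Query-time expansion with a curated thesaurus can be sketched in a few lines. This is a minimal sketch: the thesaurus reuses the single mapping from the text, and the function name and whitespace tokenization are assumptions; a production system would apply the mapping through the search engine's synonym filter.

```python
THESAURUS = {
    "car": ["vehicle", "automobile", "sedan"],
}


def expand_query(query, thesaurus=THESAURUS):
    """Return the query terms plus curated synonyms, in order, without duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term, *thesaurus.get(term, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query("used car insurance"))
# ['used', 'car', 'vehicle', 'automobile', 'sedan', 'insurance']
```

Because the mapping is an explicit table, every expansion is auditable, which is the maintainability argument for keeping the classical application layer even when an LLM proposes the candidates.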
Learning-to-rank: train a ranking model on click data and engagement signals. LambdaMART, XGBoost-on-pairwise-features, and the newer neural rankers all play this role. The model takes features (BM25 score, dense similarity, click history, recency, document quality scores, business rules) and produces a final relevance score. The technique is well-developed in commerce search and remains the production ranking layer in most large e-commerce deployments, with LLM-based reranking added on top for the high-stakes queries.
Faceted filtering and structured search: queries with structured constraints (category, price range, region, date) are parsed and applied as filters rather than free-text matching. The user types “red sneakers under $100”; the parser extracts color=red, type=sneakers, max_price=100, and applies them as filters before the relevance ranking runs. The technique is foundational in commerce search; the LLM-era variant uses LLMs as the parser (SelfQueryRetriever) but applies the parsed filters through classical infrastructure.
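A rule-based version of the parser for the example query might look like the following. It is a sketch under stated assumptions: the color and product-type vocabularies are hypothetical, and only the "under $N" price phrasing is handled.

```python
import re

COLORS = {"red", "blue", "black", "white"}        # hypothetical facet vocabulary
PRODUCT_TYPES = {"sneakers", "boots", "sandals"}  # hypothetical facet vocabulary
PRICE_PATTERN = r"under \$?(\d+(?:\.\d+)?)"


def parse_facets(query):
    """Split a query into structured filters and leftover free text."""
    text = query.lower()
    filters, leftover = {}, []
    price = re.search(PRICE_PATTERN, text)
    if price:
        filters["max_price"] = float(price.group(1))
        text = re.sub(PRICE_PATTERN, " ", text)
    for token in text.split():
        if token in COLORS:
            filters["color"] = token
        elif token in PRODUCT_TYPES:
            filters["type"] = token
        else:
            leftover.append(token)
    return filters, " ".join(leftover)


print(parse_facets("red sneakers under $100"))
```

The filters go to the engine as structured constraints, and only the leftover text (empty in this example) is scored for relevance.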
Intent classification: route queries to specialized handlers based on detected intent. “Where is my order” → order-tracking handler; “do you carry brand X” → catalog-search handler; “what’s your return policy” → policy-RAG handler. The classifier is small, fast, auditable; the handlers can be specialized for their intent. The pattern predates LLMs but extends naturally to LLM-driven intent classification when the intent taxonomy is large or fuzzy.
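The routing step can be as simple as an ordered list of patterns. The sketch below is illustrative only: the intent names, the regular expressions, and the general_rag fallback are assumptions, and a production classifier would more likely be a small trained model.

```python
import re

# (intent, pattern) pairs checked in order; all rules are hypothetical.
INTENT_RULES = [
    ("order_tracking", re.compile(r"\b(where is|track).*\border\b")),
    ("catalog_search", re.compile(r"\bdo you (carry|sell|stock)\b")),
    ("policy_rag", re.compile(r"\b(return|refund|shipping) policy\b")),
]


def classify_intent(query, default="general_rag"):
    """Return the first matching intent, or a fallback when nothing matches."""
    q = query.lower()
    for intent, pattern in INTENT_RULES:
        if pattern.search(q):
            return intent
    return default


for q in ["Where is my order", "do you carry brand X", "what's your return policy"]:
    print(q, "->", classify_intent(q))
```

Each rule is inspectable, so a misrouted query can be traced to the pattern that fired.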
When to Use It
Enterprise and commerce deployments with established search infrastructure. Domains where curated vocabulary (medical, legal, technical) outperforms general LLM transformations. High-volume production search where LLM call latency or cost is prohibitive on every query. Hybrid pipelines that combine classical and LLM techniques.
Alternatives --- pure LLM-driven transformations for low-volume, high-value queries. Custom domain-specific query parsers for narrow grammars. The complementary use is the working pattern: classical techniques as the foundation, LLM transformations as the high-leverage additions for the queries where they earn their cost.
Sources
- Manning, Raghavan, Schütze, Introduction to Information Retrieval (Cambridge, 2008)
- Coveo, Lucidworks, Algolia product documentation for the commercial state of the art
Section G — Knowledge graphs and graph-augmented retrieval
Neo4j with vector indexes, Microsoft GraphRAG, and the structured-knowledge alternative
For corpora where the relationships between entities matter as much as the entity content --- a regulatory compliance knowledge base where rules reference rules, a research corpus where papers cite papers, an enterprise wiki where pages link to pages --- retrieval that operates only on chunk content misses the structural signal that links carry. Knowledge graphs and graph-augmented retrieval address this by representing the corpus as nodes (entities, concepts, documents) and edges (relationships, citations, references), and retrieving over the graph structure as well as the content.
Two products dominate the 2025–2026 graph-augmented retrieval space. Neo4j is the established graph database that added native vector indexes in 2023, enabling hybrid graph-and-vector queries. Microsoft GraphRAG (released summer 2024) is the LLM-driven approach: an LLM extracts entities and relationships from the corpus, builds a knowledge graph automatically, and retrieves via community-summary structures that capture both local content and global structure. The two reflect different philosophies: Neo4j is build-the-graph-yourself with the database as the substrate; GraphRAG is extract-the-graph-with-an-LLM with the pipeline doing the work.
Neo4j with vector indexes
Source: neo4j.com (commercial graph database with community edition)
Classification Established graph database extended with native vector indexes for hybrid graph-and-vector retrieval.
Intent
Represent the corpus as a property graph (nodes, relationships, properties), index node content with vector embeddings via native vector indexes, and retrieve via Cypher queries that combine graph traversal with vector similarity.
Motivating Problem
Pure vector retrieval flattens structure. A query for “which SOC 2 controls relate to access management” against a document corpus retrieves chunks about access management and chunks about SOC 2 separately; the structural relationship between specific SOC 2 controls and specific access management practices is lost in the chunking. A graph representation --- SOC 2 controls as nodes, access management policies as nodes, relationships (“implements”, “requires”, “references”) as edges --- captures the structure directly. Combined with vector indexes on the node content for semantic search, the result is a substrate where both content similarity and structural relationships are first-class.
How It Works
Model the corpus as a graph. Entities (controls, policies, documents, authors, departments) become nodes with labels and properties. Relationships (“IMPLEMENTS”, “REFERENCES”, “AUTHORED_BY”, “SUPERSEDES”) become edges with their own properties. The schema is the team’s explicit model of the domain --- designing it well is the dominant work.
Index node content with vector embeddings. Neo4j supports native vector indexes on node properties: CREATE VECTOR INDEX document_embedding FOR (d:Document) ON (d.embedding). Queries combine vector similarity with graph traversal: find documents semantically similar to a query, then traverse relationships to find related controls, then traverse again to find the access management policies those controls implement.
The hybrid query pattern is the strength. Cypher (Neo4j’s query language) supports both the graph traversal idiom and vector similarity in the same query. A retrieval that uses both (“find documents semantically similar to this question AND that have a path through IMPLEMENTS relationships to a specific compliance framework”) is a single declarative query, where a pure-vector system would need multi-step orchestration.
Integration with LLM frameworks is first-class: LangChain’s Neo4jVector and Neo4jGraph, LlamaIndex’s Neo4jGraphStore and KnowledgeGraphIndex, and native MCP servers exposing Neo4j as a retrieval substrate for agents. The graph-and-vector pattern slots into the broader agent retrieval workflow without disrupting it.
When to Use It
Corpora where entity relationships carry as much signal as entity content: compliance, regulatory, citation networks, organizational knowledge with explicit structure. Cases where structural queries (“what depends on what”, “what references what”) are a recurring need. Domains with curated taxonomies or ontologies that graphs represent naturally.
Alternatives --- Microsoft GraphRAG for the LLM-extracted-graph alternative. Pure vector retrieval when structure isn’t the dominant signal. Property graph extensions in document databases (CosmosDB Gremlin, ArangoDB) when the existing infrastructure leans that direction.
Sources
- neo4j.com/labs/genai-ecosystem/
- neo4j.com/docs/cypher-manual/current/indexes/semantic-indexes/vector-indexes/
Example artifacts
Schema / config.
// Hybrid Cypher query: vector similarity + graph traversal
CALL db.index.vector.queryNodes('document_embedding', 5, $queryEmbedding)
YIELD node AS doc, score
MATCH (doc)-[:IMPLEMENTS]->(control:SOC2Control)
      -[:RELATES_TO]->(policy:AccessPolicy)
WHERE control.framework = 'SOC2_2017'
RETURN doc.title, control.id, policy.name, score
ORDER BY score DESC
LIMIT 20
Microsoft GraphRAG
Source: github.com/microsoft/graphrag (MIT)
Classification LLM-driven knowledge graph extraction with community-summary retrieval.
Intent
Automatically extract entities and relationships from an unstructured corpus into a knowledge graph using an LLM, identify communities of densely-related entities, produce hierarchical summaries for each community level, and retrieve via the community structure for queries that span the whole corpus rather than matching specific chunks.
Motivating Problem
Traditional RAG works well for queries that map to specific documents (“what does the warranty section of the manual say”) and poorly for queries that span the corpus (“what are the major themes across customer complaints”). The latter requires global understanding of the corpus, not local retrieval of relevant chunks. GraphRAG’s answer is to build a graph capturing entities and relationships across the corpus, detect communities (densely-connected subgraphs) at multiple levels of granularity, produce LLM-generated summaries for each community, and retrieve from the summary hierarchy rather than from chunks. Local queries hit the leaves; global queries hit the roots; mid-level queries hit the appropriate community level.
How It Works
Indexing phase (offline, expensive): the LLM processes the corpus chunk by chunk, extracting named entities and their relationships into a knowledge graph. The graph is enriched with descriptions --- each entity and each relationship has an LLM-generated description summarizing its role. The Leiden community detection algorithm finds densely-connected subgraphs at multiple levels (level 0 = whole corpus; level 1 = major communities; level 2 = sub-communities; etc.). For each community at each level, the LLM produces a summary capturing what that community is about.
Querying phase (online): two retrieval modes. Local search retrieves entities semantically similar to the query, walks the graph from those entities to gather related context, and feeds the result to the LLM --- similar to traditional RAG but graph-aware. Global search routes the query against the community summary hierarchy, identifying which communities are relevant, retrieving their summaries, and synthesizing across them --- the answer reflects global structure rather than local chunk matches.
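Global search has a map-reduce shape that a toy version can show. The sketch below is not the GraphRAG implementation: the community summaries are invented, map_step replaces the per-community LLM call with word overlap, and the reduce step simply concatenates where a real system would ask the LLM to synthesize.

```python
# Hypothetical output of the indexing phase: summaries keyed by community level.
COMMUNITY_SUMMARIES = {
    1: {
        "c1": "Complaints about late delivery and damaged packaging.",
        "c2": "Complaints about billing errors and refund delays.",
        "c3": "Praise for the mobile app redesign.",
    },
}


def map_step(query, summary):
    """Stub for the per-community LLM call: (partial_answer, helpfulness 0-100)."""
    q_terms = set(query.lower().split())
    s_terms = set(summary.lower().rstrip(".").split())
    return summary, min(100, len(q_terms & s_terms) * 50)


def global_search(query, level=1, max_partials=2):
    partials = [map_step(query, s) for s in COMMUNITY_SUMMARIES[level].values()]
    useful = sorted((p for p in partials if p[1] > 0), key=lambda p: -p[1])[:max_partials]
    # Reduce step: join the most helpful partial answers (an LLM would synthesize here).
    return " ".join(text for text, _ in useful)


print(global_search("major themes across customer complaints"))
```

The query never touches raw chunks: relevance is judged per community, and the answer is assembled from the summaries that scored as helpful.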
The trade-off is indexing cost. Building the graph requires LLM calls proportional to corpus size; for a million-document corpus the cost can be substantial. The trade-off is justified when the query workload includes global questions that traditional RAG handles poorly; it’s unjustified when all queries are local and traditional RAG works. The release of LightRAG and other lower-cost graph-RAG variants in 2024–2025 reduced the indexing cost gap; the architectural decision remains real.
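A back-of-envelope estimate shows how the cost scales. Every number below is a hypothetical placeholder (chunks per document, tokens per chunk, and price per million tokens), and the sketch counts only extraction input tokens, ignoring output tokens and the community-summary pass.

```python
def estimate_indexing_cost(num_docs, chunks_per_doc, tokens_per_chunk,
                           usd_per_million_tokens, passes=1):
    """Rough input-token cost of LLM extraction over a corpus."""
    total_tokens = num_docs * chunks_per_doc * tokens_per_chunk * passes
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical inputs: 1M documents, 10 chunks each, 600 tokens per chunk,
# and a placeholder price of $1.00 per million input tokens.
cost = estimate_indexing_cost(1_000_000, 10, 600, usd_per_million_tokens=1.0)
print(f"${cost:,.0f}")  # $6,000
```

The estimate is linear in every factor, so doubling the corpus or adding a second extraction pass doubles the bill; plugging in a team's real chunking and pricing is the point of the exercise.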
When to Use It
Corpora where global queries (themes, patterns, summaries across many documents) are common. Research synthesis applications. Customer feedback analysis at scale. Multi-document reasoning where the answer requires connecting facts across documents traditional RAG would retrieve separately.
Alternatives --- Neo4j with vector indexes when the team has explicit schema to model. LightRAG and other low-cost graph-RAG variants when the indexing cost matters. Traditional RAG when queries are local.
Sources
- github.com/microsoft/graphrag
- microsoft.github.io/graphrag/
- arxiv.org/abs/2404.16130
Section H — Discovery and benchmarks
MTEB, BEIR, and the resources for tracking the retrieval ecosystem
Retrieval has the rare advantage among AI capabilities of being objectively measurable. Decades of information retrieval research produced metrics that work: NDCG, MRR, recall@k, precision@k. Modern benchmarks (MTEB for embeddings, BEIR for retrievers, RAGAS for end-to-end RAG quality) put numbers on the choices teams make. The leaderboards capture current state; the awesome lists capture the ecosystem; the conference proceedings capture where it’s going. Tracking all three keeps the catalog’s recommendations honest as the field moves.
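The four metrics named above are short enough to write out. The sketch below uses the standard textbook definitions; the function names and the example document ids are illustrative, and graded relevance is passed as a hypothetical `gains` mapping. MRR is the mean of `reciprocal_rank` over a set of queries.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, gains, k):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    dcg = sum(
        gains.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0


ranked = ["d3", "d1", "d7", "d2"]   # what the retriever returned, best first
gains = {"d1": 3, "d2": 1}          # graded relevance judgments
relevant = set(gains)

print(precision_at_k(ranked, relevant, 2))
print(recall_at_k(ranked, relevant, 2))
print(reciprocal_rank(ranked, relevant))
print(round(ndcg_at_k(ranked, gains, 4), 3))
```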
MTEB, BEIR, and benchmark resources
Source: huggingface.co/spaces/mteb/leaderboard, beir.ai, github.com/explodinggradients/ragas
Classification: Benchmarks and leaderboards for the retrieval stack.
Intent
Provide standardized measurement of retrieval components --- MTEB for embedding models (eight task types spanning 58 datasets in its original release, with more added since), BEIR for retrievers on a curated set of zero-shot retrieval benchmarks, RAGAS for end-to-end RAG quality --- with public leaderboards that update as new models and methods are released.
Motivating Problem
Retrieval is measurable in ways generation is not. The IR community produced rigorous metrics decades ago; the LLM era inherited them and extended them. MTEB (Massive Text Embedding Benchmark) measures embedding models across classification, clustering, retrieval, reranking, semantic textual similarity, summarization, and more. BEIR measures retrievers on zero-shot generalization across diverse tasks (open-domain Q&A, fact verification, biomedical retrieval, financial documents). RAGAS measures end-to-end RAG systems on faithfulness, answer relevancy, context precision, and context recall --- the multi-dimensional view of “does the RAG pipeline work” that production deployments need.
How It Works
MTEB: download the benchmark suite, run candidate embedding models against it, submit results to the leaderboard. The leaderboard tracks scores by task category and overall average; teams use it to select embedding models against tasks similar to their workload. The leaderboard updates as new models are released; current top-of-leaderboard models change every few months.
BEIR: similar pattern for retrievers (which produce ranked results) rather than just embedders (which produce vectors). The diverse zero-shot tasks reveal which retrievers generalize and which overfit to specific domains. BM25 remains a strong baseline; modern hybrid retrievers and learned dense retrievers compete above it.
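Because BM25 is the baseline every other retriever is compared against, it helps to see how small it is. The following is a minimal sketch of Okapi BM25 with whitespace tokenization; the class name, the three example documents, and the parameter defaults (`k1=1.5`, `b=0.75`) are illustrative choices, and production engines add analyzers, stemming, and inverted indexes on top of the same scoring idea.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class BM25:
    """Okapi BM25 over an in-memory list of documents."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_len = [len(t) for t in self.doc_tokens]
        self.avgdl = sum(self.doc_len) / len(docs)
        self.tf = [Counter(t) for t in self.doc_tokens]
        n = len(docs)
        # Document frequency: in how many documents each term appears.
        df = Counter(term for t in self.doc_tokens for term in set(t))
        self.idf = {
            term: math.log((n - f + 0.5) / (f + 0.5) + 1.0)
            for term, f in df.items()
        }

    def score(self, query, idx):
        total = 0.0
        for term in tokenize(query):
            f = self.tf[idx].get(term, 0)
            if f == 0:
                continue
            # Length normalization: long documents are penalized via b.
            norm = f + self.k1 * (1 - self.b + self.b * self.doc_len[idx] / self.avgdl)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def rank(self, query, top_k=3):
        """Indexes of matching documents, best first."""
        scored = [(self.score(query, i), i) for i in range(len(self.doc_tokens))]
        scored = [(s, i) for s, i in scored if s > 0]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [i for _, i in scored[:top_k]]


docs = [
    "bm25 is a sparse lexical ranking function",
    "dense retrievers embed queries and documents as vectors",
    "hybrid retrievers fuse sparse and dense rankings",
]
index = BM25(docs)
print(index.rank("sparse lexical ranking"))
```

Note that the query term "ranking" does not match "rankings" in the third document. That kind of exact-match brittleness is one reason dense and hybrid retrievers can compete above the baseline.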
RAGAS: the end-to-end measurement, complementary to MTEB and BEIR. Where MTEB and BEIR measure components, RAGAS measures the whole pipeline against the team’s actual data. The metrics (faithfulness, answer relevancy, context precision, context recall) are LLM-judged with the LLM-as-judge caveats from Volume 8 Chapter 4. The framework integrates with both LlamaIndex and LangChain.
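To show the shape of the two context metrics, here is a simplified sketch. It does not call the RAGAS library and its definitions are approximations: RAGAS uses LLM judgments, whereas `keyword_judge` below is a crude, hypothetical stand-in that any judge callable could replace. Context recall is taken as the share of reference claims supported by some retrieved context, and context precision as a rank-weighted average that rewards useful contexts appearing early.

```python
from typing import Callable, Sequence

Judge = Callable[[str, str], bool]


def keyword_judge(claim: str, context: str) -> bool:
    """Crude stand-in for an LLM judge: every claim word appears in the context."""
    context_words = set(context.lower().split())
    return all(word in context_words for word in claim.lower().split())


def context_recall(
    reference_claims: Sequence[str],
    contexts: Sequence[str],
    judge: Judge = keyword_judge,
) -> float:
    """Fraction of reference claims supported by at least one retrieved context."""
    if not reference_claims:
        return 0.0
    supported = sum(
        any(judge(claim, ctx) for ctx in contexts) for claim in reference_claims
    )
    return supported / len(reference_claims)


def context_precision(
    reference_claims: Sequence[str],
    contexts: Sequence[str],
    judge: Judge = keyword_judge,
) -> float:
    """Mean of precision@k taken at each rank k that holds a useful context."""
    useful = [any(judge(claim, ctx) for claim in reference_claims) for ctx in contexts]
    hits, total = 0, 0.0
    for k, is_useful in enumerate(useful, start=1):
        if is_useful:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0


claims = ["paris is the capital of france", "the louvre is in paris"]
retrieved = [
    "paris hosts many museums",
    "paris is the capital of france and its largest city",
]
print(context_recall(claims, retrieved))
print(context_precision(claims, retrieved))
```

In this example only one of the two claims is supported, and the supporting context sits at rank 2 rather than rank 1, so both scores are penalized for different reasons. Reporting the two dimensions separately exists to expose that distinction.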
Awesome lists and ecosystem resources: awesome-rag, awesome-information-retrieval, the LlamaIndex and LangChain documentation, conference proceedings (SIGIR, ECIR, ICTIR for the IR research community; NeurIPS, ICLR, ACL for the broader ML community). The combination tracks both products and research as both evolve.
When to Use It
Selecting embedding models, retrievers, or reranking models against benchmark performance. Validating that a deployed system’s end-to-end quality matches what its components’ benchmark scores would suggest. Tracking the state of the art as new models and methods are released.
Alternatives --- build your own evaluation against representative production data, which always beats any leaderboard for the team’s specific case. Combine: use benchmarks for the initial component selection, then build internal evals against production traces (Volume 8 Section C) for ongoing measurement.
Sources
- huggingface.co/spaces/mteb/leaderboard
- beir.ai
- github.com/explodinggradients/ragas
- sigir.org
Appendix A --- The Retrieval Stack Reference Table
Cross-reference between the eight layers of the retrieval stack (Chapter 2) and their representative substrates.
| Layer | Concern | Representative substrates |
|---|---|---|
| Document processing | Parsing, OCR, layout extraction | Unstructured.io, LlamaParse, Marker, Docling |
| Chunking | Splitting documents into retrieval units | Recursive, semantic, hierarchical, contextual |
| Embedding | Producing vectors for indexing | OpenAI text-embedding-3, Voyage, Cohere, BGE, E5, Jina |
| Indexing | Hybrid vector + BM25 + metadata | OpenSearch, Elasticsearch, Vespa, Qdrant, Weaviate |
| Query understanding | Transforming user input for retrieval | Rewrite, HyDE, decomposition, multi-query |
| Retrieval | Returning top-K candidates | BM25 + k-NN + RRF fusion |
| Reranking | Reordering candidates with a cross-encoder or late-interaction model | Cohere Rerank, BGE Reranker, Voyage Rerank, ColBERT |
| Context assembly | Deduplicating, formatting, fitting context window | LlamaIndex, LangChain retrievers, custom code |
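
The "BM25 + k-NN + RRF fusion" entry in the Retrieval row can be sketched in a few lines. Reciprocal rank fusion scores each document by summing 1 / (k + rank) over every ranked list it appears in. The constant `k=60` is the value commonly used in the literature, and the document ids and list contents below are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_k=None):
    """Fuse several ranked lists of document ids into one ranking.

    Only ranks are used, so the BM25 and vector scores never need to be
    put on a common scale.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by document id for determinism.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused[:top_k] if top_k else fused


bm25_hits = ["d2", "d5", "d1"]   # sparse retriever output, best first
knn_hits = ["d1", "d2", "d9"]    # dense k-NN output, best first
print(reciprocal_rank_fusion([bm25_hits, knn_hits]))
# → ['d2', 'd1', 'd5', 'd9']
```

Documents that both retrievers return rise to the top; documents seen by only one retriever are kept but ranked lower, which is what makes the fused candidate list a good input for the reranking layer.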
Appendix B --- The Ten-Volume Series
This catalog joins the nine prior volumes to form a ten-layer vocabulary for agentic AI.
- Volume 1 --- Patterns of AI Agent Workflows --- the timing of agent runs.
- Volume 2 --- The Claude Skills Catalog --- model instructions in packaged form.
- Volume 3 --- The AI Agent Tools Catalog --- the function-calling primitives.
- Volume 4 --- The AI Agent Events & Triggers Catalog --- the activation layer.
- Volume 5 --- The AI Agent Fabric Catalog --- the infrastructure substrate.
- Volume 6 --- The AI Agent Memory Catalog --- the state and context layer.
- Volume 7 --- The Human-in-the-Loop Catalog --- the human-agent interaction layer.
- Volume 8 --- The Evaluation & Guardrails Catalog --- the governance layer.
- Volume 9 --- The Multi-Agent Coordination Catalog --- the agent-to-agent communication layer.
- Volume 10 --- The Retrieval & Knowledge Engineering Catalog (this volume) --- the discipline of finding the right information in a corpus.
Retrieval is the oldest discipline in this catalog. The other nine cover capabilities that became practical only with LLMs; retrieval has been a working engineering field since the 1960s and its core algorithms (BM25, learning-to-rank, faceted search) remain the production substrate. What the LLM era added is the dense-embedding layer on top, the LLM-driven query understanding patterns, the cross-encoder reranking that produces consistent quality lifts, and --- critically --- the agent layer that makes retrieval into a dynamic tool the agent reasons about rather than a fixed pipeline.
The cross-connection to the rest of the catalog is at exactly the agent-as-retrieval-consumer point. Volume 1’s patterns determine when retrieval runs. Volume 3’s tools include retrieval tools as a category. Volume 6’s memory uses similar infrastructure but serves a different discipline. Volume 7’s observability platforms trace retrieval calls alongside other tool calls. Volume 8’s evaluation includes retrieval-specific metrics (RAGAS faithfulness, context precision) alongside generation metrics. Volume 9’s multi-agent systems often have one agent specialized for retrieval. Retrieval is not a layer; it’s a discipline that integrates with every layer above it.
Ten volumes. The series covers the working vocabulary of agentic AI as of mid-2026. The products will change; the structural vocabulary should hold up; that’s the catalog’s value proposition. An architect who internalizes the structure can map any new product onto the framework; an architect who learns only the products has to relearn the field every year.
Appendix C --- The Eight Retrieval Anti-Patterns
Eight recurring mistakes that distinguish careful retrieval engineering from improvised RAG pipelines. Avoiding these is most of the practical wisdom in the field:
- Defaulting to pure dense retrieval. The 2022 design (one embedding model, one vector store, one k-NN search) underperforms hybrid (dense + sparse + reranking) on essentially every public benchmark and most production workloads. Teams that ship pure-dense are leaving measurable quality on the table; the migration to hybrid is one of the highest-leverage improvements available.
- Treating chunking as housekeeping. Chunking strategy often matters more than embedding model choice. Fixed-size chunking with fixed overlap is the default of every tutorial and the wrong choice for nearly every structured corpus. Structure-aware chunking (by heading, by section, by code function) is the working baseline; hierarchical and contextual chunking are the production upgrades.
- Skipping query understanding. What the user typed is rarely the optimal retriever input. Rewriting (cheap and effective), HyDE (for vocabulary mismatch domains), decomposition (for compound queries), and multi-query (for hedging against any single phrasing being wrong) are the working transformations. Most production systems skip them and ship measurably worse retrieval as a result.
- Evaluating retrieval and generation together. “The RAG isn’t working” is a diagnostic dead end. The retrieval pipeline and the generation pipeline are two separable systems with separable failure modes; evaluate each independently. Most “RAG isn’t working” diagnoses resolve to retrieval failures (the right document never made it to context) that prompt engineering can’t fix.
- Ignoring the document processing layer. Garbage in, garbage out is the oldest rule in computing and applies forcefully to retrieval. A corpus parsed badly --- tables flattened to gibberish, document structure lost, OCR errors propagated --- produces retrieval failures no downstream improvement can fully compensate for. Profile representative documents against multiple parsers; pick the one that preserves the structure the corpus has.
- Conflating retrieval with memory. Volume 6’s memory disciplines and this volume’s retrieval disciplines share infrastructure but solve different problems. Memory deployments worry about eviction; retrieval deployments worry about ranking. Designs that mix the engineering pressures (eviction policies on knowledge bases, ranking debates on conversation summaries) produce both worse memory and worse retrieval.
- Optimizing bottom-up. Most teams optimize the visible layer first: rerankers, prompts, generation. The leverage is top-down: corpus quality first, then chunking, then hybrid indexing, then query understanding. A team that spends six months tuning the reranker on a badly chunked corpus typically sees negligible improvement; a team that fixes chunking in two weeks can see a step-change. Optimize in leverage order.
- Treating retrieval as a fixed pipeline forever. The 2023 single-shot RAG pattern works for simple cases and fails for complex ones. Agentic retrieval --- retrieval as a tool the agent calls, possibly multiple times, possibly with different queries, with reasoning between calls --- is the working pattern for hard cases. The transition is the same shift the rest of this catalog covers; retrieval is part of the agent design, not separate from it.
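As an illustration of the structure-aware baseline named in the chunking anti-pattern, here is a minimal sketch that splits a Markdown document by heading and keeps the heading trail with each chunk. The function name, the `max_chars` fallback, and the sample document are assumptions for the example. It also treats any line starting with `#` as a heading, which would misfire inside fenced code blocks.

```python
def chunk_by_heading(markdown_text, max_chars=1000):
    """Split Markdown into one chunk per section, tagged with its heading path."""
    chunks = []
    path = []   # current heading trail, e.g. ["Guide", "Install"]
    body = []   # lines collected for the section being read

    def flush():
        text = "\n".join(body).strip()
        if text:
            # Fallback: oversized sections are cut into fixed-size pieces.
            for start in range(0, len(text), max_chars):
                chunks.append({
                    "heading_path": " > ".join(path),
                    "text": text[start:start + max_chars],
                })
        body.clear()

    for line in markdown_text.splitlines():
        if line.startswith("#"):
            flush()  # close the previous section under its own heading path
            level = len(line) - len(line.lstrip("#"))
            del path[level - 1:]          # drop headings at this depth or deeper
            path.append(line.lstrip("#").strip())
        else:
            body.append(line)
    flush()
    return chunks


sample = """# Guide
Intro text.
## Install
Run the installer.
## Usage
Call the API.
"""

for chunk in chunk_by_heading(sample):
    print(chunk["heading_path"], "|", chunk["text"])
```

Keeping the heading path alongside the text is a small step toward contextual chunking: the path can be prepended to the chunk before embedding so that a short section such as "Run the installer." still carries the topic it belongs to.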
Appendix D --- Discovery and Standards
Resources for tracking the retrieval ecosystem as it evolves:
- MTEB leaderboard (huggingface.co/spaces/mteb/leaderboard) --- the embedding-model state of the art.
- BEIR (beir.ai) --- zero-shot retriever benchmarks.
- RAGAS (github.com/explodinggradients/ragas) --- end-to-end RAG evaluation.
- LlamaIndex documentation (docs.llamaindex.ai) --- the retrieval-first framework’s reference, often the clearest write-ups of new retrieval techniques as they emerge.
- LangChain documentation (python.langchain.com) --- the general-purpose framework’s retriever reference.
- Anthropic’s contextual retrieval research (anthropic.com/news/contextual-retrieval, September 2024) --- the canonical demonstration that chunking-with-context produces meaningful retrieval improvement.
- Conference proceedings: SIGIR, ECIR, ICTIR for the IR research community; NeurIPS, ICLR, ACL for the broader ML community. arXiv cs.IR for the firehose.
- Enterprise search vendors: Coveo, Lucidworks, Algolia, Elastic, OpenSearch --- their documentation and case studies capture the production state of the art in commerce and enterprise search.
Two practical recommendations. First, build retrieval evaluation against your own representative data from day one. Component benchmarks (MTEB, BEIR) tell you which embedding model and which retriever performed well on someone else’s data; only your evaluation tells you what works on yours. The standing investment in a held-out retrieval evaluation set, with curated ground truth and tracked metrics over time, is the single most useful piece of infrastructure for any serious retrieval deployment. Second, optimize in leverage order: corpus quality first, then chunking, then hybrid indexing, then query understanding, then reranking, then prompts. Most teams optimize in the reverse order and wonder why their RAG quality plateaus.
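The first recommendation amounts to a small harness: a held-out set of queries with curated relevant document ids, a function that runs any retriever over it, and a log of results over time. The sketch below assumes a retriever is any callable from query string to ranked document ids; the queries, document ids, and the two stub retrievers are hypothetical placeholders, not real results.

```python
from statistics import mean

# Hypothetical held-out set: each query is paired with curated ground truth.
EVAL_SET = [
    {"query": "reset password", "relevant": {"kb-12"}},
    {"query": "refund policy", "relevant": {"kb-7", "kb-8"}},
]


def evaluate(retriever, eval_set, k=5):
    """Run a retriever over the evaluation set and aggregate per-query scores."""
    recalls, hits = [], []
    for item in eval_set:
        top = retriever(item["query"])[:k]
        found = item["relevant"] & set(top)
        recalls.append(len(found) / len(item["relevant"]))
        hits.append(1.0 if found else 0.0)
    return {"recall@k": mean(recalls), "hit_rate@k": mean(hits), "k": k}


# Stub retrievers standing in for two pipeline configurations.
baseline_results = {
    "reset password": ["kb-3", "kb-12"],
    "refund policy": ["kb-1", "kb-2"],
}
candidate_results = {
    "reset password": ["kb-12"],
    "refund policy": ["kb-7", "kb-2", "kb-8"],
}

history = []  # tracked over time, for example one entry per pipeline change
for label, results in [("baseline", baseline_results), ("candidate", candidate_results)]:
    metrics = evaluate(lambda q, r=results: r[q], EVAL_SET, k=2)
    history.append({"label": label, **metrics})

for entry in history:
    print(entry)
```

Because the harness only depends on the retriever's output, the same evaluation set can score a change at any layer in the leverage order, from a new parser to a new reranker, without involving the generation step.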
Appendix E --- Omissions
This catalog covers about 16 substrates across 8 sections. The wider retrieval ecosystem is significantly larger; a non-exhaustive list of what isn’t here:
- Vector databases when treated as primary memory infrastructure --- covered in Volume 6 (Memory). The overlap is real; the framing distinction is whether the data structure is being used as the agent’s state or as the corpus.
- Web-scale search APIs (Google Custom Search, Bing, Brave Search, Tavily, Exa, You.com, Serper) when treated as undifferentiated services. They’re relevant to agent retrieval but their own literature covers them; from a catalog perspective they’re tools (Volume 3) more than retrieval engineering.
- Specialized retrieval for non-text modalities (CLIP for images, multimodal embeddings, video search). The patterns echo text retrieval but the substrate decisions differ enough that they warrant their own treatment.
- Classical IR research depth (probabilistic retrieval, query likelihood models, neural ranking research beyond the production-relevant techniques covered).
- Closed enterprise search platforms when used outside the LLM-augmented retrieval context: Coveo, Lucidworks, Sinequa, Algolia, Glean. Each is a substantial product with its own literature.
- Domain-specific retrieval engineering for verticals: medical (UMLS, MeSH integration), legal (case law citation networks, statutory retrieval), code (the entire field of code search and retrieval-augmented code generation). The patterns generalize but the domain-specific resources are extensive.
Appendix F --- A Note on the Moving Target
Anthropic published contextual retrieval in September 2024 with a claim of roughly 49% retrieval failure reduction. Microsoft open-sourced GraphRAG in summer 2024, and LightRAG followed in late 2024 as a lower-cost variant. Cohere Rerank v3 shipped in 2024, and MongoDB acquired Voyage AI in early 2025. BGE Reranker v2 emerged from BAAI. Neo4j shipped native vector indexes in 2023. OpenAI released its text-embedding-3 models in early 2024. The retrieval category moved fast from 2023 through 2025; the moves were additive (each new capability composes with existing infrastructure) rather than replacement (existing pipelines didn’t need to be rewritten).
The deepest structural facts are stable. Retrieval is a discipline distinct from memory, even when both use vector stores. The retrieval stack has eight layers with characteristic concerns at each. Hybrid search beats pure dense or pure sparse alone. Chunking is design, not housekeeping. Query understanding is the cheapest improvement most teams skip. Reranking provides consistent lift. Knowledge graphs add structural signal where corpus relationships matter. End-to-end evaluation should separate retrieval from generation. Retrieval evolved from a fixed pipeline to a tool the agent calls, and the agent layer is where this volume connects to the rest of the catalog.
Ten volumes. Patterns, Skills, Tools, Events, Fabric, Memory, Human-in-the-Loop, Evaluation & Guardrails, Multi-Agent Coordination, Retrieval & Knowledge Engineering. The vocabulary covers the working design space of agentic AI as of mid-2026. The products will keep moving. The structural vocabulary should hold up. The catalog’s value is in the structure, not in the products.
--- End of The Retrieval & Knowledge Engineering Catalog v0.1 ---