Hybrid RAG Core


A production-style Retrieval-Augmented Generation pipeline built from scratch. Upload a PDF, ask questions, get answers grounded in your document.

Built to understand and demonstrate every step of the RAG pipeline — from PDF parsing to hybrid search to streaming LLM responses.

Architecture

PDF Upload → pymupdf parsing → recursive chunking → TF-IDF embedding
→ Pinecone vector storage → hybrid retrieval (dense + BM25 + RRF)
→ LLM reranking → Groq/Llama 3.1 streaming response
┌─────────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│ PDF Upload  │────→│  Chunker │────→│ Embedder │────→│ Pinecone │
└─────────────┘     └──────────┘     └──────────┘     └──────────┘
                                                           │
┌─────────────┐     ┌──────────┐     ┌──────────┐          │
│  Groq LLM   │←────│ Reranker │←────│ Retriever│←─────────┘
│ (streaming) │     │ (LLM)    │     │ (hybrid) │
└─────────────┘     └──────────┘     └──────────┘

What each step does

1. PDF Parsing

Extracts text from PDFs using PyMuPDF (imported as fitz), a free and fast PDF parser. In practice it tends to handle multi-column layouts, tables, and complex formatting more reliably than PyPDF2.
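A minimal sketch of the parsing step, assuming page-by-page extraction followed by light whitespace cleanup. The function names `extract_pages` and `normalize_page_text` are illustrative, not taken from the project; `fitz.open` and `page.get_text()` are the PyMuPDF calls used.

```python
import re


def normalize_page_text(raw: str) -> str:
    """Collapse runs of spaces/tabs and squeeze 3+ newlines into one paragraph break."""
    text = re.sub(r"[ \t]+", " ", raw)
    text = re.sub(r" ?\n ?", "\n", text)      # drop spaces hugging a newline
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def extract_pages(pdf_path: str) -> list[str]:
    """Return one cleaned text string per page of the PDF."""
    import fitz  # PyMuPDF; imported lazily so the helper above works without it

    with fitz.open(pdf_path) as doc:
        return [normalize_page_text(page.get_text()) for page in doc]
```

Keeping the text per page makes it easy to attach page numbers to chunks later, which helps when citing sources in answers.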

2. Chunking

Recursive character splitting — tries to split on natural boundaries in order: paragraphs → newlines → sentences → words. Keeps semantically related text together while respecting size limits.
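The splitting order above can be sketched as follows. This is an illustrative version, not the project's exact code: the separator list, the 500-character default, and the name `split_recursive` are assumptions. Separators that fall on a chunk boundary are dropped.

```python
SEPARATORS = ("\n\n", "\n", ". ", " ")  # paragraphs, newlines, sentences, words


def split_recursive(text: str, max_chars: int = 500,
                    separators: tuple[str, ...] = SEPARATORS) -> list[str]:
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No natural boundary left: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p.strip()]
    chunks: list[str] = []
    buffer = ""
    for piece in pieces:
        candidate = f"{buffer}{sep}{piece}" if buffer else piece
        if len(candidate) <= max_chars:
            buffer = candidate            # still fits: keep merging neighbours
            continue
        if buffer:
            chunks.append(buffer.strip())
            buffer = ""
        if len(piece) <= max_chars:
            buffer = piece
        else:
            # The piece alone is too big: retry it with the next, finer separator.
            chunks.extend(split_recursive(piece, max_chars, finer))
    if buffer:
        chunks.append(buffer.strip())
    return chunks
```

Merging neighbouring pieces back up to the size limit is what keeps related sentences in the same chunk instead of producing one chunk per paragraph fragment.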

3. Embedding

TF-IDF vectors with fixed dimensions. Converts text to numerical vectors based on word importance. Each word's score = how often it appears in this chunk × how rare it is across all chunks. Designed to be swapped for neural embeddings (BGE, OpenAI, Sentence Transformers) when available.
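One way to get fixed-size TF-IDF vectors is to hash each term into a fixed number of buckets. The sketch below assumes that approach; the class name `HashedTfidfEmbedder`, the 512-dimension default, and the smoothed IDF formula are illustrative choices, and the project may build its fixed dimensions differently (for example with a capped vocabulary).

```python
import math
import re
import zlib
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


class HashedTfidfEmbedder:
    """TF-IDF with a fixed output size: each term is hashed into one of `dim` buckets."""

    def __init__(self, dim: int = 512):
        self.dim = dim
        self.idf: dict[str, float] = {}

    def fit(self, chunks: list[str]) -> "HashedTfidfEmbedder":
        n = len(chunks)
        doc_freq = Counter(t for chunk in chunks for t in set(tokenize(chunk)))
        # Smoothed IDF: terms found in fewer chunks get a larger weight.
        self.idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
        return self

    def embed(self, text: str) -> list[float]:
        vec = [0.0] * self.dim
        counts = Counter(tokenize(text))
        total = sum(counts.values()) or 1
        for term, count in counts.items():
            weight = (count / total) * self.idf.get(term, 1.0)  # unseen terms get 1.0
            # crc32 is stable across runs, unlike Python's salted built-in hash().
            vec[zlib.crc32(term.encode()) % self.dim] += weight
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]  # unit length, so dot product equals cosine
```

Hash collisions merge unrelated terms into one bucket, so a larger `dim` trades memory for fewer collisions.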

4. Vector Storage

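Pinecone is a managed cloud vector database with HNSW-style approximate nearest-neighbor indexing. Vectors persist between sessions, and queries run in roughly O(log n) time rather than the O(n) of a brute-force scan, trading exact results for speed. Falls back to in-memory search if Pinecone is unavailable.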
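The in-memory fallback can be pictured as a brute-force cosine scan behind the same upsert/query shape. This is a sketch under assumptions: `InMemoryVectorStore`, `make_store`, and the `connect_remote` callable are hypothetical names, and the real Pinecone client calls are deliberately left out.

```python
import math


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class InMemoryVectorStore:
    """Brute-force stand-in for a vector database: O(n) cosine scan per query."""

    def __init__(self) -> None:
        self._vectors: dict[str, list[float]] = {}

    def upsert(self, items: list[tuple[str, list[float]]]) -> None:
        for chunk_id, vector in items:
            self._vectors[chunk_id] = vector

    def query(self, vector: list[float], top_k: int = 5) -> list[tuple[str, float]]:
        scored = [(cid, _cosine(vector, v)) for cid, v in self._vectors.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


def make_store(connect_remote=None):
    """Try the remote store first; fall back to memory if connecting fails."""
    if connect_remote is not None:
        try:
            return connect_remote()   # hypothetical factory wrapping the cloud client
        except Exception:
            pass                      # offline or misconfigured: use the local store
    return InMemoryVectorStore()
```

For a single PDF with a few hundred chunks, the O(n) scan is fast enough that the fallback is a practical default, not just an emergency path.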

5. Retrieval

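Hybrid search combining two retrievers and a fusion step:

  • Dense search (Pinecone): finds similar chunks via cosine similarity between vectors. With TF-IDF vectors this is lexical similarity; it becomes semantic once neural embeddings are swapped in
  • Sparse search (BM25): finds keyword-matching chunks via term-frequency scoring, normalized for chunk length
  • Reciprocal Rank Fusion (RRF): merges both ranked lists by summing 1/(k + rank) per chunk, so chunks ranked high in both lists get boosted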
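A compact sketch of the sparse ranking and the fusion step. The BM25 parameters `k1=1.5` and `b=0.75` and the RRF constant `k=60` are common defaults, assumed here, not values confirmed by the project; the function names are illustrative.

```python
import math
from collections import Counter


def bm25_rank(query_terms, docs, k1=1.5, b=0.75):
    """Rank tokenized docs ({id: [terms]}) against the query with Okapi BM25."""
    n = len(docs)
    avg_len = sum(len(terms) for terms in docs.values()) / max(n, 1)
    doc_freq = Counter(t for terms in docs.values() for t in set(terms))
    scores = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(terms) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        if score > 0:
            scores[doc_id] = score      # docs with no query term are left out
    return sorted(scores, key=scores.get, reverse=True)


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked id lists: each list adds 1 / (k + rank) to an id's score."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```

RRF only looks at ranks, never raw scores, which is why it can merge cosine similarities and BM25 scores without any score normalization.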

6. Reranking

LLM-based reranking — after retrieving top 20 candidates, asks the LLM to score each chunk's relevance (0-10). Reorders by relevance score and returns top 5. A lightweight alternative to cross-encoder models.
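The reranking loop might look roughly like this. The prompt wording, the `ask_llm` callable (any function that sends a prompt and returns the model's text), and the function names are assumptions for illustration; the project's actual prompt and client code are not shown in this README.

```python
import re
from typing import Callable


def parse_score(reply: str) -> float:
    """Take the first number in the model's reply and cap it to the 0..10 range."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    return min(10.0, max(0.0, float(match.group()))) if match else 0.0


def rerank(question: str, candidates: list[str],
           ask_llm: Callable[[str], str], top_n: int = 5) -> list[str]:
    scored = []
    for chunk in candidates:
        prompt = (
            "Rate from 0 to 10 how relevant the passage is to the question. "
            "Reply with the number only.\n\n"
            f"Question: {question}\nPassage: {chunk}"
        )
        scored.append((parse_score(ask_llm(prompt)), chunk))
    # Stable sort: chunks with equal scores keep their retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]
```

Parsing defensively matters because small models do not always reply with a bare number; an unparseable reply is treated as a score of 0 here.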

7. Conversation Memory

Sliding window + auto-summarization:

  • Keeps last 6 messages in full
  • When history exceeds 12 messages, older messages are summarized by the LLM into 2-3 sentences
  • Summary + recent messages are included in every prompt
  • The LLM is stateless — memory is managed entirely in application code
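The window-plus-summary behaviour described above can be sketched in a few lines. The thresholds (6 and 12) come from the list above; the class name `ConversationMemory` and the `summarize` callable, which stands in for the LLM summarization call, are hypothetical.

```python
from typing import Callable

KEEP_RECENT = 6     # messages kept verbatim
SUMMARIZE_AT = 12   # summarize once history grows past this many messages


class ConversationMemory:
    def __init__(self, summarize: Callable[[str], str]):
        self.summarize = summarize                   # hypothetical hook that calls the LLM
        self.summary = ""
        self.messages: list[tuple[str, str]] = []    # (role, text)

    def add(self, role: str, text: str) -> None:
        self.messages.append((role, text))
        if len(self.messages) > SUMMARIZE_AT:
            old = self.messages[:-KEEP_RECENT]
            self.messages = self.messages[-KEEP_RECENT:]
            transcript = "\n".join(f"{r}: {t}" for r, t in old)
            if self.summary:
                # Fold the previous summary in so earlier context is not lost.
                transcript = f"Summary so far: {self.summary}\n{transcript}"
            self.summary = self.summarize(transcript)

    def as_prompt_block(self) -> str:
        lines = [f"Summary of earlier conversation: {self.summary}"] if self.summary else []
        lines += [f"{r}: {t}" for r, t in self.messages]
        return "\n".join(lines)
```

Because the model itself keeps no state, whatever `as_prompt_block()` returns is the only history the LLM sees on the next turn.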

8. Chat

Streaming SSE responses via Groq (Llama 3.1 8B). The prompt includes conversation memory + retrieved context + the user's question. Tokens stream to the frontend as they're generated.
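As a rough illustration of the last step, the sketch below assembles the prompt and frames tokens as Server-Sent Events. The prompt layout, the JSON payload shape, and the `[DONE]` marker are assumptions; `tokens` stands in for whatever iterator the LLM client yields while streaming.

```python
import json
from typing import Iterable, Iterator


def build_prompt(memory: str, context_chunks: list[str], question: str) -> str:
    """Combine conversation memory, retrieved context, and the user's question."""
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(context_chunks, start=1))
    return (
        "Answer using only the context below.\n\n"
        f"Conversation so far:\n{memory}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each token as one SSE frame: a 'data:' line followed by a blank line."""
    for token in tokens:
        # JSON-encode so a newline inside a token cannot break the SSE framing.
        payload = json.dumps({"token": token})
        yield f"data: {payload}\n\n"
    yield "data: [DONE]\n\n"   # end-of-stream marker (a convention, not part of SSE)
```

On the frontend, each frame arrives as one message event, so the UI can append `token` to the answer as soon as it is received.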

Design Decisions

Why TF-IDF instead of neural embeddings?

TF-IDF requires zero external dependencies: no PyTorch, no model downloads, no GPU. It works offline on any machine. The architecture supports swapping in neural embeddings (BGE, Sentence Transformers, Ollama) by changing one class. Because TF-IDF and BM25 are both lexical, the hybrid retrieves well when a question shares vocabulary with the document and is weaker on paraphrased questions until neural embeddings are swapped in.
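"Changing one class" usually means the rest of the pipeline depends only on a small interface. A hedged sketch of what that interface could look like, with hypothetical names (`Embedder`, `index_chunks`) and a toy implementation that exists only to show the shape:

```python
from typing import Protocol


class Embedder(Protocol):
    """Anything with a fixed `dim` and an `embed` method can plug into the pipeline."""
    dim: int

    def embed(self, text: str) -> list[float]: ...


class CharCountEmbedder:
    """Toy embedder used only to show the interface; not a useful embedding."""
    dim = 2

    def embed(self, text: str) -> list[float]:
        return [float(len(text)), float(text.count(" "))]


def index_chunks(chunks: list[str], embedder: Embedder) -> list[tuple[str, list[float]]]:
    # The indexing code never names a concrete embedder class.
    return [(chunk, embedder.embed(chunk)) for chunk in chunks]
```

A neural embedder would implement the same two members, so swapping it in would not touch chunking, storage, or retrieval code. Note that the vector index has to be rebuilt when `dim` changes.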

Why Pinecone over self-hosted?

Free tier with 2GB storage, no infrastructure to manage, HNSW indexing out of the box. The code falls back to in-memory search if Pinecone is unavailable, so it works without any cloud dependency.

Why LLM reranking instead of cross-encoder?

Cross-encoders (ms-marco-MiniLM) require PyTorch (~800MB). LLM reranking uses the same Groq API we already have for chat — no new dependencies. Trade-off: ~500ms extra latency vs ~20ms for a cross-encoder.

Why Groq over OpenAI?

Free tier, fast inference (Llama 3.1 8B), no cost for experimentation. The LangChain abstraction makes it trivial to swap providers.

jaswanthgollamudi@yahoo.com