A production-style Retrieval-Augmented Generation pipeline built from scratch. Upload a PDF, ask questions, get answers grounded in your document.
Built to understand and demonstrate every step of the RAG pipeline — from PDF parsing to hybrid search to streaming LLM responses.
Architecture
PDF Upload → pymupdf parsing → recursive chunking → TF-IDF embedding
→ Pinecone vector storage → hybrid retrieval (dense + BM25 + RRF)
→ LLM reranking → Groq/Llama 3.1 streaming response
┌─────────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│ PDF Upload  │────→│ Chunker  │────→│ Embedder │────→│ Pinecone │
└─────────────┘     └──────────┘     └──────────┘     └──────────┘
                                                           │
┌─────────────┐     ┌──────────┐     ┌──────────┐          │
│  Groq LLM   │←────│ Reranker │←────│ Retriever│←─────────┘
│ (streaming) │     │  (LLM)   │     │ (hybrid) │
└─────────────┘     └──────────┘     └──────────┘
What each step does
1. PDF Parsing
Extracts text from PDFs using pymupdf (imported as fitz), a free, open-source parser. It generally handles multi-column layouts, tables, and complex formatting better than PyPDF2.
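A minimal sketch of this step, assuming one text string per page is enough for the chunker. The names `extract_pages` and `clean_page_text` are hypothetical and not taken from the project code; `fitz.open` and `page.get_text("text")` are the standard PyMuPDF calls.

```python
from pathlib import Path


def clean_page_text(text: str) -> str:
    """Strip trailing whitespace per line and drop leading/trailing blank lines."""
    lines = [line.rstrip() for line in text.splitlines()]
    return "\n".join(lines).strip()


def extract_pages(pdf_path: str) -> list[str]:
    """Return cleaned text for each page of the PDF, in page order."""
    import fitz  # PyMuPDF; imported lazily so clean_page_text works without it

    if not Path(pdf_path).is_file():
        raise FileNotFoundError(pdf_path)
    with fitz.open(pdf_path) as doc:
        return [clean_page_text(page.get_text("text")) for page in doc]
```

Keeping pages separate makes it possible to attach a page number to each chunk later, which is useful for citing sources in answers.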
2. Chunking
Recursive character splitting — tries to split on natural boundaries in order: paragraphs → newlines → sentences → words. Keeps semantically related text together while respecting size limits.
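The splitting order above can be sketched as follows. This is an illustrative version, not the project's actual chunker: the function name, the 500-character default, and the exact separator strings are assumptions, and chunk overlap is left out.

```python
def recursive_split(
    text: str,
    max_chars: int = 500,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),
) -> list[str]:
    """Split text into chunks of at most max_chars, preferring coarse boundaries."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No boundary left to try: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_chars:
            current = piece
        else:
            # The piece alone is too large: recurse with finer separators.
            chunks.extend(recursive_split(piece, max_chars, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks
```

Short paragraphs stay whole, and only a paragraph that exceeds the limit is broken down further, first by newline, then by sentence, then by word.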
3. Embedding
TF-IDF vectors with fixed dimensions. Converts text to numerical vectors based on word importance. Each word's score = how often it appears in this chunk × how rare it is across all chunks. Designed to be swapped for neural embeddings (BGE, OpenAI, Sentence Transformers) when available.
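One way to get a fixed vector size from TF-IDF is to hash each word into a bucket. The sketch below assumes that approach; the README does not say how the project fixes the dimension, and the class name `HashedTfidf`, the 1024 default, and the smoothed IDF formula are illustrative choices.

```python
import math
import re
import zlib
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


class HashedTfidf:
    """TF-IDF with a fixed output size: words are hashed into `dim` buckets."""

    def __init__(self, dim: int = 1024):
        self.dim = dim
        self.idf: dict[str, float] = {}

    def fit(self, chunks: list[str]) -> "HashedTfidf":
        n = len(chunks)
        doc_freq = Counter(w for chunk in chunks for w in set(tokenize(chunk)))
        # Smoothed IDF: rarer words get larger weights, and no weight is zero.
        self.idf = {w: math.log((1 + n) / (1 + df)) + 1.0 for w, df in doc_freq.items()}
        return self

    def embed(self, text: str) -> list[float]:
        vec = [0.0] * self.dim
        counts = Counter(tokenize(text))
        total = sum(counts.values()) or 1
        for word, count in counts.items():
            # crc32 is stable across runs, unlike Python's salted hash().
            bucket = zlib.crc32(word.encode("utf-8")) % self.dim
            # Words unseen during fit fall back to a weight of 1.0.
            vec[bucket] += (count / total) * self.idf.get(word, 1.0)
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]  # unit length, so dot product equals cosine
```

Hash collisions merge unrelated words into one bucket, so a larger `dim` trades memory for accuracy.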
4. Vector Storage
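Pinecone is a managed cloud vector database with approximate nearest-neighbor (ANN) indexing. Vectors persist across restarts, and graph indexes such as HNSW answer queries in roughly O(log n) time instead of the O(n) of a brute-force scan, at the cost of occasionally missing a true nearest neighbor. Falls back to in-memory search if Pinecone is unavailable.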
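The fallback can be sketched without touching the Pinecone client at all. Here `remote_query` is a hypothetical callable that wraps whatever Pinecone query the project makes; `InMemoryIndex` and `query_with_fallback` are likewise illustrative names.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryIndex:
    """Brute-force O(n) search over vectors held in a dict."""

    def __init__(self) -> None:
        self.vectors: dict[str, list[float]] = {}

    def upsert(self, chunk_id: str, vector: list[float]) -> None:
        self.vectors[chunk_id] = vector

    def query(self, vector: list[float], top_k: int = 5) -> list[tuple[str, float]]:
        scored = [(cid, cosine(vector, v)) for cid, v in self.vectors.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


def query_with_fallback(remote_query, local_index: InMemoryIndex, vector, top_k: int = 5):
    """Try the remote index first; on any failure, answer from the local copy."""
    try:
        return remote_query(vector, top_k)
    except Exception:
        return local_index.query(vector, top_k)
```

For the fallback to return useful results, every vector sent to the remote index also has to be upserted into the local one.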
5. Retrieval
Hybrid search combining two approaches:
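- Dense search (Pinecone): finds similar chunks via cosine similarity between embedding vectors. With TF-IDF vectors this captures weighted word overlap; semantic similarity requires neural embeddings.
- Sparse search (BM25): finds keyword-matching chunks via term frequency scoring with document-length normalization
- Reciprocal Rank Fusion: merges both ranked lists by summing 1 / (k + rank) for each chunk across the lists, so chunks that rank high in both get boosted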
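The sparse ranking and the fusion step can be sketched in plain Python. The BM25 parameters (k1 = 1.5, b = 0.75) and the RRF constant k = 60 are common defaults assumed here, not values confirmed by the project, and the function names are illustrative.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query: str, chunks: dict[str, str], k1: float = 1.5, b: float = 0.75) -> list[str]:
    """Rank chunk ids by BM25 score, best first; chunks with no match are dropped."""
    docs = {cid: Counter(tokenize(text)) for cid, text in chunks.items()}
    n = len(docs)
    avg_len = sum(sum(c.values()) for c in docs.values()) / max(n, 1)
    scores: dict[str, float] = {}
    for cid, counts in docs.items():
        length = sum(counts.values())
        score = 0.0
        for term in set(tokenize(query)):
            tf = counts.get(term, 0)
            if tf == 0:
                continue
            df = sum(1 for c in docs.values() if term in c)
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * length / avg_len))
        if score > 0:
            scores[cid] = score
    return sorted(scores, key=scores.get, reverse=True)


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Sum 1 / (k + rank) for each chunk across all ranked lists."""
    fused: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            fused[chunk_id] = fused.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```

RRF only looks at rank positions, so the cosine scores and BM25 scores never need to be put on a common scale.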
6. Reranking
LLM-based reranking — after retrieving top 20 candidates, asks the LLM to score each chunk's relevance (0-10). Reorders by relevance score and returns top 5. A lightweight alternative to cross-encoder models.
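A rough sketch of the scoring loop, assuming one LLM call per candidate. `ask_llm` is a hypothetical callable that sends a prompt and returns the reply text; the prompt wording and the parsing rule are illustrative, not the project's.

```python
import re
from typing import Callable


def parse_score(reply: str) -> float:
    """Take the first number in the reply and clamp it to 0-10; 0 if none is found."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    return min(10.0, max(0.0, float(match.group()))) if match else 0.0


def rerank(question: str, chunks: list[str], ask_llm: Callable[[str], str], top_n: int = 5) -> list[str]:
    scored = []
    for position, chunk in enumerate(chunks):
        prompt = (
            "Rate from 0 to 10 how relevant the passage is to the question. "
            "Reply with the number only.\n"
            f"Question: {question}\nPassage: {chunk}"
        )
        scored.append((parse_score(ask_llm(prompt)), -position, chunk))
    scored.sort(reverse=True)  # ties keep the original retrieval order
    return [chunk for _, _, chunk in scored[:top_n]]
```

Parsing defensively matters because a small model does not always follow the "number only" instruction and may reply with something like "Score: 9".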
7. Conversation Memory
Sliding window + auto-summarization:
- Keeps last 6 messages in full
- When history exceeds 12 messages, older messages are summarized by the LLM into 2-3 sentences
- Summary + recent messages are included in every prompt
- The LLM is stateless — memory is managed entirely in application code
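The window and summarization thresholds above can be sketched as a small class. `summarize` is a hypothetical callable wrapping the LLM call that produces the 2-3 sentence summary; the class and method names are illustrative.

```python
from typing import Callable

KEEP_RECENT = 6      # messages kept verbatim
SUMMARIZE_OVER = 12  # history length that triggers summarization


class ConversationMemory:
    def __init__(self, summarize: Callable[[str], str]):
        self.summarize = summarize  # hypothetical wrapper around an LLM call
        self.summary = ""
        self.messages: list[dict[str, str]] = []

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})
        if len(self.messages) > SUMMARIZE_OVER:
            older = self.messages[:-KEEP_RECENT]
            self.messages = self.messages[-KEEP_RECENT:]
            transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
            if self.summary:
                # Fold the previous summary in so earlier context is not lost.
                transcript = f"Earlier summary: {self.summary}\n{transcript}"
            self.summary = self.summarize(transcript)

    def as_prompt(self) -> str:
        lines = [f"Summary of earlier conversation: {self.summary}"] if self.summary else []
        lines += [f"{m['role']}: {m['content']}" for m in self.messages]
        return "\n".join(lines)
```

Because the model itself keeps no state, the string returned by `as_prompt` is the only conversation history it ever sees.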
8. Chat
Streaming SSE responses via Groq (Llama 3.1 8B). The prompt includes conversation memory + retrieved context + the user's question. Tokens stream to the frontend as they're generated.
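A sketch of the two pieces that sit around the LLM call: assembling the prompt and framing tokens as Server-Sent Events. The prompt layout, the JSON payload shape, and the `[DONE]` end marker are assumptions for illustration; the token iterable stands in for whatever the Groq client yields.

```python
import json
from typing import Iterable, Iterator


def build_prompt(memory: str, context_chunks: list[str], question: str) -> str:
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(context_chunks, start=1))
    history = memory or "(none)"
    return (
        "Answer using only the context below.\n\n"
        f"Conversation so far:\n{history}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each token in an SSE frame; JSON encoding keeps newlines on one line."""
    for token in tokens:
        payload = json.dumps({"token": token})
        yield f"data: {payload}\n\n"
    yield "data: [DONE]\n\n"  # end-of-stream marker (a convention assumed here)
```

JSON-encoding each token avoids a subtle SSE problem: a raw newline inside a `data:` line would otherwise end the frame early.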
Design Decisions
Why TF-IDF instead of neural embeddings?
TF-IDF needs no heavy ML dependencies: no PyTorch, no model downloads, no GPU. It works offline on any machine. The architecture supports swapping in neural embeddings (BGE, Sentence Transformers, Ollama) by changing one class. TF-IDF plus BM25 hybrid search retrieves well when the question shares vocabulary with the document, but both signals are lexical, so paraphrased questions are where neural embeddings help most.
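The "changing one class" claim implies that the rest of the pipeline depends only on a small embedding interface. A sketch of what that could look like, with hypothetical names throughout and toy embedders standing in for TF-IDF and a neural model:

```python
import math
from typing import Protocol


class Embedder(Protocol):
    def embed(self, text: str) -> list[float]: ...


class LetterCountEmbedder:
    """Toy stand-in for the TF-IDF class: one dimension per letter a-z."""

    def embed(self, text: str) -> list[float]:
        lowered = text.lower()
        return [float(lowered.count(chr(ord("a") + i))) for i in range(26)]


class UnitNormEmbedder:
    """Second implementation: wraps another embedder and normalizes its output."""

    def __init__(self, inner: Embedder):
        self.inner = inner

    def embed(self, text: str) -> list[float]:
        vec = self.inner.embed(text)
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]


def index_chunks(chunks: list[str], embedder: Embedder) -> dict[str, list[float]]:
    # Depends only on the interface, so any embedder can be passed in.
    return {f"chunk-{i}": embedder.embed(c) for i, c in enumerate(chunks)}
```

One caveat when swapping: vectors from different embedders are not comparable, so every chunk has to be re-embedded and the index rebuilt.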
Why Pinecone over self-hosted?
Free tier with 2GB storage, no infrastructure to manage, and approximate nearest-neighbor indexing out of the box. The code falls back to in-memory search if Pinecone is unavailable, so retrieval does not strictly depend on the cloud service.
Why LLM reranking instead of cross-encoder?
Cross-encoders (ms-marco-MiniLM) require PyTorch (~800MB). LLM reranking uses the same Groq API we already have for chat — no new dependencies. Trade-off: ~500ms extra latency vs ~20ms for a cross-encoder.
Why Groq over OpenAI?
Free tier, fast inference (Llama 3.1 8B), no cost for experimentation. The LangChain abstraction makes it trivial to swap providers.