RAG Document Agent

java · spring ai · pgvector · ollama · groq · github

An AI agent that answers questions from your own documents using Retrieval Augmented Generation (RAG) — Java + Spring AI + pgvector + Ollama + Groq.

No fine-tuning. No retraining. Drop documents in, ask anything.

The problem

LLMs know what they were trained on. They know nothing about your company policies, internal docs, or private knowledge base.

RAG fixes this — give the LLM your documents at query time. It answers from YOUR content, not its training data.

How it works

The system has two phases — indexing happens once, querying happens every time someone asks a question.

INDEXING (once at startup)

  .txt files in /docs folder
               │
               ▼
  ┌─────────────────────────┐
  │  Chunking               │
  │  Split into 500-char    │
  │  chunks with 100-char   │
  │  overlap between them   │
  └────────────┬────────────┘
               │
               ▼
  ┌─────────────────────────┐
  │  Embedding (Ollama)     │
  │  Each chunk →           │
  │  768-dimensional vector │
  └────────────┬────────────┘
               │
               ▼
  ┌─────────────────────────┐
  │  Storage (pgvector)     │
  │  Vectors + original     │
  │  text stored in         │
  │  Postgres               │
  └─────────────────────────┘

QUERYING (every question)

  "What is the leave policy?"
               │
               ▼
  ┌─────────────────────────┐
  │  Embed Question         │
  │  Ollama converts →      │
  │  query vector           │
  └────────────┬────────────┘
               │
               ▼
  ┌─────────────────────────┐
  │  Vector Search          │
  │  pgvector finds the 3   │
  │  most similar chunks    │
  └────────────┬────────────┘
               │
               ▼
  ┌─────────────────────────┐
  │  Context Injection      │
  │  3 chunks injected      │
  │  into the LLM prompt    │
  └────────────┬────────────┘
               │
               ▼
  ┌─────────────────────────┐
  │  Answer (Groq)          │
  │  LLM reads context →    │
  │  answers from YOUR      │
  │  documents              │
  └─────────────────────────┘
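Conceptually, the Vector Search step above is a nearest-neighbor lookup by cosine similarity: the query vector is compared against every stored chunk vector, and the closest ones win. A minimal self-contained sketch with toy 3-dimensional vectors (real embeddings are 768-dimensional, and pgvector does this scan with indexes instead of a loop):

```java
import java.util.List;

public class VectorSearchSketch {

    // Cosine similarity: dot product of the vectors divided by the
    // product of their magnitudes. Ranges from -1 to 1; higher = more similar.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Return the index of the chunk vector most similar to the query vector.
    static int mostSimilar(double[] query, List<double[]> chunks) {
        int best = -1;
        double bestScore = -2; // below the cosine minimum of -1
        for (int i = 0; i < chunks.size(); i++) {
            double score = cosine(query, chunks.get(i));
            if (score > bestScore) {
                bestScore = score;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<double[]> chunks = List.of(
            new double[]{0.9, 0.1, 0.0},   // stands in for a "leave policy" chunk
            new double[]{0.0, 0.8, 0.6});  // stands in for a "working hours" chunk
        double[] query = {0.8, 0.2, 0.1};  // embedded "What is the leave policy?"
        System.out.println(mostSimilar(query, chunks)); // prints 0
    }
}
```

This is why retrieval is semantic rather than keyword-based: the question and the chunk never need to share words, only nearby vectors.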

Real output

USER:  What is the leave policy?

AGENT: All full-time employees are entitled to
       18 days annual leave. Part-time employees
       receive leave on a pro-rata basis.
       Up to 5 days can be carried forward
       with manager approval.

USER:  What are the working hours?

AGENT: Standard working hours are 9 AM to 6 PM
       Monday to Friday. Core hours where all
       must be available are 10 AM to 4 PM.

USER:  What is the remote work policy?

AGENT: Employees can work remotely up to 3 days
       per week. At least 2 days must be spent
       in the office.

None of this was in the LLM's training data. All of it came from the documents we dropped in.

Architecture decisions

Why split into chunks instead of embedding whole documents?

Embedding a 2,000-word document produces one vector, so specific facts get diluted into the average. A 500-char chunk about "annual leave: 18 days" produces a focused vector that retrieves precisely when someone asks about leave. Chunk size is a tradeoff: smaller = more precise retrieval, larger = more context per chunk.
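The fixed-size chunking described above can be sketched in a few lines. This is an illustrative standalone version, not the project's actual splitter (Spring AI ships its own text splitters); the class and method names are made up for the example:

```java
import java.util.ArrayList;
import java.util.List;

public class Chunker {
    static final int CHUNK_SIZE = 500;  // characters per chunk
    static final int OVERLAP = 100;     // characters shared between neighbors

    static List<String> chunk(String text) {
        List<String> chunks = new ArrayList<>();
        int step = CHUNK_SIZE - OVERLAP; // advance 400 chars per chunk
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + CHUNK_SIZE, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) break; // final (possibly short) chunk
        }
        return chunks;
    }

    public static void main(String[] args) {
        String doc = "x".repeat(1200);
        List<String> chunks = chunk(doc);
        // 1200 chars with step 400 → chunks covering 0-500, 400-900, 800-1200
        System.out.println(chunks.size()); // prints 3
        // The last 100 chars of one chunk are the first 100 of the next:
        System.out.println(chunks.get(0).substring(400)
                .equals(chunks.get(1).substring(0, 100))); // prints true
    }
}
```

The overlap is what makes boundary sentences survive splitting, which is the subject of the next question below.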

Why Ollama for embeddings and Groq for chat?

Groq only supports chat models — no embedding API. Ollama runs embedding models locally — free, private, no API limits. Two providers, two jobs, no overlap.
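The two-provider split lands in configuration. A sketch of what the application.yml might look like, assuming Spring AI's OpenAI starter pointed at Groq's OpenAI-compatible endpoint and the Ollama starter for embeddings (property keys follow the Spring AI starters but should be verified against the milestone in use; the model names are illustrative):

```yaml
spring:
  ai:
    openai:                     # Groq exposes an OpenAI-compatible API
      base-url: https://api.groq.com/openai
      api-key: ${GROQ_API_KEY}
      chat:
        options:
          model: llama-3.3-70b-versatile
    ollama:
      base-url: http://localhost:11434
      embedding:
        options:
          model: nomic-embed-text
```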

Why pgvector over a dedicated vector DB like Pinecone?

pgvector is a Postgres extension. If you already run Postgres, there is no new infrastructure and no new operational burden. For under a million vectors, pgvector's performance is comparable to dedicated vector DBs. Right tool for the scale.
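Under the hood, the vector store amounts to one extension and one table. A sketch in SQL (table and column names are illustrative; Spring AI's PgVectorStore manages its own schema):

```sql
-- Enable the extension once per database.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE doc_chunks (
    id        bigserial PRIMARY KEY,
    content   text,                -- the original chunk text
    embedding vector(768)          -- nomic-embed-text output dimension
);

-- Top-3 most similar chunks; <=> is pgvector's cosine distance operator.
SELECT content
FROM doc_chunks
ORDER BY embedding <=> '[0.12, -0.08, ...]'::vector
LIMIT 3;
```

Because it is just a table, the chunks live next to the rest of your data and get backed up with it.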

Why a 100-char overlap between chunks?

Documents have sentences that cross chunk boundaries. Without overlap, "...leave is 18 days. Medical certificate..." gets split and each chunk loses half the context. With overlap, both chunks contain the boundary sentence, which noticeably improves retrieval accuracy.

The data privacy consideration

Document chunks are sent to Groq (an external service) when forming answers. For sensitive internal documents, use Ollama for chat too, or deploy a private LLM. Embeddings already stay local (Ollama); only the retrieved chunks reach the external LLM.
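A sketch of the fully local variant: point chat at Ollama as well, so no chunk ever leaves the machine (property keys follow Spring AI's Ollama starter; the chat model name is illustrative):

```yaml
spring:
  ai:
    ollama:
      base-url: http://localhost:11434
      chat:
        options:
          model: llama3.1          # any locally pulled chat model
      embedding:
        options:
          model: nomic-embed-text
```

The tradeoff is answer quality and speed: a local chat model is typically smaller and slower than Llama 3.3 70B on Groq.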

What this demonstrates

  • RAG pattern — retrieve relevant context, then generate answer
  • Embedding and vector search — semantic similarity, not keyword match
  • Chunking strategy — why size and overlap matter
  • Multi-provider Spring AI — Ollama for embeddings, Groq for chat
  • pgvector — vector search inside Postgres, no extra infra

Tech stack

  Layer           Technology
  Language        Java 17
  Framework       Spring Boot 3.3.4
  AI integration  Spring AI 1.0.0-M6
  LLM (chat)      Llama 3.3 70B via Groq
  Embeddings      nomic-embed-text via Ollama
  Vector store    pgvector (Postgres extension)
  Build           Maven

What's next

  • REST API endpoint — accept questions via HTTP, not just startup
  • Multi-format support — PDF, Word docs, not just .txt
  • Hybrid search — combine vector similarity with keyword search
  • Conversation memory — multi-turn questions that reference each other
  • Source citations — show exactly which document chunk answered the question