Sending our proprietary codebase to an external API felt like handing the keys to the castle to a stranger. Our developers needed an intelligent agent to query internal architecture wikis and debug logs, but our security team immediately blocked any cloud-based LLM. I had to build it entirely on-prem. It turns out, wiring up a local agentic RAG pipeline is not just fundamentally more secure—it is incredibly fast if you arrange the right open-source pieces.
The Problem
Most tutorials stop at standard RAG. Standard RAG is basically a parlor trick: you embed a user query, run a cosine similarity search for the top-K chunks, and dump them into an LLM prompt. That works for basic Q&A, but it fails completely when you need the system to reason about complex workflows.
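To make the contrast concrete, here is the entirety of what "standard RAG" retrieval amounts to: a cosine similarity search over precomputed embeddings. This is a minimal numpy sketch with toy 2-D vectors (`top_k_cosine` is an illustrative name, not part of the pipeline described later):

```python
import numpy as np

def top_k_cosine(query_vec, doc_vecs, k=2):
    """Return indices of the k rows most similar to the query by cosine."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity per document
    return np.argsort(scores)[::-1][:k]

docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query = np.array([1.0, 0.1])
print(top_k_cosine(query, docs))  # [0 2] -- docs 0 and 2 point the same way
```

That is the whole trick: no decisions, no iteration, no tools. Whatever the top-K chunks are, they get pasted into the prompt.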
We needed an agent.
An agent doesn't just blindly fetch data. It decides which database to query, determines if it needs to rewrite its own search terms to find better context, and can execute local tools (like running a Python script to parse a CSV) based on what it reads.
The conventional wisdom is that you need a massive 1T+ parameter model to handle tool-calling and agentic loops. I refused to accept that. I wanted to see if we could orchestrate a capable agent using open-source models running entirely on our own hardware. We needed a vector database that wouldn't choke on millions of embeddings, an embedding model that actually understood our technical jargon, and an LLM fast enough to handle multiple reasoning steps without keeping the user waiting for a minute.
Architecture & Design
To keep everything local, I broke the system down into three distinct layers: the Brain (Ollama + Llama 3 8B Instruct), the Memory (Milvus standalone), and the Translator (HuggingFace BGE-M3 for embeddings).
The control flow is a loop. The Agent Coordinator (written in plain Python, no heavy frameworks) talks to Ollama; if the LLM determines it needs more information to answer the prompt, it outputs a JSON string requesting a specific tool. The coordinator intercepts that request, runs the tool, and feeds the observation back to the LLM, which either answers or asks for another tool call.
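Stripped to its essence, the interception step is just parse-and-dispatch. A minimal sketch, with a hypothetical `TOOLS` registry and a stub tool in place of the real search:

```python
import json

TOOLS = {
    # Stub standing in for the real Milvus-backed search tool
    "search_internal_docs": lambda query: f"[docs matching '{query}']",
}

def dispatch(llm_output: str):
    """Parse the model's JSON and either run a tool or surface the answer."""
    msg = json.loads(llm_output)
    if "final_answer" in msg:
        return ("answer", msg["final_answer"])
    tool = TOOLS[msg["tool"]]
    return ("observation", tool(msg["query"]))

kind, payload = dispatch('{"tool": "search_internal_docs", "query": "auth flow"}')
print(kind, payload)  # observation [docs matching 'auth flow']
```

The observation string gets appended to the running transcript, and the loop goes back to the model.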
Implementation
Let's look at the actual code. I skipped the heavy frameworks and wrote a tight Python loop.
First, we need to spin up the local infrastructure. For the vector store, Milvus is my go-to because it scales horizontally if we ever need to move beyond a single node, but it runs perfectly fine locally using Docker.
```yaml
# docker-compose.yml for Local Milvus
version: '3.5'
services:
  etcd:
    image: quay.io/coreos/etcd:v3.5.5
    environment:
      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
      - ETCD_QUOTA_BACKEND_BYTES=4294967296
      - ETCD_SNAPSHOT_COUNT=50000
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/etcd:/etcd
    command: etcd --advertise-client-urls=http://127.0.0.1:2379 --listen-client-urls=http://0.0.0.0:2379 --data-dir=/etcd

  minio:
    image: minio/minio:RELEASE.2023-03-20T20-16-18Z
    environment:
      MINIO_ACCESS_KEY: minioadmin
      MINIO_SECRET_KEY: minioadmin
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/minio:/minio_data
    command: minio server /minio_data

  standalone:
    image: milvusdb/milvus:v2.3.0
    command: ["milvus", "run", "standalone"]
    environment:
      ETCD_ENDPOINTS: etcd:2379
      MINIO_ADDRESS: minio:9000
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/milvus:/var/lib/milvus
    ports:
      - "19530:19530"
    depends_on:
      - "etcd"
      - "minio"
```
With Milvus running on port 19530, we handle the embeddings and the search. I used BAAI/bge-m3 because it handles multiple languages and has a massive 8k token context window for embeddings.
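Before any of that matters, the wiki pages have to be split into chunks for embedding. Nothing fancy is required; a simple overlapping character-window splitter works surprisingly well (`chunk_text` is an illustrative helper, not part of the pipeline code below):

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping windows so a sentence straddling a
    boundary still appears whole in at least one chunk."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

pages = chunk_text("A" * 2500, size=1000, overlap=200)
print([len(p) for p in pages])  # [1000, 1000, 900, 100]
```

Each chunk then gets embedded with bge-m3 and inserted into the `internal_docs` collection alongside its raw text.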
Here is the core retrieval function. Notice that I enforce a minimum similarity score: never return terrible matches just to fill the context window.
```python
from pymilvus import Collection, connections
from sentence_transformers import SentenceTransformer

# 1. Initialize local embedding model
embedder = SentenceTransformer("BAAI/bge-m3")

# 2. Connect to local Milvus
connections.connect(host="127.0.0.1", port="19530")
collection = Collection("internal_docs")

def search_internal_docs(query: str, top_k: int = 3) -> str:
    """Tool for the agent to search engineering wikis."""
    query_vector = embedder.encode([query])[0].tolist()

    search_params = {
        "metric_type": "IP",  # Inner Product for BGE models
        "params": {"nprobe": 10},
    }

    results = collection.search(
        data=[query_vector],
        anns_field="embedding",
        param=search_params,
        limit=top_k,
        output_fields=["text_chunk"]
    )

    context = []
    for hits in results:
        for hit in hits:
            # Drop garbage matches
            if hit.distance < 0.5:
                continue
            context.append(hit.entity.get("text_chunk"))

    return "\n---\n".join(context) if context else "No relevant documents found."
```
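One subtlety with `metric_type: "IP"`: inner product only behaves like cosine similarity when the vectors are unit-length, so embeddings should be normalized at both ingest and query time (in `sentence-transformers`, the `normalize_embeddings=True` flag on `encode`). A quick numpy illustration of why:

```python
import numpy as np

a = np.array([3.0, 4.0])   # embedding with norm 5
b = np.array([6.0, 8.0])   # same direction, norm 10

raw_ip = a @ b             # scales with magnitude, not just similarity
cos = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
print(raw_ip, cos, round(float(a_n @ b_n), 6))  # 50.0 1.0 1.0
```

On unit vectors, inner product and cosine coincide, which is what makes the `distance < 0.5` cutoff a meaningful similarity threshold rather than an arbitrary magnitude filter.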
The magic happens in the agent loop. I run Llama 3 locally via Ollama. The agent is prompted with available tools and operates in a Thought -> Action -> Observation loop until it finds the answer.
```python
import requests
import json

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"

def run_agent(user_query: str):
    system_prompt = """You are an engineering assistant. You have access to a tool called 'search_internal_docs'.
If you need to search, output EXACTLY this JSON format and nothing else:
{"tool": "search_internal_docs", "query": "your search terms"}
If you have the final answer, output:
{"final_answer": "your answer"}"""

    history = f"User: {user_query}\n"

    # Cap the loop to prevent infinite recursion
    for step in range(5):
        prompt = system_prompt + "\n" + history

        response = requests.post(OLLAMA_URL, json={
            "model": "llama3",
            "prompt": prompt,
            "stream": False,
            "format": "json"
        }).json()

        output = response.get("response", "{}")
        try:
            parsed = json.loads(output)
        except json.JSONDecodeError:
            return "Agent malfunctioned and returned invalid JSON."

        if "final_answer" in parsed:
            return parsed["final_answer"]

        if parsed.get("tool") == "search_internal_docs":
            search_query = parsed.get("query")
            print(f"Agent Action: Searching for '{search_query}'...")
            observation = search_internal_docs(search_query)
            history += f"Action: search_internal_docs({search_query})\nObservation: {observation}\n"

    return "Agent exhausted maximum steps without an answer."
```
Results & Benchmarks
I ran a benchmark comparing our local setup (running on a single machine with an RTX 4090 24GB) against a standard cloud setup (GPT-4o + Pinecone). The goal wasn't to beat GPT-4 on general reasoning, but to measure performance on our specific internal data retrieval tasks.
| Metric | Local (Llama 3 8B + Milvus) | Cloud (GPT-4o + Pinecone) |
|---|---|---|
| Cost / 1K Queries | ~$0 marginal (hardware is a sunk cost) | ~$14.50 |
| P95 Latency (Single Step) | 850ms | 1.2s |
| P95 Latency (Multi-Step) | 2.4s | 4.1s |
| Task Completion Rate | 88% | 94% |
| Data Privacy | 100% On-Prem | 0% (Data leaves network) |
The local setup is faster largely because we cut the WAN round-trip out entirely. When an agent has to loop three or four times to formulate an answer, shaving 400ms off each round-trip adds up fast: four steps means 1.6 seconds of pure waiting eliminated. We took a minor hit in absolute task completion, but 88% accuracy for an air-gapped system is a trade-off I will take every single time.
Tradeoffs & Gotchas
It wasn't all smooth sailing. Here is what broke during the build:
- VRAM Fragmentation: If you try to run an embedding model, an 8B LLM, and Milvus on a single 8GB GPU, you are going to have a bad time: constant Out-Of-Memory (OOM) errors. I had to explicitly offload Milvus to CPU/RAM and reserve the GPU strictly for Ollama and the `SentenceTransformer`.
- Context Window Degradation: Llama 3 handles 8k tokens, but its reasoning degrades sharply if you stuff the prompt with too much retrieved context. I initially fetched the top 10 documents, which overwhelmed the model. Dropping `top_k` to 3 and strictly filtering on the similarity score (`distance < 0.5`) forced the model to rely on highly relevant context only, vastly improving the final answers.
- JSON Formatting: Smaller local models are notoriously bad at outputting clean JSON. If you don't use Ollama's `format: "json"` parameter or a library like Outlines to constrain generation, your agent loop will crash parsing the output.
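Even with `format: "json"`, I would still parse defensively. A small salvage helper (hypothetical, not part of the loop above) that pulls the first JSON object out of a chatty response before giving up:

```python
import json

def extract_json(text: str):
    """Best effort: parse the whole string, else the outermost {...} block."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    start = text.find("{")
    end = text.rfind("}")
    if start != -1 and end > start:
        try:
            return json.loads(text[start:end + 1])
        except json.JSONDecodeError:
            return None
    return None

msg = extract_json('Sure! Here is the call: {"tool": "search_internal_docs", "query": "auth"}')
print(msg)  # {'tool': 'search_internal_docs', 'query': 'auth'}
```

Returning `None` instead of raising lets the loop re-prompt the model with an error message rather than crashing the whole turn.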
Lessons Learned
The biggest takeaway? You don't need a frontier model to build a useful agent. If you strictly define the tool signatures and maintain a clean prompt loop, an 8B parameter model running locally can punch way above its weight class.
Next up, I am exploring speculative decoding with vLLM to push the generation throughput even further, and I want to introduce a local Code Interpreter tool so the agent can execute Python scripts to plot metrics directly from our internal telemetry. Early tests show we can cut latency down by another 30%—but that's a story for another post.