
Zero-Cloud Agentic AI: Running Milvus and Local LLMs On-Prem

Ahad Khan, Agentic AI Engineer
March 5, 2026
7 min read

Sending our proprietary codebase to an external API felt like handing the keys to the castle to a stranger. Our developers needed an intelligent agent to query internal architecture wikis and debug logs, but our security team immediately blocked any cloud-based LLM. I had to build it entirely on-prem. It turns out, wiring up a local agentic RAG pipeline is not just fundamentally more secure—it is incredibly fast if you arrange the right open-source pieces.

The Problem

Most tutorials stop at standard RAG. Standard RAG is basically a parlor trick: you embed a user query, run a cosine similarity search for the top-K chunks, and dump them into an LLM prompt. That works for basic Q&A, but it fails completely when you need the system to reason about complex workflows.

We needed an agent.

An agent doesn't just blindly fetch data. It decides which database to query, determines if it needs to rewrite its own search terms to find better context, and can execute local tools (like running a Python script to parse a CSV) based on what it reads.

The conventional wisdom is that you need a massive 1T+ parameter model to handle tool-calling and agentic loops. I refused to accept that. I wanted to see if we could orchestrate a capable agent using open-source models running entirely on our own hardware. We needed a vector database that wouldn't choke on millions of embeddings, an embedding model that actually understood our technical jargon, and an LLM fast enough to handle multiple reasoning steps without keeping the user waiting for a minute.

Architecture & Design

To keep everything local, I broke the system down into three distinct layers: the Brain (Ollama + Llama 3 8B Instruct), the Memory (Milvus standalone), and the Translator (HuggingFace BGE-M3 for embeddings).

Here is how the control flow actually works:

[Diagram: user query → Agent Coordinator → Ollama (Llama 3); tool calls routed to Milvus search, observations fed back into the loop]

Notice the loop. The Agent Coordinator (written in plain Python, no heavy frameworks) talks to Ollama. If the LLM determines it needs more information to answer the prompt, it outputs a JSON string requesting a specific tool. The coordinator intercepts this, runs the tool, and feeds the observation back to the LLM.
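That intercept step can be as simple as a dict mapping tool names to callables. Here is a minimal sketch of the coordinator's dispatch; the stub tool body and names are illustrative, not the production code:

```python
import json

# Illustrative stub standing in for the real Milvus-backed tool
def search_internal_docs(query: str) -> str:
    return f"[docs matching '{query}']"

# Registry: the coordinator only runs tools it knows about
TOOLS = {"search_internal_docs": search_internal_docs}

def dispatch(llm_output: str) -> str:
    """Parse the model's JSON tool request and run the requested tool."""
    parsed = json.loads(llm_output)
    tool = TOOLS.get(parsed.get("tool"))
    if tool is None:
        return "Unknown tool requested."
    return tool(parsed.get("query", ""))

print(dispatch('{"tool": "search_internal_docs", "query": "auth service"}'))
# → [docs matching 'auth service']
```

Keeping dispatch this dumb is deliberate: the LLM never executes anything directly, it only emits a request that the coordinator validates against a whitelist.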

Implementation

Let's look at the actual code. I skipped the heavy frameworks and wrote a tight Python loop.

First, we need to spin up the local infrastructure. For the vector store, Milvus is my go-to because it scales horizontally if we ever need to move beyond a single node, but it runs perfectly fine locally using Docker.

yaml
# docker-compose.yml for Local Milvus
version: '3.5'
services:
  etcd:
    image: quay.io/coreos/etcd:v3.5.5
    environment:
      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
      - ETCD_QUOTA_BACKEND_BYTES=4294967296
      - ETCD_SNAPSHOT_COUNT=50000
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/etcd:/etcd
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd

  minio:
    image: minio/minio:RELEASE.2023-03-20T20-16-18Z
    environment:
      MINIO_ACCESS_KEY: minioadmin
      MINIO_SECRET_KEY: minioadmin
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/minio:/minio_data
    command: minio server /minio_data

  standalone:
    image: milvusdb/milvus:v2.3.0
    command: ["milvus", "run", "standalone"]
    environment:
      ETCD_ENDPOINTS: etcd:2379
      MINIO_ADDRESS: minio:9000
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/milvus:/var/lib/milvus
    ports:
      - "19530:19530"
    depends_on:
      - "etcd"
      - "minio"

With Milvus running on port 19530, the next step is embeddings and search. I used BAAI/bge-m3 because it handles multiple languages and supports a large 8k-token input window for embeddings.

Here is the core retrieval function. Notice how I enforce a similarity floor: with the IP metric, higher scores mean closer matches, so anything scoring below 0.5 gets dropped. Never return terrible matches just to fill the context window.

python
from pymilvus import Collection, connections
from sentence_transformers import SentenceTransformer

# 1. Initialize local embedding model
embedder = SentenceTransformer("BAAI/bge-m3")

# 2. Connect to local Milvus
connections.connect(host="127.0.0.1", port="19530")
collection = Collection("internal_docs")

def search_internal_docs(query: str, top_k: int = 3) -> str:
    """Tool for the agent to search engineering wikis."""
    # Normalize so inner product behaves like cosine similarity
    query_vector = embedder.encode([query], normalize_embeddings=True)[0].tolist()

    search_params = {
        "metric_type": "IP",  # Inner Product for BGE models
        "params": {"nprobe": 10},
    }

    results = collection.search(
        data=[query_vector],
        anns_field="embedding",
        param=search_params,
        limit=top_k,
        output_fields=["text_chunk"],
    )

    context = []
    for hits in results:
        for hit in hits:
            # Drop garbage matches (low IP score = weak similarity)
            if hit.distance < 0.5:
                continue
            context.append(hit.entity.get("text_chunk"))

    return "\n---\n".join(context) if context else "No relevant documents found."
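A quick sanity check on the metric choice: for L2-normalized vectors, the inner product equals the cosine similarity of the originals, which is why the 0.5 cutoff behaves like a cosine filter. This is a plain-Python sketch with illustrative numbers, not part of the pipeline:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # Scale to unit length (L2 norm = 1)
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [1.0, 2.0]
an, bn = normalize(a), normalize(b)

# For unit vectors, inner product == cosine similarity of the originals
assert abs(dot(an, bn) - cosine(a, b)) < 1e-9
print(round(dot(an, bn), 4))  # → 0.9839
```

This is also why the search function normalizes the query embedding: without normalization, raw IP scores are unbounded and a fixed 0.5 threshold would be meaningless.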

The magic happens in the agent loop. I run Llama 3 locally via Ollama. The agent is prompted with available tools and operates in a Thought -> Action -> Observation loop until it finds the answer.

python
import requests
import json

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"

def run_agent(user_query: str):
    system_prompt = """You are an engineering assistant. You have access to a tool called 'search_internal_docs'.
If you need to search, output EXACTLY this JSON format and nothing else:
{"tool": "search_internal_docs", "query": "your search terms"}
If you have the final answer, output:
{"final_answer": "your answer"}"""

    history = f"User: {user_query}\n"

    # Cap the loop to prevent infinite recursion
    for step in range(5):
        prompt = system_prompt + "\n" + history

        response = requests.post(OLLAMA_URL, json={
            "model": "llama3",
            "prompt": prompt,
            "stream": False,
            "format": "json"
        }).json()

        output = response.get("response", "{}")
        try:
            parsed = json.loads(output)
        except json.JSONDecodeError:
            return "Agent malfunctioned and returned invalid JSON."

        if "final_answer" in parsed:
            return parsed["final_answer"]

        if parsed.get("tool") == "search_internal_docs":
            search_query = parsed.get("query")
            print(f"Agent Action: Searching for '{search_query}'...")
            observation = search_internal_docs(search_query)
            history += f"Action: search_internal_docs({search_query})\nObservation: {observation}\n"
        else:
            # Unrecognized action: record it so the model can self-correct next step
            history += f"Invalid action: {output}\n"

    return "Agent exhausted maximum steps without an answer."

Results & Benchmarks

I ran a benchmark comparing our local setup (running on a single machine with an RTX 4090 24GB) against a standard cloud setup (GPT-4o + Pinecone). The goal wasn't to beat GPT-4 on general reasoning, but to measure performance on our specific internal data retrieval tasks.

| Metric | Local (Llama 3 8B + Milvus) | Cloud (GPT-4o + Pinecone) |
| --- | --- | --- |
| Cost / 1K Queries | $0.00 (hardware sunk cost) | ~$14.50 |
| P95 Latency (Single Step) | 850 ms | 1.2 s |
| P95 Latency (Multi-Step) | 2.4 s | 4.1 s |
| Task Completion Rate | 88% | 94% |
| Data Privacy | 100% on-prem | 0% (data leaves network) |

The local setup is faster largely because we bypass network I/O entirely. When an agent has to loop three or four times to formulate an answer, shaving 400 ms off each round-trip adds up quickly. We took a minor hit in task completion rate, but 88% for an air-gapped system is a trade-off I will take every single time.
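The round-trip arithmetic is easy to check with the illustrative numbers above:

```python
steps = 4             # reasoning loops in a multi-step query
saved_per_step = 0.4  # seconds of network round-trip avoided per step

# Total latency recovered just by keeping the loop on-box
print(f"{steps * saved_per_step:.1f}s saved")
```

That lines up with the roughly 1.7 s gap between the multi-step P95 figures in the table.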

Tradeoffs & Gotchas

It wasn't all smooth sailing. Here is what broke during the build:

  1. VRAM Fragmentation: If you try to run an embedding model, an 8B LLM, and Milvus on a single 8GB GPU, you are going to have a bad time. You will hit Out-Of-Memory (OOM) errors constantly. I had to explicitly offload Milvus to CPU/RAM and reserve the GPU strictly for Ollama and the SentenceTransformer.
  2. Context Window Degradation: Llama 3 handles 8k tokens, but its reasoning degrades sharply if you stuff the prompt with too much retrieved context. I initially fetched the top 10 documents, which overwhelmed the model. Dropping top_k to 3 and strictly filtering by similarity score (discarding anything below 0.5) forced the model to rely only on highly relevant context, vastly improving the final answers.
  3. JSON Formatting: Smaller local models are notoriously bad at outputting clean JSON. If you don't use Ollama's format: "json" parameter or a library like Outlines to constrain generation, your agent loop will crash parsing the output.
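For the JSON problem specifically, a cheap defensive layer helps even with format: "json" enabled: validate the output against the two expected shapes instead of trusting it. This validator is my own sketch, not part of Ollama:

```python
import json

def parse_agent_output(raw: str):
    """Classify model output as ('final', text), ('tool', query), or ('error', reason)."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return ("error", "invalid JSON")
    if not isinstance(parsed, dict):
        return ("error", "not a JSON object")
    if "final_answer" in parsed:
        return ("final", parsed["final_answer"])
    # Only one tool exists in this pipeline, so validate name and args together
    if parsed.get("tool") == "search_internal_docs" and "query" in parsed:
        return ("tool", parsed["query"])
    return ("error", "unrecognized action shape")

print(parse_agent_output('{"final_answer": "Use the v2 auth flow."}'))
# → ('final', 'Use the v2 auth flow.')
```

Routing every model response through one function like this means a malformed step degrades into a recoverable error instead of crashing the loop.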

Lessons Learned

The biggest takeaway? You don't need a frontier model to build a useful agent. If you strictly define the tool signatures and maintain a clean prompt loop, an 8B parameter model running locally can punch way above its weight class.

Next up, I am exploring speculative decoding with vLLM to push the generation throughput even further, and I want to introduce a local Code Interpreter tool so the agent can execute Python scripts to plot metrics directly from our internal telemetry. Early tests show we can cut latency down by another 30%—but that's a story for another post.