My cloud bill hit four figures before I realized something uncomfortable: most of my “AI infrastructure” was just GPU rental and API tax. Worse — sensitive internal data was flowing through third-party endpoints.
Why Go On-Prem?
So I rebuilt everything locally.
No OpenAI APIs. No managed vector DBs. No SaaS inference. Just open-source embeddings, local LLM inference, and an agent loop running entirely on-prem.
Here’s exactly how I architected and deployed a fully local agentic AI stack — embeddings, retrieval, tool use, orchestration — all running inside my own network.
The Problem
Running agentic AI in the cloud is easy.
Running it locally — with:
- private data
- deterministic infra
- predictable cost
- air-gapped deployment
— is not.
The hard parts aren’t the LLM calls. They’re:
- Embedding generation at scale
- Efficient local vector search
- Tool orchestration without SaaS glue
- Latency management without hyperscaler GPUs
And if you get embeddings wrong? Your agent becomes confidently useless.
I wanted:
- 100% local inference
- Open-source embedding model
- Local vector DB
- Multi-step agent loop with tools
- No outbound internet access
- Runs on a single GPU workstation (A100 preferred, 4090 acceptable)
Architecture
Here’s the high-level system design I ended up with:
Core Components
| Layer | Stack |
|---|---|
| LLM Inference | Ollama or vLLM |
| Embeddings | Open-source dense embeddings (bge / Instructor via SentenceTransformers) |
| Vector Store | Milvus (standalone) |
| Agent Framework | LangGraph / Custom loop |
| Tool Execution | Python function router |
| Storage | Local filesystem or Postgres |
Everything runs inside Docker containers on the same host.
No external calls.
Step 1: Local LLM Inference
I evaluated:
- Ollama
- vLLM
- llama.cpp
I landed on vLLM for performance and batching support.
Why vLLM?
- Continuous batching
- Efficient KV cache reuse
- Lower p95 latency under concurrency
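Because vLLM speaks the OpenAI wire format, you don't even need an SDK to talk to it. A minimal stdlib client sketch (the model name and port match the setup below; `chat` assumes the server is already running):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # vLLM's OpenAI-compatible server

def build_payload(prompt: str) -> dict:
    """Assemble an OpenAI-style chat completion request body."""
    return {
        "model": "mistralai/Mistral-7B-Instruct-v0.2",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }

def chat(prompt: str) -> str:
    """POST to the local endpoint and return the assistant's reply."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Any OpenAI-compatible client library works the same way by pointing its base URL at localhost.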
Setup
```bash
docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai \
  --model mistralai/Mistral-7B-Instruct-v0.2
```

Now you get an OpenAI-compatible endpoint at:

```
http://localhost:8000/v1/chat/completions
```

Step 2: Open-Source Embeddings (Milvus-Compatible)
Milvus works best with dense vector embeddings.
I used:
- BAAI/bge-large-en-v1.5
- Instructor-large (for instruction-tuned retrieval)
Both run locally via SentenceTransformers.
Embedding Service
```python
from sentence_transformers import SentenceTransformer
from fastapi import FastAPI

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
app = FastAPI()

@app.post("/embed")
def embed(texts: list[str]):
    vectors = model.encode(texts, normalize_embeddings=True)
    return {"vectors": vectors.tolist()}
```

Run it locally:
```bash
uvicorn embedding_service:app --host 0.0.0.0 --port 9000
```

Step 3: Milvus Vector Database (On-Prem)
Milvus standalone is straightforward:
```bash
docker-compose up -d
```

Minimal docker-compose.yml:

```yaml
version: '3.5'
services:
  milvus:
    image: milvusdb/milvus:v2.3.0
    ports:
      - "19530:19530"
      - "9091:9091"
```

(The official standalone Compose file also defines etcd and MinIO services that Milvus depends on; this is trimmed to the Milvus service itself.)

Create Collection
```python
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection

connections.connect("default", host="localhost", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1024),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=2048)
]

schema = CollectionSchema(fields)
collection = Collection("knowledge_base", schema)
```

Now your embeddings are fully local and indexed.
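Under the hood, retrieval over normalized embeddings reduces to inner-product scoring; Milvus adds the ANN index that makes it fast at scale. A dependency-free sketch of what a top-k query computes (brute force, for intuition only):

```python
import math

def normalize(v):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_k(query, docs, k=2):
    """Brute-force inner-product search over a dict of id -> vector.
    This is the scoring Milvus's IP metric performs, minus the index."""
    q = normalize(query)
    scored = []
    for doc_id, vec in docs.items():
        score = sum(a * b for a, b in zip(q, normalize(vec)))
        scored.append((score, doc_id))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

docs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
print(top_k([1.0, 0.05], docs, k=2))  # → ['a', 'b']
```

This is also why `normalize_embeddings=True` in the embedding service matters: with unit vectors, inner product and cosine similarity are the same number.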
Step 4: Agent Loop (Local Tool Use)
This is where most tutorials fall apart.
An agent isn’t just “LLM + retrieval”. It’s:
- Thought
- Action
- Observation
- Loop
Here’s my minimal orchestration loop:
```python
def agent_loop(user_input, observations=None):
    # Carry prior tool outputs forward instead of replacing the user's question
    observations = observations or []

    context = retrieve_relevant_docs(user_input)

    prompt = build_prompt(user_input, context, observations)

    response = call_local_llm(prompt)

    if needs_tool(response):
        tool_output = execute_tool(parse_tool(response))
        observations.append(tool_output)
        return agent_loop(user_input, observations)

    return response
```

Tool Router
```python
def execute_tool(tool_call):
    if tool_call["name"] == "read_file":
        return read_file(tool_call["args"]["path"])

    if tool_call["name"] == "run_sql":
        return run_sql_query(tool_call["args"]["query"])

    raise ValueError(f"Unknown tool: {tool_call['name']}")
```

Everything runs inside the same private network.
Performance Benchmarks
Here’s what I measured on:
- GPU: RTX 4090
- RAM: 64GB
- Storage: NVMe
Embedding Throughput
| Model | Avg Latency (ms) | Throughput (req/s) |
|---|---|---|
| bge-large-en-v1.5 | 42 | 23 |
| instructor-large | 58 | 17 |
Retrieval + LLM End-to-End
| Workflow | p50 | p95 | Tokens/sec |
|---|---|---|---|
| Simple RAG | 620ms | 1.2s | 78 |
| Agent w/ 1 tool | 1.4s | 2.8s | 74 |
| Agent w/ 3 tools | 3.9s | 6.5s | 69 |
Cost
| Deployment | Monthly Cost |
|---|---|
| Cloud API (previous) | ~$3,800 |
| On-Prem GPU amortized | ~$650 |
| Marginal per request | $0 |
The cost delta alone justified the migration.
What Broke (And What I Fixed)
1. Embedding Dimension Mismatch
Milvus collections are dimension-locked.
Switching embedding models required rebuilding the index.
Lesson: freeze embedding choice early.
2. Chunking Was Killing Retrieval
Naive 1,000-token chunks destroyed semantic coherence.
Switching to sentence-boundary chunking + 20% overlap improved retrieval precision from 61% → 87%.
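A sketch of the sentence-boundary chunker (sizes are in characters here for brevity where the real one counted tokens, and the overlap is a character-level tail; the 20% figure refers to context carried between consecutive chunks):

```python
import re

def chunk_sentences(text, max_chars=500, overlap_ratio=0.2):
    """Greedily pack whole sentences into chunks, then start each new chunk
    with the tail of the previous one so boundary context isn't lost."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) + 1 > max_chars:
            chunks.append(current)
            # Carry roughly the last 20% of the chunk forward as overlap
            tail = current[-int(max_chars * overlap_ratio):]
            current = tail + " " + sent
        else:
            current = (current + " " + sent).strip() if current else sent
    if current:
        chunks.append(current)
    return chunks
```

The point is that no sentence is ever split mid-thought, and every chunk shares a sliver of context with its neighbor.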
3. GPU Memory Fragmentation
Long-running agent loops caused OOM errors.
Fix:
- Reduced max context window
- Enabled tensor parallelism in vLLM
- Restarted inference container nightly
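Concretely, the first two fixes map onto vLLM launch flags. Values here are starting points, not tuned numbers, and `--tensor-parallel-size` above 1 only applies when multiple GPUs are available:

```bash
docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85 \
  --tensor-parallel-size 2   # >1 requires multiple GPUs
```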
Security & Isolation
If you're running this in an enterprise setting:
- Disable outbound traffic at firewall
- Use internal-only Docker network
- Store embeddings on encrypted disk
- Log every tool invocation
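In Compose terms, "internal-only" is a single flag: containers on an `internal` network can reach each other but get no route out of the host. Service and network names here are illustrative:

```yaml
networks:
  ai_net:
    internal: true   # no outbound route; containers still reach each other

services:
  vllm:
    image: vllm/vllm-openai
    networks: [ai_net]
  milvus:
    image: milvusdb/milvus:v2.3.0
    networks: [ai_net]
```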
Agent systems executing tools locally can become dangerous if not sandboxed.
I containerized all tools separately and restricted filesystem access.
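For the filesystem restriction, the key step is resolving the requested path before reading it, so `../` components can't escape the allowed directory. A sketch, where `ALLOWED_ROOT` is an illustrative sandbox root:

```python
from pathlib import Path

ALLOWED_ROOT = Path("/srv/agent-data")  # illustrative sandbox root

def safe_read_file(path: str, root: Path = ALLOWED_ROOT) -> str:
    """Resolve the path first, then verify it stays inside the sandbox root,
    so '../' tricks cannot reach files outside the allowed directory."""
    target = (root / path).resolve()
    if not target.is_relative_to(root.resolve()):
        raise PermissionError(f"Path escapes sandbox: {path}")
    return target.read_text()
```

The same resolve-then-check pattern applies to any tool that accepts a path argument from the model.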
Final System Capabilities
What this on-prem agent can now do:
- Query internal documentation
- Execute SQL on private databases
- Read local files
- Perform multi-step reasoning
- Run fully offline
And it does all of that without leaking a single byte outside the network.
Tradeoffs
Let’s be honest.
You lose:
- Instant scalability
- Zero-maintenance infra
- Latest proprietary models
You gain:
- Data sovereignty
- Predictable cost
- Full control
- Customization depth
If you're running serious internal workflows — legal, healthcare, finance — the control alone is worth it.
Lessons Learned
Biggest takeaway? Local agentic AI is absolutely viable — but only if you treat it like infrastructure, not a demo.
The LLM is the easy part.
Embedding quality, vector indexing, chunk strategy, and tool isolation determine whether your agent is reliable or reckless.
If you're serious about AI inside your org, stop renting intelligence.
Own the stack.