
How I Set Up an On-Prem Agentic AI Stack with Open-Source Embeddings and Fully Local Inference

Ahad Khan · Agentic AI Engineer
March 3, 2026 · 7 min read
Tags: LLM, Agentic AI, On-Prem

My cloud bill hit four figures before I realized something uncomfortable: most of my “AI infrastructure” was just GPU rental and API tax. Worse — sensitive internal data was flowing through third-party endpoints.

Why Go On-Prem?

So I rebuilt everything locally.

No OpenAI APIs. No managed vector DBs. No SaaS inference. Just open-source embeddings, local LLM inference, and an agent loop running entirely on-prem.

Here’s exactly how I architected and deployed a fully local agentic AI stack — embeddings, retrieval, tool use, orchestration — all running inside my own network.


The Problem

Running agentic AI in the cloud is easy.

Running it locally — with:

  • private data
  • deterministic infra
  • predictable cost
  • air-gapped deployment

— is not.

The hard parts aren’t the LLM calls. They’re:

  1. Embedding generation at scale
  2. Efficient local vector search
  3. Tool orchestration without SaaS glue
  4. Latency management without hyperscaler GPUs

And if you get embeddings wrong? Your agent becomes confidently useless.

I wanted:

  • 100% local inference
  • Open-source embedding model
  • Local vector DB
  • Multi-step agent loop with tools
  • No outbound internet access
  • Runs on a single GPU workstation (A100 preferred, 4090 acceptable)

Architecture

Here’s the high-level system design I ended up with:

[Architecture diagram: vLLM inference, embedding service, Milvus, and the agent loop, all on a single host]

Core Components

| Layer | Stack |
| --- | --- |
| LLM Inference | Ollama or vLLM |
| Embeddings | Open-source embedding model (dense vectors for Milvus) |
| Vector Store | Milvus (standalone) |
| Agent Framework | LangGraph / custom loop |
| Tool Execution | Python function router |
| Storage | Local filesystem or Postgres |

Everything runs inside Docker containers on the same host.

No external calls.


Step 1: Local LLM Inference

I evaluated:

  • Ollama
  • vLLM
  • llama.cpp

I landed on vLLM for performance and batching support.

Why vLLM?

  • Continuous batching
  • Efficient KV cache reuse
  • Lower p95 latency under concurrency

Setup

```bash
docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai \
  --model mistralai/Mistral-7B-Instruct-v0.2
```

Now you get an OpenAI-compatible endpoint at:

```
http://localhost:8000/v1/chat/completions
```
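Any OpenAI-style client can talk to it. Here's a minimal, dependency-free sketch of a caller; the URL and model name assume the `docker run` above, and `build_chat_payload` / `chat` are my own helper names, not part of vLLM:

```python
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/chat/completions"  # local vLLM endpoint from above

def build_chat_payload(prompt, model="mistralai/Mistral-7B-Instruct-v0.2", temperature=0.2):
    # vLLM serves the OpenAI chat-completions schema, so the payload is identical
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def chat(prompt):
    # POST the payload and pull the assistant message out of the response
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Because the schema is OpenAI-compatible, swapping a cloud endpoint for this one is a one-line URL change in most client code.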

Step 2: Open-Source Embeddings (Milvus-Compatible)

Milvus indexes dense float vectors, so any open-source embedding model with a fixed output dimension works.

I used:

  • BAAI/bge-large-en-v1.5
  • Instructor-large (for instruction-tuned retrieval)

Both run locally via SentenceTransformers.

Embedding Service

embedding_service.py:

```python
from fastapi import FastAPI
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
app = FastAPI()

@app.post("/embed")
def embed(texts: list[str]):
    # Normalized vectors let the index use inner product as cosine similarity
    vectors = model.encode(texts, normalize_embeddings=True)
    return {"vectors": vectors.tolist()}
```

Run it locally:

```bash
uvicorn embedding_service:app --host 0.0.0.0 --port 9000
```
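Why `normalize_embeddings=True`? Once vectors are unit length, cosine similarity reduces to a plain dot product, so the vector index can use the cheaper inner-product metric. A quick stdlib-only sanity check of that identity:

```python
import math
import random

def normalize(v):
    # Scale a vector to unit length
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = [random.random() for _ in range(1024)]
b = [random.random() for _ in range(1024)]

cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Dot product of normalized vectors equals cosine similarity of the originals
assert abs(dot(normalize(a), normalize(b)) - cosine) < 1e-9
```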

Step 3: Milvus Vector Database (On-Prem)

Milvus standalone is straightforward:

```bash
docker-compose up -d
```

Minimal docker-compose.yml (trimmed to the Milvus container; the official standalone compose file also defines the etcd and minio services Milvus depends on):

```yaml
version: '3.5'
services:
  milvus:
    image: milvusdb/milvus:v2.3.0
    command: ["milvus", "run", "standalone"]
    ports:
      - "19530:19530"
      - "9091:9091"
```

Create Collection

milvus_setup.py:

```python
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection

connections.connect("default", host="localhost", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    # bge-large-en-v1.5 produces 1024-dimensional vectors
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1024),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=2048),
]

schema = CollectionSchema(fields)
collection = Collection("knowledge_base", schema)

# Index the vector field so searches don't brute-force scan the collection
collection.create_index(
    "embedding",
    {"index_type": "HNSW", "metric_type": "IP", "params": {"M": 16, "efConstruction": 200}},
)
```

Now your embeddings are fully local and indexed.


Step 4: Agent Loop (Local Tool Use)

This is where most tutorials fall apart.

An agent isn’t just “LLM + retrieval”. It’s:

  • Thought
  • Action
  • Observation
  • Loop

Here’s my minimal orchestration loop:

agent.py:

```python
def agent_loop(user_input, max_steps=5):
    context = retrieve_relevant_docs(user_input)
    observations = []

    for _ in range(max_steps):
        prompt = build_prompt(user_input, context, observations)
        response = call_local_llm(prompt)

        if not needs_tool(response):
            return response

        # Feed the tool result back as an observation instead of recursing,
        # so the original question is never lost and the loop is bounded
        observations.append(execute_tool(parse_tool(response)))

    return response
```
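The loop leans on `needs_tool` and `parse_tool`. Neither is magic. Here's one minimal convention; the `<tool>` tag format is my own choice, prompted into the model, not a vLLM or framework feature:

```python
import json
import re

# Matches a tool call the model was instructed to emit, e.g.
# <tool>{"name": "read_file", "args": {"path": "..."}}</tool>
TOOL_RE = re.compile(r"<tool>(.*?)</tool>", re.DOTALL)

def needs_tool(response: str) -> bool:
    return TOOL_RE.search(response) is not None

def parse_tool(response: str) -> dict:
    # Extract and decode the first tool-call block
    match = TOOL_RE.search(response)
    return json.loads(match.group(1))
```

Whatever convention you pick, the prompt and the parser have to agree exactly; drift between them is a common source of "the agent stopped using tools" bugs.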

Tool Router

tools.py:

```python
def execute_tool(tool_call):
    if tool_call["name"] == "read_file":
        return read_file(tool_call["args"]["path"])

    if tool_call["name"] == "run_sql":
        return run_sql_query(tool_call["args"]["query"])

    raise ValueError(f"Unknown tool: {tool_call['name']}")
```

Everything runs inside the same private network.


Performance Benchmarks

Here’s what I measured on:

  • GPU: RTX 4090
  • RAM: 64GB
  • Storage: NVMe

Embedding Throughput

| Model | Avg Latency (ms) | Throughput (req/s) |
| --- | --- | --- |
| bge-large-en-v1.5 | 42 | 23 |
| instructor-large | 58 | 17 |

Retrieval + LLM End-to-End

| Workflow | p50 | p95 | Tokens/sec |
| --- | --- | --- | --- |
| Simple RAG | 620 ms | 1.2 s | 78 |
| Agent w/ 1 tool | 1.4 s | 2.8 s | 74 |
| Agent w/ 3 tools | 3.9 s | 6.5 s | 69 |

Cost

| Deployment | Monthly Cost |
| --- | --- |
| Cloud API (previous) | ~$3,800 |
| On-prem GPU (amortized) | ~$650 |
| Marginal per request | $0 |

The cost delta alone justified the migration.
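For context, the ~$650/month figure is what a simple amortization model produces. The inputs below are illustrative assumptions, not my actual invoice; plug in your own hardware price and power costs:

```python
# Illustrative amortization model, NOT exact figures
hardware_cost = 12_000        # GPU workstation, up-front (hypothetical)
amortization_months = 24      # write-off period (hypothetical)
power_and_maintenance = 150   # per month (hypothetical)

monthly = hardware_cost / amortization_months + power_and_maintenance
assert monthly == 650.0

# Months until the hardware pays for itself vs. the old cloud bill
cloud_monthly = 3_800
payback_months = hardware_cost / (cloud_monthly - monthly)
```

Under these assumptions the workstation pays for itself in under four months, which is why the marginal-cost-per-request row matters more than the sticker price.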


What Broke (And What I Fixed)

1. Embedding Dimension Mismatch

Milvus collections are dimension-locked.

Switching embedding models required rebuilding the index.

Lesson: freeze embedding choice early.


2. Chunking Was Killing Retrieval

Naive 1,000-token chunks destroyed semantic coherence.

Switching to sentence-boundary chunking + 20% overlap improved retrieval precision from 61% → 87%.
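If you want to reproduce the chunker, here's a sketch of the approach. Character counts stand in for tokens to keep it dependency-free, and `max_chars` plus the 20% overlap budget are illustrative knobs, not my exact production values:

```python
import re

def sentence_chunks(text, max_chars=800, overlap_frac=0.2):
    # Split on sentence boundaries, pack sentences into chunks,
    # and carry roughly overlap_frac of each chunk's tail into the next one
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, size = [], [], 0

    for sent in sentences:
        if size + len(sent) > max_chars and current:
            chunks.append(" ".join(current))
            # Keep trailing sentences within the overlap budget
            keep, kept = [], 0
            for s in reversed(current):
                if kept + len(s) > max_chars * overlap_frac:
                    break
                keep.insert(0, s)
                kept += len(s)
            current, size = keep, kept
        current.append(sent)
        size += len(sent)

    if current:
        chunks.append(" ".join(current))
    return chunks
```

Respecting sentence boundaries means no chunk ever starts mid-thought, and the overlap gives the retriever a second chance at facts that straddle a boundary.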


3. GPU Memory Fragmentation

Long-running agent loops caused OOM errors.

Fix:

  • Reduced max context window
  • Enabled tensor parallelism in vLLM
  • Restarted inference container nightly

Security & Isolation

If you're running this in an enterprise setting:

  • Disable outbound traffic at firewall
  • Use internal-only Docker network
  • Store embeddings on encrypted disk
  • Log every tool invocation

Agent systems executing tools locally can become dangerous if not sandboxed.

I containerized all tools separately and restricted filesystem access.
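Concretely, even a containerized `read_file` tool should refuse paths outside its mount. A minimal sketch; the allowed root is hypothetical, and note the symlink resolution happens before the containment check:

```python
import os

ALLOWED_ROOT = "/data/knowledge"  # hypothetical directory mounted into the tool container

def safe_read_file(path):
    # Resolve symlinks and ".." components FIRST, then check containment,
    # so "/data/knowledge/../../etc/passwd" can't escape the sandbox
    real = os.path.realpath(path)
    if not real.startswith(ALLOWED_ROOT + os.sep):
        raise PermissionError(f"path outside sandbox: {path}")
    with open(real) as f:
        return f.read()
```

Pair this with the firewall and logging rules above: a denied path should show up in the tool-invocation log, because a model probing outside its sandbox is a signal worth alerting on.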


Final System Capabilities

What this on-prem agent can now do:

  • Query internal documentation
  • Execute SQL on private databases
  • Read local files
  • Perform multi-step reasoning
  • Run fully offline

And it does all of that without leaking a single byte outside the network.


Tradeoffs

Let’s be honest.

You lose:

  • Instant scalability
  • Zero-maintenance infra
  • Latest proprietary models

You gain:

  • Data sovereignty
  • Predictable cost
  • Full control
  • Customization depth

If you're running serious internal workflows — legal, healthcare, finance — the control alone is worth it.


Lessons Learned

Biggest takeaway? Local agentic AI is absolutely viable — but only if you treat it like infrastructure, not a demo.

The LLM is the easy part.

Embedding quality, vector indexing, chunk strategy, and tool isolation determine whether your agent is reliable or reckless.

If you're serious about AI inside your org, stop renting intelligence.

Own the stack.