My cloud bill hit four figures before I realized something uncomfortable: most of my “AI infrastructure” was just GPU rental and API tax. Worse — sensitive internal data was flowing through third-party endpoints.
Why Go On-Prem?
So I rebuilt everything locally.
No OpenAI APIs. No managed vector DBs. No SaaS inference. Just open-source embeddings, local LLM inference, and an agent loop running entirely on-prem.
Here’s exactly how I architected and deployed a fully local agentic AI stack — embeddings, retrieval, tool use, orchestration — all running inside my own network.
The Problem
Running agentic AI in the cloud is easy.
Running it locally — with:
- private data
- deterministic infra
- predictable cost
- air-gapped deployment
— is not.
The hard parts aren’t the LLM calls. They’re:
- Embedding generation at scale
- Efficient local vector search
- Tool orchestration without SaaS glue
- Latency management without hyperscaler GPUs
And if you get embeddings wrong? Your agent becomes confidently useless.
I wanted:
- 100% local inference
- Open-source embedding model
- Local vector DB
- Multi-step agent loop with tools
- No outbound internet access
- Runs on a single GPU workstation (A100 preferred, 4090 acceptable)
Architecture
Here’s the high-level system design I ended up with:
Core Components
| Layer | Stack |
|---|---|
| LLM Inference | Ollama or vLLM |
| Embeddings | Open-source dense embeddings (bge / Instructor via SentenceTransformers) |
| Vector Store | Milvus (standalone) |
| Agent Framework | LangGraph / Custom loop |
| Tool Execution | Python function router |
| Storage | Local filesystem or Postgres |
Everything runs inside Docker containers on the same host.
No external calls.
Step 1: Local LLM Inference
I evaluated:
- Ollama
- vLLM
- llama.cpp
I landed on vLLM for performance and batching support.
Why vLLM?
- Continuous batching
- Efficient KV cache reuse
- Lower p95 latency under concurrency
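Because vLLM speaks the OpenAI wire format, you don't even need an SDK to talk to it. A minimal stdlib client sketch (the model name and port match the setup below; `chat` assumes the server is already running):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # vLLM's OpenAI-compatible server

def build_payload(prompt: str) -> dict:
    """Assemble an OpenAI-style chat completion request body."""
    return {
        "model": "mistralai/Mistral-7B-Instruct-v0.2",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }

def chat(prompt: str) -> str:
    """POST to the local endpoint and return the assistant's reply."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Any OpenAI-compatible client library works the same way by pointing its base URL at localhost.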
Setup
```bash
docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai \
  --model mistralai/Mistral-7B-Instruct-v0.2
```

Now you get an OpenAI-compatible endpoint at:

```
http://localhost:8000/v1/chat/completions
```

Step 2: Open-Source Embeddings (Milvus-Compatible)
Milvus works best with dense vector embeddings.
I used:
- BAAI/bge-large-en-v1.5
- Instructor-large (for instruction-tuned retrieval)
Both run locally via SentenceTransformers.
Embedding Service
```python
from sentence_transformers import SentenceTransformer
from fastapi import FastAPI

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
app = FastAPI()

@app.post("/embed")
def embed(texts: list[str]):
    vectors = model.encode(texts, normalize_embeddings=True)
    return {"vectors": vectors.tolist()}
```

Run it locally:
```bash
uvicorn embedding_service:app --host 0.0.0.0 --port 9000
```

Step 3: Milvus Vector Database (On-Prem)
Milvus standalone is straightforward:
```bash
docker-compose up -d
```

Minimal docker-compose.yml:

```yaml
version: '3.5'
services:
  milvus:
    image: milvusdb/milvus:v2.3.0
    ports:
      - "19530:19530"
      - "9091:9091"
```

(The official standalone Compose file also defines etcd and MinIO services that Milvus depends on; this is trimmed to the Milvus service itself.)

Create Collection
```python
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection

connections.connect("default", host="localhost", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1024),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=2048)
]

schema = CollectionSchema(fields)
collection = Collection("knowledge_base", schema)
```

Now your embeddings are fully local and indexed.
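Under the hood, retrieval over normalized embeddings reduces to inner-product scoring; Milvus adds the ANN index that makes it fast at scale. A dependency-free sketch of what a top-k query computes (brute force, for intuition only):

```python
import math

def normalize(v):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_k(query, docs, k=2):
    """Brute-force inner-product search over a dict of id -> vector.
    This is the scoring Milvus's IP metric performs, minus the index."""
    q = normalize(query)
    scored = []
    for doc_id, vec in docs.items():
        score = sum(a * b for a, b in zip(q, normalize(vec)))
        scored.append((score, doc_id))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

docs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
print(top_k([1.0, 0.05], docs, k=2))  # → ['a', 'b']
```

This is also why `normalize_embeddings=True` in the embedding service matters: with unit vectors, inner product and cosine similarity are the same number.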
Step 4: Agent Loop (Local Tool Use)
This is where most tutorials fall apart.
An agent isn’t just “LLM + retrieval”. It’s:
- Thought
- Action
- Observation
- Loop
Here’s my minimal orchestration loop:
```python
def agent_loop(user_input, observations=None):
    # Carry prior tool outputs forward instead of replacing the user's question
    observations = observations or []

    context = retrieve_relevant_docs(user_input)

    prompt = build_prompt(user_input, context, observations)

    response = call_local_llm(prompt)

    if needs_tool(response):
        tool_output = execute_tool(parse_tool(response))
        observations.append(tool_output)
        return agent_loop(user_input, observations)

    return response
```

Tool Router
```python
def execute_tool(tool_call):
    if tool_call["name"] == "read_file":
        return read_file(tool_call["args"]["path"])

    if tool_call["name"] == "run_sql":
        return run_sql_query(tool_call["args"]["query"])

    raise ValueError(f"Unknown tool: {tool_call['name']}")
```

Everything runs inside the same private network.
Performance Benchmarks
Here’s what I measured on:
- GPU: RTX 4090
- RAM: 64GB
- Storage: NVMe
Embedding Throughput
| Model | Avg Latency (ms) | Throughput (req/s) |
|---|---|---|
| bge-large-en-v1.5 | 42 | 23 |
| instructor-large | 58 | 17 |
Retrieval + LLM End-to-End
| Workflow | p50 | p95 | Tokens/sec |
|---|---|---|---|
| Simple RAG | 620ms | 1.2s | 78 |
| Agent w/ 1 tool | 1.4s | 2.8s | 74 |
| Agent w/ 3 tools | 3.9s | 6.5s | 69 |
Cost
| Deployment | Monthly Cost |
|---|---|
| Cloud API (previous) | ~$3,800 |
| On-Prem GPU amortized | ~$650 |
| Marginal per request | $0 |
The cost delta alone justified the migration.
What Broke (And What I Fixed)
1. Embedding Dimension Mismatch
Milvus collections are dimension-locked.
Switching embedding models required rebuilding the index.
Lesson: freeze embedding choice early.
2. Chunking Was Killing Retrieval
Naive 1,000-token chunks destroyed semantic coherence.
Switching to sentence-boundary chunking + 20% overlap improved retrieval precision from 61% → 87%.
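A sketch of the sentence-boundary chunker (sizes are in characters here for brevity where the real one counted tokens, and the overlap is a character-level tail; the 20% figure refers to context carried between consecutive chunks):

```python
import re

def chunk_sentences(text, max_chars=500, overlap_ratio=0.2):
    """Greedily pack whole sentences into chunks, then start each new chunk
    with the tail of the previous one so boundary context isn't lost."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) + 1 > max_chars:
            chunks.append(current)
            # Carry roughly the last 20% of the chunk forward as overlap
            tail = current[-int(max_chars * overlap_ratio):]
            current = tail + " " + sent
        else:
            current = (current + " " + sent).strip() if current else sent
    if current:
        chunks.append(current)
    return chunks
```

The point is that no sentence is ever split mid-thought, and every chunk shares a sliver of context with its neighbor.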
3. GPU Memory Fragmentation
Long-running agent loops caused OOM errors.
Fix:
- Reduced max context window
- Enabled tensor parallelism in vLLM
- Restarted inference container nightly
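Concretely, the first two fixes map onto vLLM launch flags. Values here are starting points, not tuned numbers, and `--tensor-parallel-size` above 1 only applies when multiple GPUs are available:

```bash
docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85 \
  --tensor-parallel-size 2   # >1 requires multiple GPUs
```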
Security & Isolation
If you're running this in an enterprise setting:
- Disable outbound traffic at firewall
- Use internal-only Docker network
- Store embeddings on encrypted disk
- Log every tool invocation
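In Compose terms, "internal-only" is a single flag: containers on an `internal` network can reach each other but get no route out of the host. Service and network names here are illustrative:

```yaml
networks:
  ai_net:
    internal: true   # no outbound route; containers still reach each other

services:
  vllm:
    image: vllm/vllm-openai
    networks: [ai_net]
  milvus:
    image: milvusdb/milvus:v2.3.0
    networks: [ai_net]
```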
Agent systems executing tools locally can become dangerous if not sandboxed.
I containerized all tools separately and restricted filesystem access.
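For the filesystem restriction, the key step is resolving the requested path before reading it, so `../` components can't escape the allowed directory. A sketch, where `ALLOWED_ROOT` is an illustrative sandbox root:

```python
from pathlib import Path

ALLOWED_ROOT = Path("/srv/agent-data")  # illustrative sandbox root

def safe_read_file(path: str, root: Path = ALLOWED_ROOT) -> str:
    """Resolve the path first, then verify it stays inside the sandbox root,
    so '../' tricks cannot reach files outside the allowed directory."""
    target = (root / path).resolve()
    if not target.is_relative_to(root.resolve()):
        raise PermissionError(f"Path escapes sandbox: {path}")
    return target.read_text()
```

The same resolve-then-check pattern applies to any tool that accepts a path argument from the model.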
Final System Capabilities
What this on-prem agent can now do:
- Query internal documentation
- Execute SQL on private databases
- Read local files
- Perform multi-step reasoning
- Run fully offline
And it does all of that without leaking a single byte outside the network.
Tradeoffs
Let’s be honest.
You lose:
- Instant scalability
- Zero-maintenance infra
- Latest proprietary models
You gain:
- Data sovereignty
- Predictable cost
- Full control
- Customization depth
If you're running serious internal workflows — legal, healthcare, finance — the control alone is worth it.
Lessons Learned
Biggest takeaway? Local agentic AI is absolutely viable — but only if you treat it like infrastructure, not a demo.
The LLM is the easy part.
Embedding quality, vector indexing, chunk strategy, and tool isolation determine whether your agent is reliable or reckless.
If you're serious about AI inside your org, stop renting intelligence.
Own the stack.