
Qwen 3.5 in Production: Running with vLLM and Deploying Local Inference on Azure VM

Ahad Khan, Agentic AI Engineer
March 1, 2026
8 min read
Tags: LLM, Qwen, vLLM, Azure, GPU

Why Qwen 3.5 Matters for Production LLM Systems

The gap between research-grade LLM demos and production-grade inference systems is massive. Latency spikes. GPU memory bottlenecks. Throughput collapse under concurrency.

This is where Qwen 3.5 becomes interesting.

Qwen 3.5 offers:

  • Strong multilingual reasoning
  • Competitive coding benchmarks
  • Efficient parameter scaling
  • Excellent compatibility with open inference engines like vLLM

The real power isn’t just in the model — it’s in how you deploy it.

In this guide, we’ll cover:

  1. Running Qwen 3.5 with vLLM for high-throughput serving
  2. Optimizing tensor parallelism and memory
  3. Deploying local inference on an Azure GPU VM
  4. Cost-performance tradeoffs

Let’s break it down like engineers.


System Architecture Overview

The high-level deployment architecture for serving Qwen 3.5 via vLLM on Azure comes down to three key pieces:

  • vLLM Engine — Handles batching, PagedAttention, memory optimization
  • Azure GPU VM — Provides CUDA-enabled GPU hardware
  • OpenAI-Compatible API — Makes integration seamless

What Makes vLLM Special?

Traditional inference servers struggle with:

  • Fragmented KV cache
  • Poor batching under dynamic loads
  • GPU memory waste
  • Low throughput under concurrency

vLLM solves this with PagedAttention, which:

  • Dynamically allocates KV cache blocks
  • Supports continuous batching
  • Maximizes GPU utilization
  • Reduces memory fragmentation

In practice, vLLM commonly improves throughput 2–4x over naive Hugging Face Transformers serving.
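To build intuition for why paging matters, here is a rough sketch of block-based KV cache budgeting. The numbers are illustrative, not vLLM's actual internals: the per-token KV cost depends on the model's layer count and KV heads, and 16 tokens per block is vLLM's default block size.

```python
BLOCK_SIZE = 16  # tokens per KV cache block (vLLM's default)

def kv_blocks_available(free_vram_gb: float, kv_bytes_per_token: int) -> int:
    """Number of fixed-size KV blocks that fit in the free VRAM budget."""
    block_bytes = BLOCK_SIZE * kv_bytes_per_token
    return int(free_vram_gb * 1024**3) // block_bytes

def blocks_needed(seq_len: int) -> int:
    """Blocks one sequence of seq_len tokens occupies (last block may be partial)."""
    return -(-seq_len // BLOCK_SIZE)  # ceiling division

# Example: ~8 GB left for KV cache, ~64 KB of KV per token (GQA 7B-class model, fp16)
total = kv_blocks_available(8, 64 * 1024)
per_request = blocks_needed(2048)  # one 2048-token conversation
print(total, per_request, total // per_request)  # how many such requests fit at once
```

Because blocks are allocated on demand rather than reserved at the full context length, short requests stop starving long ones, which is exactly what makes continuous batching work.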


Step 1: Running Qwen 3.5 with vLLM (Local or Server)

Install Dependencies

bash
pip install vllm
pip install torch --index-url https://download.pytorch.org/whl/cu121

Make sure:

  • CUDA version matches your GPU
  • NVIDIA driver is properly installed

Start vLLM Server with Qwen 3.5

bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --dtype float16 \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90

Key parameters explained:

Parameter                | What It Does          | Recommended Setting
--dtype                  | Precision mode        | float16 or bfloat16
--tensor-parallel-size   | Number of GPUs        | 1 (single-GPU VM)
--max-model-len          | Context length        | 8192 or 32k depending on variant
--gpu-memory-utilization | VRAM allocation limit | 0.85–0.95
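A quick back-of-the-envelope check of whether a configuration fits: weights plus KV cache must stay under the `--gpu-memory-utilization` share of VRAM. The sketch below assumes Qwen2.5-7B-class shape parameters (28 layers, 4 KV heads of head dim 128 under GQA); check your model's config.json for the real values.

```python
def weights_gb(n_params_b: float, bytes_per_param: int = 2) -> float:
    """Model weight footprint in GB (fp16/bf16 = 2 bytes per parameter)."""
    return n_params_b * 1e9 * bytes_per_param / 1024**3

def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_el: int = 2) -> float:
    """KV cache footprint: 2 tensors (K and V) per layer per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_el * tokens / 1024**3

w = weights_gb(7.6)                                           # ~7.6B params, fp16
kv = kv_cache_gb(8192, layers=28, kv_heads=4, head_dim=128)   # one full 8k context
print(round(w, 1), round(kv, 2))
```

Under these assumptions the weights alone eat about 14 GB, which is why a 16 GB T4 only has room for a handful of full-length contexts at a time.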

Once started, the server exposes an OpenAI-compatible API at:

http://localhost:8000/v1/chat/completions

Example Request

python
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [
            {"role": "user", "content": "Explain PagedAttention in simple terms."}
        ]
    }
)

print(response.json())

That’s it. You now have a production-grade inference server.
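For interactive UIs you will usually want streaming. The OpenAI-compatible endpoint emits server-sent events when `"stream": true` is set; here is a minimal client sketch (the `parse_sse_line` helper is ours, not part of any SDK):

```python
import json

def parse_sse_line(line: bytes):
    """Decode one server-sent-event line into a JSON chunk, or None."""
    if not line.startswith(b"data: "):
        return None          # comments / keep-alives
    payload = line[len(b"data: "):]
    if payload.strip() == b"[DONE]":
        return None          # end-of-stream sentinel
    return json.loads(payload)

def stream_chat(prompt: str):
    """Yield generated text deltas as they arrive from the vLLM server."""
    import requests  # network call; requires the server from above to be running
    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",
        json={
            "model": "Qwen/Qwen2.5-7B-Instruct",
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,
        },
        stream=True,
    )
    for line in resp.iter_lines():
        chunk = parse_sse_line(line)
        if chunk:
            yield chunk["choices"][0]["delta"].get("content", "")
```

Streaming turns a multi-second wait into visible progress, which matters more for perceived latency than raw tokens/sec.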


Performance Benchmarks (Real-World Expectations)

Let’s compare deployment options.

Deployment Type | GPU       | Tokens/sec    | Concurrency | Cost Efficiency
HF Transformers | A10       | ~35 tok/s     | Low         | Medium
vLLM            | A10       | ~90 tok/s     | High        | High
vLLM + A100     | A100 80GB | 180–220 tok/s | Very High   | Very High
CPU only        | 32-core   | <5 tok/s      | Very Low    | Poor

vLLM nearly triples throughput under load.
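Throughput numbers only matter relative to cost. A simple conversion from tokens/sec and an hourly VM price to cost per million generated tokens (the prices here are illustrative assumptions; check current Azure rates):

```python
def cost_per_million_tokens(tokens_per_sec: float, usd_per_hour: float) -> float:
    """USD per 1M generated tokens, assuming the GPU is kept saturated."""
    tokens_per_hour = tokens_per_sec * 3600
    return usd_per_hour / tokens_per_hour * 1_000_000

# Illustrative: A10-class at ~$1.0/hr serving 90 tok/s
# vs A100-class at ~$3.7/hr serving 200 tok/s
print(round(cost_per_million_tokens(90, 1.0), 2))
print(round(cost_per_million_tokens(200, 3.7), 2))
```

At full saturation the cheaper GPU can actually win on cost per token; the A100 earns its price through latency and concurrency headroom, not per-token economics.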


Step 2: Deploying Qwen 3.5 on Azure VM (Local Inference)

Now let’s deploy this properly in the cloud.

1️⃣ Choose the Right Azure VM

Recommended GPU VM types:

VM Type                   | GPU      | VRAM  | Good For
Standard_NC4as_T4_v3      | T4       | 16GB  | 7B models
Standard_NC24ads_A100_v4  | A100     | 80GB  | 14B+ models
Standard_ND96amsr_A100_v4 | 8x A100  | 640GB | Large-scale serving

For Qwen 7B, a T4 or A10 is sufficient.


2️⃣ Create Azure VM

bash
az vm create \
  --resource-group myRG \
  --name qwen-vm \
  --image Ubuntu2204 \
  --size Standard_NC4as_T4_v3 \
  --admin-username azureuser \
  --generate-ssh-keys

3️⃣ Install NVIDIA Drivers

bash
sudo apt update
sudo apt install nvidia-driver-535
sudo reboot

Verify:

bash
nvidia-smi

4️⃣ Install CUDA + PyTorch

bash
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install vllm

Optimizing for Local Inference

Enable Swap for Stability

bash
sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

This guards against host-memory OOM kills when loading weights or handling large context windows (it does not help with GPU VRAM exhaustion).


Tune GPU Memory Usage

For 16GB GPUs:

bash
--gpu-memory-utilization 0.85
--max-model-len 4096

For 80GB GPUs:

bash
--gpu-memory-utilization 0.95
--max-model-len 32768
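To keep the server alive across reboots and crashes on the VM, a systemd unit is the simplest option. This is a sketch, assuming vLLM is installed system-wide for `azureuser` and using the 16 GB settings above; adjust the user, Python path, and flags for your environment:

```ini
# /etc/systemd/system/vllm.service
[Unit]
Description=vLLM OpenAI-compatible server for Qwen
After=network-online.target

[Service]
User=azureuser
ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dtype float16 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.85
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Enable it with `sudo systemctl enable --now vllm`, and follow logs with `journalctl -u vllm -f`.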

Production-Ready Deployment Architecture on Azure

On top of the bare vLLM server, add:

  • NGINX for rate limiting
  • Redis for response caching
  • Prometheus + Grafana for monitoring
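As a sketch of the NGINX piece (zone names and limits here are illustrative; tune them to your traffic):

```nginx
# Rate-limit and reverse-proxy the vLLM server
limit_req_zone $binary_remote_addr zone=llm:10m rate=5r/s;

server {
    listen 80;

    location /v1/ {
        limit_req zone=llm burst=10 nodelay;   # absorb short bursts per client IP
        proxy_pass http://127.0.0.1:8000;
        proxy_read_timeout 300s;               # long generations need long timeouts
        proxy_buffering off;                   # required for token streaming
    }
}
```

The long `proxy_read_timeout` and disabled buffering are the two settings people most often forget when putting a proxy in front of a streaming LLM server.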

Cost Considerations

Running Qwen locally on Azure is dramatically cheaper than API usage at scale.

Example:

Scenario                    | Monthly Cost | Notes
OpenAI API (10M tokens/day) | $2,000+      | Usage-based
Azure T4 VM (24/7)          | ~$450        | Fixed cost
Azure A100 VM (24/7)        | ~$2,200      | Enterprise scale

If your workload is steady, local inference wins.

If your workload is bursty, API usage might be cheaper.
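The break-even point is easy to estimate. A sketch using the kind of numbers from the table above (the VM cost and API price per million tokens are assumptions; plug in your own):

```python
def breakeven_tokens_per_month(vm_monthly_usd: float, api_usd_per_1m: float) -> float:
    """Monthly token volume above which a fixed-cost VM beats per-token API pricing."""
    return vm_monthly_usd / api_usd_per_1m * 1_000_000

# T4 VM at ~$450/month vs an API charging ~$6.7 per 1M tokens
tokens = breakeven_tokens_per_month(450, 6.7)
print(f"{tokens / 1e6:.0f}M tokens/month")
```

Under these assumptions the VM pays for itself at roughly 67M tokens per month, a little over 2M tokens per day of sustained traffic.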


Advanced: Quantization for Smaller GPUs

For lower VRAM GPUs, use:

  • AWQ quantization
  • GPTQ
  • 4-bit quantized weights

Example:

bash
--quantization awq

This cuts VRAM usage by roughly 50%, at the cost of a small quality drop (typically 1–3% on benchmarks).

Observability & Monitoring

Track:

  • GPU utilization
  • Memory fragmentation
  • Tokens/sec
  • p95 latency

Use:

bash
watch -n 1 nvidia-smi

And integrate:

  • Prometheus exporters
  • Grafana dashboards
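vLLM's OpenAI-compatible server exposes Prometheus metrics on the same port at `/metrics`, so the scrape configuration is only a few lines (job name and interval below are illustrative):

```yaml
# prometheus.yml — scrape the vLLM server's built-in metrics endpoint
scrape_configs:
  - job_name: vllm
    scrape_interval: 5s
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:8000"]
```

From there, a Grafana dashboard over the exported throughput, queue-depth, and cache-usage series covers most of the observability checklist above.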

Production LLM systems fail not because of models — but because of missing observability.


Common Pitfalls

  ❌ CUDA mismatch
  ❌ Using float32 instead of float16
  ❌ Not limiting max_model_len
  ❌ No swap memory
  ❌ Ignoring KV cache memory growth


Real-World Latency Expectations

For Qwen 7B on T4:

  • First token latency: 400–800ms
  • Generation speed: 80–100 tok/sec
  • Stable concurrency: 15–30 users

On A100:

  • First token latency: 200–400ms
  • Generation speed: 200+ tok/sec
  • 100+ concurrent users possible
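These figures translate directly into user-facing latency: time to first token plus generation length divided by decode speed. A quick sketch using mid-range T4 numbers from above:

```python
def response_latency_s(first_token_ms: float, n_tokens: int, tok_per_sec: float) -> float:
    """End-to-end latency for one response: prefill wait + decode time."""
    return first_token_ms / 1000 + n_tokens / tok_per_sec

# A 300-token answer on a T4: ~600ms to first token, ~90 tok/s decode
print(round(response_latency_s(600, 300, 90), 1), "seconds")
```

The decode term dominates for long answers, which is why streaming the output (and capping max tokens per request) does more for perceived responsiveness than shaving prefill latency.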