Semantic Intelligence Layer for NVIDIA Dynamo
1. Executive Summary
This proposal outlines a comprehensive integration strategy between vLLM Semantic Router and NVIDIA Dynamo, combining semantic intelligence with high-performance distributed inference. The integration creates a unified inference stack that leverages:
- Semantic Router's intelligent request classification (14 domain categories), domain-aware system prompts, fusion routing (BERT classification + keyword matching + similarity search), security filtering, and Milvus-based semantic caching
- Dynamo's disaggregated serving, KV-aware routing, and multi-tier memory management
The result is a production-grade LLM serving platform with system-level intelligence that balances accuracy (routing each query to the right model, with optimized prompts, for the best quality) and efficiency (maximizing GPU utilization and minimizing latency).
Key Benefits:
- System-level intelligence that optimally balances accuracy and efficiency across the entire inference stack
- Significant cost reduction through intelligent model selection combined with infrastructure optimization
- Substantial latency improvement via semantic caching + KV cache management with adaptive routing strategies
- Enhanced LLM quality with domain-aware system prompts that improve Chain-of-Thought reasoning, token efficiency, and MoE expert matching
- Adaptive routing intelligence with fusion routing: fast path (keyword) to deep analysis (BERT) based on query complexity, maximizing efficiency without sacrificing accuracy
- Multi-signal decision making combining BERT classification, keyword matching, and similarity search for robust and accurate routing
- Holistic content safety with PII detection and jailbreak prevention before inference
- End-to-end observability across semantic and infrastructure layers for continuous system optimization
2. Motivation: Why Semantic Router for Dynamo?
2.1 Dynamo Router Capabilities (Current State)
NVIDIA Dynamo provides a sophisticated KV-aware router optimized for infrastructure-level efficiency:
Capability | Description | Optimization Target |
---|---|---|
KV Cache-Aware Routing | Routes requests to workers with highest KV cache hit rate | TTFT, throughput |
Load-Based Routing | Balances active decoding blocks across workers | ITL, GPU utilization |
Cost Function Optimization | Minimizes potential_prefill_blocks + potential_active_blocks | Computational cost |
Temperature-Based Selection | Probabilistic routing to prevent worker saturation | Load distribution |
Event-Driven Tracking | Real-time cache state via worker events | Routing accuracy |
Key Characteristics:
- Infrastructure-focused: Optimizes GPU memory and compute utilization
- Cache-aware: Leverages existing KV caches to reduce prefill cost
- Load-balanced: Distributes decoding workload across workers
- Performance-oriented: Minimizes TTFT and ITL through smart scheduling
2.2 Semantic Router Capabilities (System Intelligence Layer)
vLLM Semantic Router provides system-level intelligence that operates at the request understanding layer, achieving optimal balance between accuracy and efficiency through intelligent decision-making across 14 domain categories:
Capability | Description | Intelligence Focus |
---|---|---|
Intent Classification | BERT-based categorization (14 categories: math, code, business, law, etc.) | Accuracy: Precise domain understanding |
Model Selection | Routes to best-performing model per category | Accuracy: Task-specific quality optimization |
Domain-Aware System Prompts | Auto-injects category-specific system prompts for prompt engineering | Accuracy: LLM CoT quality, token efficiency, MoE expert matching |
Fusion Routing | Multi-signal routing (keyword + similarity + BERT) | Efficiency: Adaptive latency based on query complexity |
Semantic Caching | Milvus-based vector cache with 0.85+ similarity threshold | Efficiency: Inference cost reduction |
PII Detection | Token-level classification (PERSON, EMAIL, SSN, etc.) | System Intelligence: Privacy protection |
Jailbreak Prevention | Binary classification for prompt injection attacks | System Intelligence: Security enforcement |
Tool Selection | Semantic matching of relevant tools to reduce prompt tokens | Efficiency: Context optimization |
Reasoning Control | Auto-enables reasoning mode for complex queries | Accuracy: Quality-aware mode selection |
System Intelligence Characteristics:
- Holistic Intelligence: Understands query intent, complexity, and security implications across 14 domain categories
- Accuracy-Efficiency Balance: Dynamically selects routing strategy (keyword/similarity/BERT) based on query complexity to maximize accuracy while minimizing latency
- Quality Optimization: Selects models and prompts based on task-specific accuracy requirements
- Intelligent Prompt Engineering: Auto-injects domain-specific system prompts to optimize LLM behavior and output quality
- Proactive Security: Blocks malicious or privacy-violating requests before reaching inference layer
- Cost Intelligence: Avoids expensive models for simple queries while ensuring quality for complex tasks
- Adaptive Routing: Multi-signal fusion routing adapts to query characteristics for optimal accuracy-efficiency tradeoff
2.2.1 14 Domain Categories with System Prompts
Semantic Router classifies queries into 14 specialized categories: math, computer science, physics, chemistry, biology, engineering, economics, business, law, psychology, philosophy, history, health, and other. Each category has an optimized system prompt automatically injected based on query classification.
System Prompt Benefits:
- Improved Chain-of-Thought (CoT): Domain-specific prompts guide LLMs to use appropriate reasoning patterns
  - Math: "Provide step-by-step solutions, show your work clearly"
  - Law: "Provide accurate legal information while clearly stating disclaimers"
  - Business: "Provide practical, actionable advice backed by proven methodologies"
- Token Efficiency: Optimized prompts reduce unnecessary verbosity while maintaining quality
  - Shorter, focused prompts for straightforward categories (business, history)
  - Detailed prompts for complex domains requiring specific methodologies (math, physics)
- MoE Expert Matching: Well-crafted system prompts improve expert selection in Mixture-of-Experts models
  - Domain-specific terminology activates relevant experts
  - Consistent prompt structure improves expert routing accuracy
  - Example: "You are a mathematics expert" → activates math-specialized experts in DeepSeek-V3
- Quality Control: Category-specific disclaimers and ethical guidelines
  - Medical/Legal: Explicit disclaimers about professional consultation
  - Psychology: Emphasis on evidence-based approaches
  - Health: Clear boundaries between information and medical advice
Example System Prompt (Math Category):
You are a mathematics expert. Provide step-by-step solutions, show your
work clearly, and explain mathematical concepts in an understandable way.
Example System Prompt (Business Category):
You are a senior business consultant and strategic advisor with expertise
in corporate strategy, operations management, financial analysis, marketing,
and organizational development. Provide practical, actionable business advice
backed by proven methodologies and industry best practices. Consider market
dynamics, competitive landscape, and stakeholder interests in your recommendations.
2.2.2 Fusion Routing Strategy
Semantic Router implements a multi-signal fusion routing approach that combines three complementary routing methods (as detailed in the Prompt Classification Routing proposal):
1. Keyword-Based Routing (Fast Path)
- Deterministic routing for technology-specific terms (e.g., "kubernetes", "SQL", "React")
- Latency: Minimal (significantly faster than BERT classification)
- Boolean logic support (AND/OR operators)
- Easy to update without model retraining
- Use case: Exact term matching for known patterns
2. Similarity-Based Routing (Semantic Path)
- Embedding similarity for semantic concept detection
- Robust to paraphrasing ("step-by-step" ≈ "explain thoroughly")
- Configurable similarity thresholds (default: 0.75)
- Latency: Low (faster than full BERT classification)
- Use case: Semantic concept matching beyond exact terms
3. BERT Classification (Deep Understanding Path)
- 14-category classification with ModernBERT
- Highest accuracy for complex queries
- Latency: Moderate (comprehensive analysis)
- Use case: Comprehensive intent understanding
Signal Fusion Layer:
- Policy-driven decision making: Combines signals with configurable priority
- Routing logic:
  1. Check keyword rules first (fastest)
  2. If no keyword match, check similarity rules
  3. If no similarity match, use BERT classification (fallback)
- Confidence scoring: Each signal provides confidence score
- Override mechanism: High-confidence signals can override lower-priority signals
- Observability: All signals logged for analysis
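To make the fusion policy concrete, the following minimal Python sketch mirrors the priority-ordered logic above. The function names, the RoutingSignal shape, and the way signals are passed in are illustrative assumptions, not the actual Semantic Router implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class RoutingSignal:
    source: str                  # "keyword", "similarity", or "bert"
    category: Optional[str]      # None when the signal produced no match
    confidence: float

def fuse_signals(query: str,
                 keyword_match: Callable[[str], RoutingSignal],
                 similarity_match: Callable[[str], RoutingSignal],
                 bert_classify: Callable[[str], RoutingSignal],
                 similarity_threshold: float = 0.75) -> RoutingSignal:
    """Priority-ordered fusion: keyword -> similarity -> BERT fallback."""
    # 1. Fast path: deterministic keyword rules.
    kw = keyword_match(query)
    if kw.category is not None:
        return kw
    # 2. Semantic path: embedding similarity rules.
    sim = similarity_match(query)
    if sim.category is not None and sim.confidence >= similarity_threshold:
        return sim
    # 3. Deep path: BERT classification as the fallback signal.
    # All three signals would be logged upstream for observability.
    return bert_classify(query)
```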
System Intelligence Benefits of Fusion Routing:
- Accuracy-Efficiency Balance: Dynamically selects the routing strategy based on query complexity: the fast path (keyword) achieves minimal latency for deterministic patterns, while deep analysis (BERT) ensures maximum accuracy for complex queries
- Adaptive Intelligence: System automatically chooses the most efficient signal that meets accuracy requirements, avoiding unnecessary computation
- Flexibility: Easy to add new routing rules without model retraining, enabling continuous system optimization
- Robustness: Multiple signals provide redundancy and cross-validation, reducing misclassification risk and improving overall system reliability
- Holistic Optimization: Considers both accuracy and efficiency in every routing decision, maximizing system-level intelligence
2.3 Differentiation Analysis: Complementary Strengths
The two systems operate at different layers of the inference stack with minimal overlap:
Semantic Router: Request Intelligence Layer
User Query → [Semantic Understanding] → Model Selection → Request Enrichment
- What: Understands query semantics, intent, and safety
- Why: Routes to the right model for the task
- When: Before request reaches infrastructure
- Optimization: Accuracy, cost, security
Dynamo Router: Infrastructure Efficiency Layer
Enriched Request → [Worker Selection] → KV Cache Optimization → GPU Scheduling
- What: Optimizes worker selection and resource allocation
- Why: Maximizes GPU utilization and minimizes latency
- When: After model selection, during execution
- Optimization: TTFT, ITL, throughput
Integration Value Proposition
Dimension | Semantic Router Alone | Dynamo Router Alone | Integrated System |
---|---|---|---|
Model Selection | ✅ Semantic accuracy (14 categories) | ❌ No model awareness | ✅ Best model for task |
Worker Selection | ❌ No worker awareness | ✅ KV cache optimization | ✅ Optimal worker for model |
Prompt Engineering | ✅ Domain-aware system prompts | ❌ No prompt optimization | ✅ Optimized CoT & MoE matching |
Fusion Routing | ✅ BERT + keyword + similarity fusion | ❌ KV-aware only | ✅ Multi-signal intelligent routing |
Caching | ✅ Semantic similarity (Milvus) | ✅ KV cache reuse | ✅✅ Dual-layer caching |
Security | ✅ PII + jailbreak | ❌ No security layer | ✅ Pre-inference filtering |
Cost Optimization | ✅ Cross-model-level | ✅ Infrastructure-level | ✅✅ End-to-end optimization |
Latency | Adaptive (fusion routing) | Low routing overhead | Parallel execution |
Concrete Example:
Query: "Explain the proof of Fermat's Last Theorem step-by-step"
Semantic Router layer:
1. Fusion routing (3-signal analysis):
   - Keyword match: "theorem", "proof" → math (confidence: 0.8)
   - Similarity search: matches "mathematical proofs" concept (similarity: 0.87)
   - BERT classification: "math" category (confidence: 0.92)
   - Final decision: "math" (multi-signal consensus)
2. Model selection: deepseek-v31 (best for math reasoning)
3. System prompt injection: "You are a mathematics expert. Provide step-by-step solutions, show your work clearly, and explain mathematical concepts in an understandable way."
4. Reasoning mode: ENABLED (entropy-based decision)
5. Security: PASS (no PII, no jailbreak)
6. Semantic cache: MISS (novel query)
7. Enriched request: model=deepseek-v31, reasoning_effort=high, system_prompt=<math expert prompt>

Dynamo Router layer:
1. Worker pool: [worker-1, worker-2, worker-3] (deepseek-v31)
2. KV cache analysis:
   - worker-1: 15 cached blocks (math proofs context)
   - worker-2: 3 cached blocks
   - worker-3: 0 cached blocks
3. Cost calculation:
   - worker-1: 85 prefill + 25 active = 110 (BEST)
   - worker-2: 97 prefill + 20 active = 117
   - worker-3: 100 prefill + 18 active = 118
4. Selection: worker-1 (significant prefill cost reduction)
Result:
- Right model (deepseek-v31 for math reasoning)
- Right worker (worker-1 with relevant KV cache)
- Right mode (reasoning enabled)
- Significantly faster TTFT vs. random worker selection
2.4 Why Integration Matters: Achieving System-Level Intelligence
Challenge 1: Infrastructure without Intelligence
- Dynamo optimizes infrastructure efficiency but lacks semantic understanding
- Cannot distinguish between "2+2=?" and "Prove Fermat's Last Theorem"
- Routes both to the same model pool without understanding complexity or quality requirements
- No ability to select specialized models (math vs. code vs. creative) based on task characteristics
Challenge 2: Intelligence without Infrastructure Awareness
- Semantic Router provides intelligent model selection but lacks infrastructure visibility
- Selects the right model but not the optimal worker
- Cannot leverage KV cache reuse across workers
- No awareness of GPU utilization or worker load for efficiency optimization
Solution: Holistic System Intelligence through Layered Integration
System Intelligence Layer (Semantic Router)
  ↓ [accuracy: model selection, quality optimization, security]
  ↓ [efficiency: semantic cache, adaptive routing, cost control]
Infrastructure Optimization Layer (Dynamo)
  ↓ [efficiency: worker selection, KV cache, GPU scheduling]
  ↓ [accuracy: consistent execution, reliable serving]
Execution Layer (vLLM/SGLang/TRT-LLM)
Result: A holistically intelligent system that optimizes for both accuracy (right model, right prompt, right quality) and efficiency (right worker, right cache, right resource utilization) at every layer.
3. Goals and Non-Goals
3.1 Goals
Primary Goals:
- Seamless Integration: Semantic Router operates as a pre-processing layer before Dynamo's router
- Dual-Layer Caching: Semantic cache (request-level) + KV cache (token-level) work in tandem
- Model-Aware Routing: Dynamo routes to worker pools filtered by Semantic Router's model selection
- Security Enforcement: PII and jailbreak detection before requests reach Dynamo
- Unified Observability: Single trace spans both semantic and infrastructure layers
- Zero Downtime: Hot-reload of semantic routing rules without Dynamo restart
Secondary Goals:
- Performance: Combined latency < 50ms (semantic + infrastructure routing)
- Scalability: Support 10K+ RPS with horizontal scaling
- Flexibility: Support multiple deployment patterns (sidecar, gateway, embedded)
3.2 Non-Goals
- Replacing Dynamo Router: Semantic Router augments, not replaces, Dynamo's KV-aware routing
- Modifying Dynamo Core: Integration via standard APIs, no Dynamo internals changes required
- Unified Configuration: Maintain separate configs for semantic and infrastructure layers
- Synchronous Coupling: Systems can operate independently if needed
4. Proposal Details
4.1 Deep Learning Models
The Semantic Router leverages four specialized deep learning models for intelligent request processing. The system uses a combination of BERT and ModernBERT architectures optimized for different tasks.
4.1.1 Similarity Model (BERT Embeddings)
Purpose: Generate embeddings for semantic similarity comparison
Model: sentence-transformers/all-MiniLM-L12-v2
Key Features:
- Architecture: BERT-based (microsoft/MiniLM-L12-H384-uncased)
- 12 layers, 384 hidden dimensions, 12 attention heads
- Fine-tuned on 1B+ sentence pairs using contrastive learning
- Base model: Standard BERT architecture (not ModernBERT)
- Embedding Dimension: 384
- Use Cases:
- Semantic cache similarity matching (threshold: 0.8)
- Tool selection via semantic search (threshold: 0.2)
- Similarity-based routing for semantic concepts
- Deployment: CPU-optimized for cost efficiency
- Model Size: 33.4M parameters (~120 MB)
Configuration:
bert_model:
  model_id: sentence-transformers/all-MiniLM-L12-v2
  threshold: 0.6
  use_cpu: true
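As an illustration, here is a minimal sketch of how these embeddings can back a similarity check using the sentence-transformers library; the helper function and example strings are our own, and the threshold follows the configuration above.

```python
from sentence_transformers import SentenceTransformer, util

# CPU deployment, matching use_cpu: true in the configuration above.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2", device="cpu")

def is_similar(query: str, candidate: str, threshold: float = 0.6) -> bool:
    """True when cosine similarity of the 384-dim embeddings meets the threshold."""
    embeddings = model.encode([query, candidate], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item() >= threshold

# Paraphrase detection for similarity-based routing.
print(is_similar("What is 2+2?", "Calculate 2 plus 2"))
```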
Why BERT (not ModernBERT)?
- Mature, well-tested model with proven performance
- Optimized for sentence embeddings via contrastive learning
- Smaller model size (120 MB) for faster loading
- ModernBERT (released Dec 2024) is used for classification tasks below
4.1.2 Classification Model (Category Detection)
Purpose: Classify queries into 14 domain categories
Model: models/category_classifier_modernbert-base_model
Key Features:
- Architecture: ModernBERT-base (released Dec 2024)
- Modern replacement for BERT with improved architecture
- 8192 token context length (vs. BERT's 512)
- Rotary Position Embeddings (RoPE) for better long-context handling
- Flash Attention 2 for faster inference
- Fine-tuned on MMLU-Pro dataset for domain classification
- Categories: 14 domains (math, computer_science, physics, chemistry, biology, engineering, economics, business, law, psychology, philosophy, history, health, other)
- Output: Category label + confidence score
- Threshold: 0.6 (configurable)
- Training Data: MMLU-Pro dataset with domain-specific examples
- Model Size: ~149M parameters (ModernBERT-base)
Configuration:
classifier:
  category_model:
    model_id: "models/category_classifier_modernbert-base_model"
    use_modernbert: true
    threshold: 0.6
    use_cpu: true
    category_mapping_path: "models/category_classifier_modernbert-base_model/category_mapping.json"
Model Selection Impact:
- Determines which LLM to route to (e.g., DeepSeek-V3 for math, Qwen3 for business)
- Triggers domain-specific system prompt injection
- Controls reasoning mode activation
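For illustration, a minimal sketch of loading this classifier with Hugging Face transformers and applying the threshold; the exact format of category_mapping.json is an assumption, and the real Semantic Router pipeline may differ.

```python
import json
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_DIR = "models/category_classifier_modernbert-base_model"
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)

# Assumed mapping format: {"0": "math", "1": "computer_science", ...}
with open(f"{MODEL_DIR}/category_mapping.json") as f:
    idx_to_category = {int(k): v for k, v in json.load(f).items()}

def classify(query: str, threshold: float = 0.6) -> tuple[str, float]:
    """Return (category, confidence); fall back to "other" below the threshold."""
    inputs = tokenizer(query, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]
    confidence, idx = probs.max(dim=-1)
    if confidence.item() < threshold:
        return "other", confidence.item()
    return idx_to_category[idx.item()], confidence.item()
```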
4.1.3 PII Detection Model (Privacy Protection)
Purpose: Detect personally identifiable information at token level
Model: models/pii_classifier_modernbert-base_presidio_token_model
Key Features:
- Architecture: ModernBERT-base fine-tuned for token classification
- Token-level sequence labeling (BIO tagging scheme)
- Fine-tuned on Microsoft Presidio dataset
- Optimized for privacy-sensitive entity detection
- PII Types Detected: 17 types, including:
  - Identity: PERSON, AGE, NRP (nationality/religious/political)
  - Contact: EMAIL_ADDRESS, PHONE_NUMBER, STREET_ADDRESS, ZIP_CODE
  - Financial: CREDIT_CARD, IBAN_CODE, US_SSN, US_DRIVER_LICENSE
  - Technical: IP_ADDRESS, DOMAIN_NAME
  - Organizational: ORGANIZATION, GPE (geopolitical entity)
  - Temporal: DATE_TIME
- Granularity: Token-level classification (not just entity-level)
- Threshold: 0.7 (configurable)
- Action: Block requests violating model-specific PII policies
- Model Size: ~149M parameters (ModernBERT-base)
Configuration:
classifier:
  pii_model:
    model_id: "models/pii_classifier_modernbert-base_presidio_token_model"
    use_modernbert: true
    threshold: 0.7
    use_cpu: true
    pii_mapping_path: "models/pii_classifier_modernbert-base_presidio_token_model/pii_type_mapping.json"
Policy Enforcement:
model_config:
  public-model:
    pii_policy:
      allow_by_default: false
      pii_types_allowed: ["PERSON"]  # Only person names allowed
Response Headers (when blocked):
- x-vsr-pii-violation: true
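A minimal sketch of the policy check implied by the configuration above; the function name and signature are our own, not the Semantic Router API.

```python
from typing import Iterable

def violates_pii_policy(detected_types: Iterable[str],
                        allow_by_default: bool,
                        pii_types_allowed: Iterable[str]) -> bool:
    """Block when any detected PII type is not permitted by the model's policy."""
    if allow_by_default:
        return False
    allowed = set(pii_types_allowed)
    return any(pii_type not in allowed for pii_type in detected_types)

# "public-model" policy: only PERSON is allowed.
print(violates_pii_policy(["PERSON"], False, ["PERSON"]))            # False: pass
print(violates_pii_policy(["PERSON", "US_SSN"], False, ["PERSON"]))  # True: block
```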
4.1.4 Jailbreak Detection Model (Security)
Purpose: Detect adversarial prompts and jailbreak attempts
Model: Auto-discovered from the models/ directory
Key Features:
- Architecture: Multiple options with automatic selection
  - LoRA models (preferred): fine-tuned adapters on a BERT/RoBERTa/ModernBERT base
    - lora_jailbreak_classifier_bert_model (Priority 1)
    - lora_jailbreak_classifier_roberta_model (Priority 2)
    - lora_jailbreak_classifier_modernbert_model (Priority 3)
  - Legacy model (fallback): jailbreak_classifier_modernbert-base_model
  - LoRA models offer better accuracy at a smaller size (~10-20 MB adapters)
- Model Discovery: Automatic selection with architecture priority: BERT > RoBERTa > ModernBERT
- Detection Types:
- Prompt injection attacks
- Instruction override attempts
- Adversarial prompts
- Social engineering
- Threshold: 0.7 (configurable)
- Action: Block requests with confidence above threshold
- Model Size:
- LoRA: ~10-20 MB (adapter only) + base model
- Legacy: ~149M parameters (ModernBERT-base)
Configuration:
prompt_guard:
  enabled: true
  use_modernbert: true
  threshold: 0.7
  use_cpu: true
  # model_id and jailbreak_mapping_path are auto-discovered
Response Headers (when blocked):
- x-vsr-jailbreak-blocked: true
- x-vsr-jailbreak-type: {type} (e.g., "prompt_injection")
- x-vsr-jailbreak-confidence: {score} (e.g., "0.950")
4.1.5 Model Performance Summary
Model | Purpose | Architecture | Parameters | Threshold | CPU/GPU |
---|---|---|---|---|---|
Similarity | Semantic matching | BERT (MiniLM-L12) | 33.4M | 0.6-0.8 | CPU |
Classification | Category detection | ModernBERT-base | 149M | 0.6 | CPU |
PII Detection | Privacy protection | ModernBERT-base | 149M | 0.7 | CPU |
Jailbreak | Security filtering | ModernBERT-base/LoRA | 149M + adapters | 0.7 | CPU |
Architecture Comparison:
Feature | BERT (MiniLM) | ModernBERT |
---|---|---|
Release Date | 2020 | December 2024 |
Context Length | 512 tokens | 8192 tokens |
Position Encoding | Absolute | RoPE (Rotary) |
Attention | Standard | Flash Attention 2 |
Use Case | Embeddings | Classification |
Model Size | 33.4M params | 149M params |
Optimization Strategies:
- Parallel Execution: PII and Jailbreak detection run in parallel
- Early Exit: Cache hits bypass all model inference
- Keyword Routing: Fast path for deterministic patterns
- CPU Optimization: All models optimized for CPU inference to reduce cost
- LoRA Adapters: Jailbreak model uses lightweight adapters for faster loading
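As a sketch of the "Parallel Execution" strategy above, the two security classifiers can run concurrently; pii_check and jailbreak_check are stand-ins for the ModernBERT-based detectors, and the function name is ours.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def run_security_checks(query: str,
                        pii_check: Callable[[str], bool],
                        jailbreak_check: Callable[[str], bool]) -> dict:
    """Run PII and jailbreak detection in parallel; block if either trips."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        pii_future = pool.submit(pii_check, query)
        jailbreak_future = pool.submit(jailbreak_check, query)
        return {"pii_violation": pii_future.result(),
                "jailbreak_detected": jailbreak_future.result()}
```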
4.2 Design Principles
- Separation of Concerns: Semantic intelligence and infrastructure optimization remain decoupled
- API-Driven Integration: Use Dynamo's frontend API and worker registration mechanisms
- Fail-Safe Design: Semantic Router failure falls back to Dynamo's default routing
- Observability-First: Every decision (semantic + infrastructure) is traced and logged
- Kubernetes-Native: Designed for cloud-native deployment with CRDs and operators
4.3 System Architecture
Architecture Layers:
1. Semantic Intelligence Layer (Semantic Router)
   - Envoy Gateway with ExtProc for request interception
   - BERT-based classification and security filtering
   - Semantic caching with a Milvus backend
   - Request enrichment with routing metadata
2. Infrastructure Optimization Layer (Dynamo)
   - Dynamo Frontend receives enriched requests
   - KV Router performs model-aware worker selection
   - Planner handles dynamic scaling
   - KVBM manages the multi-tier KV cache
3. Execution Layer (vLLM/SGLang/TRT-LLM)
   - Model-specific worker pools
   - Disaggregated prefill/decode workers
   - Backend-agnostic execution
4. Storage Layer
   - Milvus for the semantic cache
   - System memory for KV cache offload
   - NVMe for cold KV cache storage
4.4 Request Flow
4.4.1 End-to-End Request Processing
Phase 1: Semantic Intelligence (Semantic Router)

Step 1: Request Interception
- Envoy Gateway receives the OpenAI API request
- ExtProc gRPC call to Semantic Router
- Extract the query from the messages array

Step 2: Security Filtering (Parallel Execution)
- PII detection: scan for PERSON, EMAIL, SSN, etc.
- Jailbreak detection: binary classification for prompt injection
- Action: BLOCK if a security violation is detected
- Latency: low

Step 3: Semantic Cache Lookup
- Generate a BERT embedding for the query
- Search Milvus for similar queries (threshold: 0.85)
- Action: return the cached response on a HIT
- Latency: very low (cache hit), low (cache miss)

Step 4: Fusion Routing (Multi-Signal Classification)
- Signal 1: keyword matching (fast path)
- Signal 2: similarity search (semantic concepts)
- Signal 3: BERT classification (deep understanding)
- Entropy-based reasoning decision
- Category: math, code, reasoning, creative, etc.
- Latency: adaptive (keyword: minimal; similarity: low; BERT: moderate)

Step 5: Model Selection
- Look up the category → model scores mapping
- Select the best-performing model for the category
- Example: "math" → deepseek-v31 (score: 0.92)

Step 6: Request Enrichment
- Add headers: X-VSR-Model: deepseek-v31; X-VSR-Category: math; X-VSR-Reasoning: true; X-VSR-Reasoning-Effort: high; X-VSR-Cache-Status: miss
- Modify the request body: update the "model" field to the selected model; inject reasoning parameters if applicable; add selected tools if tool selection is enabled

Phase 1 total latency: low to moderate (parallel execution)

Phase 2: Infrastructure Optimization (Dynamo)

Step 7: Dynamo Frontend Receives Request
- Parse the X-VSR-Model header
- Filter the worker pool to model-specific workers
- Example: only consider workers serving deepseek-v31

Step 8: KV-Aware Worker Selection
- Query KVBM for cached blocks per worker
- Calculate the cost for each worker:
  - potential_prefill_blocks = (input_tokens - overlap_blocks) / block_size
  - potential_active_blocks = current_active + new_request_blocks
  - logit = kv_overlap_weight × prefill + active
- Select the worker with the lowest cost (see the Python sketch after this flow)
- Latency: low

Step 9: Request Forwarding
- Forward to the selected worker (prefill or decode)
- The worker processes the request with vLLM/SGLang/TRT-LLM
- KVBM tracks new KV cache blocks

Phase 2 total latency: low (routing overhead)

Phase 3: Response Processing

Step 10: Worker Response
- vLLM/SGLang generates tokens
- Stream the response back to the Dynamo Frontend

Step 11: Semantic Cache Update
- Semantic Router receives the response via ExtProc
- Store the query embedding + response in Milvus
- TTL: 7200 seconds (configurable)

Step 12: Response to Client
- Envoy Gateway forwards the response
- Add response headers: X-VSR-Model-Used: deepseek-v31; X-VSR-Cache-Hit: false
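To make Step 8 concrete, here is an illustrative Python version of the cost function; the block size, weighting, and exact block arithmetic are assumptions for the sketch, and Dynamo's actual implementation may differ. The example inputs reproduce the worker comparison from section 2.3.

```python
def worker_cost(input_tokens: int, overlap_blocks: int, active_blocks: int,
                block_size: int = 16, kv_overlap_weight: float = 1.0) -> float:
    """Lower is better: prefill blocks still needed plus current decode load."""
    potential_prefill_blocks = max(0, input_tokens // block_size - overlap_blocks)
    potential_active_blocks = active_blocks
    return kv_overlap_weight * potential_prefill_blocks + potential_active_blocks

# (cached overlap blocks, active decode blocks) per worker, as in section 2.3.
workers = {"worker-1": (15, 25), "worker-2": (3, 20), "worker-3": (0, 18)}
best = min(workers, key=lambda w: worker_cost(1600, *workers[w]))
print(best)  # worker-1: 85 prefill + 25 active = 110
```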
4.4.2 Dual-Layer Caching Strategy
The integration leverages two complementary caching layers:
Layer 1: Semantic Cache (Request-Level)
- Granularity: Entire request-response pairs
- Matching: Embedding similarity (cosine distance)
- Threshold: 0.85 (configurable)
- Backend: Milvus (vector database)
- Benefit: Avoids inference entirely for similar queries
- Example: "What is 2+2?" β "Calculate 2 plus 2" (similarity: 0.91)
Layer 2: KV Cache (Token-Level)
- Granularity: Token-level KV cache blocks
- Matching: Exact prefix matching
- Backend: GPU HBM → system memory → NVMe
- Benefit: Reduces prefill cost for partial overlaps
- Example: "Explain quantum computing" → "Explain quantum computing applications" (prefix reuse)
Combined Benefit:
Scenario 1: Exact Semantic Match
Query: "What is the capital of France?"
Semantic Cache: HIT (high similarity with "What's France's capital?")
KV Cache: N/A (inference skipped)
Latency: Very low (cache lookup only)
Cost Reduction: Maximum (no inference)
Scenario 2: Partial Semantic Match + KV Reuse
Query: "Explain the proof of Fermat's Last Theorem in detail"
Semantic Cache: MISS (novel query)
KV Cache: HIT (significant overlap with "Explain Fermat's Last Theorem")
Latency: Reduced (vs. without KV reuse)
Cost Reduction: Significant (prefill cost saved)
Scenario 3: Novel Query
Query: "Design a distributed consensus algorithm for blockchain"
Semantic Cache: MISS
KV Cache: MISS
Latency: Standard (full inference)
Cost Reduction: None (but routed to best model)
4.5 Integration in Kubernetes
4.5.1 Deployment Architecture
The integration follows a layered service architecture in Kubernetes, with clear separation between semantic intelligence and infrastructure optimization:
Kubernetes cluster: llm-inference-stack

Layer 1: Gateway & Semantic Intelligence
- [Envoy Gateway] → (ExtProc gRPC) → [Semantic Router Service]
- Semantic Router Service
  - Pods: 3 replicas (HA)
  - Port: 50051 (gRPC)
  - Functions: BERT classification (14 categories), system prompt injection, PII/jailbreak detection, semantic cache lookup, model selection
  - Dependencies: Milvus Service (semantic cache), ConfigMap (routing rules), PVC (ML models)
- Milvus Service
  - Port: 19530 (gRPC)
  - Vector database for semantic caching
  - Storage: PVC for persistence

↓ (HTTP with headers: X-VSR-Model, X-VSR-Category, etc.)

Layer 2: Infrastructure Optimization (Dynamo)
- Dynamo Frontend Service
  - Pods: 2 replicas (HA)
  - Port: 8000 (HTTP)
  - Functions: parse the X-VSR-Model header, filter the worker pool by model, KV-aware worker selection, request forwarding
  - Components: KV Router, Planner (dynamic scaling), KVBM (KV cache manager)

↓ (worker selection based on model + KV cache state)

Layer 3: Execution (vLLM/SGLang Workers)
- Model pool: deepseek-v31
  - StatefulSet: multiple replicas
  - Service: vllm-deepseek-v31-svc
  - GPU: multi-GPU per pod
  - Features: prefix caching, fp8 KV cache
- Model pool: qwen3
  - StatefulSet: multiple replicas
  - Service: vllm-qwen3-svc
  - GPU: multi-GPU per pod
- Model pool: phi4
  - StatefulSet: multiple replicas
  - Service: vllm-phi4-svc
  - GPU: single/multi-GPU per pod
Key Kubernetes Services:
1. semantic-router-svc (ClusterIP)
   - Exposes the Semantic Router ExtProc on port 50051
   - Used by Envoy Gateway for request processing
   - Selector: app=semantic-router
2. dynamo-frontend-svc (ClusterIP)
   - Exposes the Dynamo Frontend on port 8000
   - Receives enriched requests from Envoy Gateway
   - Selector: app=dynamo-frontend
3. vllm-{model}-svc (Headless Service)
   - One service per model pool
   - Enables direct pod-to-pod communication
   - Used by Dynamo for worker selection
   - Selector: app=vllm-worker, model={model-name}
4. milvus-svc (ClusterIP)
   - Exposes Milvus on port 19530 (gRPC)
   - Used by Semantic Router for semantic caching
   - Vector database for embedding similarity search
   - Selector: app=milvus
4.5.2 Service Communication Flow
End-to-End Request Path:
Step 1: Client Request

POST /v1/chat/completions
Host: llm-gateway.example.com:8080
Content-Type: application/json

{
  "messages": [
    {"role": "user", "content": "Prove Fermat's Last Theorem"}
  ],
  "model": "auto"
}

Step 2: Envoy Gateway (Port 8080)
- Receives the HTTP request
- Invokes ExtProc: semantic-router-svc:50051 (gRPC)
- Sends the request body + headers to Semantic Router

Step 3: Semantic Router Service (ExtProc gRPC)
Processing pipeline:
1. Fusion routing (multi-signal classification)
   - Input: "Prove Fermat's Last Theorem"
   - Keyword matching: no match
   - Similarity search: no strong match
   - BERT classification: category="math", confidence=0.92
   - Decision: use the BERT result (highest confidence)
2. System prompt selection
   - Lookup: categories["math"].system_prompt
   - Prompt: "You are a mathematics expert..."
3. Model selection
   - Lookup: categories["math"].model_scores
   - Selected: deepseek-v31 (score: 0.92, reasoning: true)
4. Security checks
   - PII detection: PASS (no sensitive data)
   - Jailbreak detection: PASS (legitimate query)
5. Semantic cache lookup
   - Query Milvus: embedding similarity search
   - Result: MISS (novel query)
6. Response to Envoy
   - Modified request body: model: "auto" → "deepseek-v31" (overridden); messages: [system prompt injected]
   - Observability headers (optional, added to the response): x-vsr-selected-category: math; x-vsr-selected-reasoning: on; x-vsr-selected-model: deepseek-v31; x-vsr-injected-system-prompt: true

Step 4: Envoy Gateway (Forwarding)
- Receives the enriched request from Semantic Router
- Forwards to: dynamo-frontend-svc:8000
- The request body now has model="deepseek-v31" (overridden from "auto")
- Optional observability headers preserved

Step 5: Dynamo Frontend Service (Port 8000)
Processing pipeline:
1. Request body parsing
   - Read: request.model = "deepseek-v31"
   - Dynamo is UNAWARE that the model was changed by VSR
   - Treats it as a normal request for deepseek-v31
2. Worker pool filtering
   - Query Kubernetes: vllm-deepseek-v31-svc (headless)
   - Available workers: vllm-deepseek-v31-0 (10.244.1.5:8000), vllm-deepseek-v31-1 (10.244.1.6:8000), vllm-deepseek-v31-2 (10.244.1.7:8000), vllm-deepseek-v31-3 (10.244.1.8:8000)
3. KV-aware worker selection
   - Query KVBM for each worker's cache state
   - Calculate the routing score: score = kv_overlap × weight + active_blocks
   - Results: worker-0: score=120 (high KV overlap); worker-1: score=85; worker-2: score=90; worker-3: score=75
   - Selected: worker-0 (10.244.1.5:8000)
4. Request forwarding
   - Forward to: http://10.244.1.5:8000/v1/chat/completions
   - Request body: model="deepseek-v31" (as-is from VSR)

Step 6: vLLM Worker (deepseek-v31-0)
1. Request processing
   - Receive the request: model="deepseek-v31"
   - The system prompt was already injected into messages by VSR
   - The worker is UNAWARE of VSR's involvement
2. Inference execution
   - Model: DeepSeek-V3
   - Messages: [system prompt + user query]
   - Prefix caching: enabled (KV cache reuse)
   - Generate the response with a step-by-step proof
3. Response generation
   - Return: streaming or non-streaming response

Step 7: Response Path (Reverse)
Worker → Dynamo Frontend → Envoy Gateway → Client
- Envoy adds observability headers: X-Envoy-Upstream-Service-Time
- The client receives the complete response with metadata
Key Integration Points:
1. Transparent Model Override (critical design)
   - The user sends: {"model": "auto", "messages": [...]}
   - Semantic Router modifies the request body: model: "auto" → "deepseek-v31"
   - Dynamo receives: {"model": "deepseek-v31", "messages": [...]}
   - Dynamo is completely unaware of VSR's involvement
   - No special headers are needed for model routing
   - Standard OpenAI API compatibility is maintained
2. System Prompt Injection (see the sketch after this list)
   - Semantic Router injects the system prompt into the messages array
   - Example: messages: [{"role": "system", "content": "You are a mathematics expert..."}, {"role": "user", "content": "..."}]
   - The worker receives a pre-enriched request
   - No additional processing is needed by Dynamo or the worker
3. Service Discovery
   - Envoy → Semantic Router: semantic-router-svc.llm-inference-stack.svc.cluster.local:50051 (gRPC ExtProc)
   - Envoy → Dynamo: dynamo-frontend-svc.llm-inference-stack.svc.cluster.local:8000 (HTTP)
   - Dynamo → Workers: vllm-{model}-svc.llm-inference-stack.svc.cluster.local (Headless Service)
   - Semantic Router → Milvus: milvus-svc.llm-inference-stack.svc.cluster.local:19530 (gRPC)
4. Observability (optional headers)
   - x-vsr-selected-category: query classification result (e.g., "math")
   - x-vsr-selected-reasoning: reasoning mode flag ("on" or "off")
   - x-vsr-selected-model: model selected by VSR (e.g., "deepseek-v31")
   - x-vsr-injected-system-prompt: whether a system prompt was injected ("true" or "false")
   - x-vsr-cache-hit: semantic cache status (value: "true" on a cache hit)
   - These headers are for observability only and are not used by Dynamo for routing
   - Dynamo and the workers can ignore them
   - Headers are added only to successful responses (HTTP 200-299) that did not hit the cache
5. Distributed Tracing
   - Full-stack distributed tracing support across VSR → Dynamo → Workers
   - OpenTelemetry-based instrumentation
   - A single trace spans all layers with proper context propagation
   - Reference: PR #322 - Distributed Tracing Support
   - Enables end-to-end latency analysis and bottleneck identification
6. Cache Coordination
   - Semantic cache (Milvus): request-level, checked first by VSR
   - KV cache (Dynamo/vLLM): token-level, managed by Dynamo
   - Independent layers, no coordination needed
   - If the semantic cache hits, the request never reaches Dynamo
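The model override and system prompt injection in points 1 and 2 above amount to a small request-body rewrite. A minimal sketch (the helper name is ours):

```python
def enrich_request(body: dict, selected_model: str, system_prompt: str) -> dict:
    """Override model:"auto" and prepend the domain system prompt.
    Dynamo then sees a standard OpenAI request for the selected model."""
    enriched = dict(body)
    enriched["model"] = selected_model  # e.g., "auto" -> "deepseek-v31"
    messages = list(body.get("messages", []))
    if not (messages and messages[0].get("role") == "system"):
        messages.insert(0, {"role": "system", "content": system_prompt})
    enriched["messages"] = messages
    return enriched

request = {"model": "auto",
           "messages": [{"role": "user", "content": "Prove Fermat's Last Theorem"}]}
enriched = enrich_request(request, "deepseek-v31", "You are a mathematics expert...")
print(enriched["model"], enriched["messages"][0]["role"])  # deepseek-v31 system
```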
4.5.3 Worker Pool Management
Worker Discovery via Kubernetes Services:
Dynamo Frontend discovers workers through Kubernetes Headless Services, which provide direct pod IP addresses:
1. Headless Service Configuration
   - Service type: ClusterIP: None (headless)
   - Selector: app=vllm-worker, model={model-name}
   - DNS returns all pod IPs instead of a load-balanced VIP
2. Worker Registration Flow
   - A vLLM worker pod starts up and registers with the Dynamo Frontend via HTTP API
   - The Dynamo Frontend tracks: worker ID (pod name), model name (deepseek-v31, qwen3, phi4), endpoint (pod IP:8000), capabilities (prefill, decode, max_batch_size), and KV cache state (tracked by KVBM)
3. Model Pool Organization
   - Each model has a dedicated StatefulSet + Headless Service
   - Example: vllm-deepseek-v31-svc → 4 pods serving DeepSeek-V3
   - Dynamo queries the service DNS to get all pod IPs
   - Workers are filtered by the model Semantic Router selected (carried in the request body's model field)
4. Dynamic Scaling
   - The Horizontal Pod Autoscaler (HPA) adjusts replicas based on GPU utilization
   - New pods auto-register with Dynamo on startup
   - Dynamo updates the worker pool in real time
4.6 Implementation Plan
Phase 1: Foundation
Objectives:
- Establish basic integration between Semantic Router and Dynamo
- Implement transparent model override in request body
- Validate end-to-end request flow
Tasks:
1. Semantic Router Enhancements:
   - Implement request body modification: model: "auto" → "selected-model"
   - Add system prompt injection to the messages array
   - Add optional observability headers:
     - x-vsr-selected-category: classification result
     - x-vsr-selected-reasoning: reasoning mode ("on" or "off")
     - x-vsr-selected-model: selected model name
     - x-vsr-injected-system-prompt: system prompt injection status ("true" or "false")
     - x-vsr-cache-hit: cache hit status (only when the cache hits)
   - Ensure OpenAI API compatibility is maintained
2. Dynamo Frontend (No Changes Required):
   - Dynamo receives standard OpenAI API requests
   - The model field already contains the selected model name
   - No awareness of VSR's involvement is needed
   - Existing routing logic works as-is
3. Testing:
   - Unit tests for the model override logic
   - Integration tests for system prompt injection
   - Verify Dynamo routes to the correct model pools
   - Load tests at 1K RPS
Success Criteria:
- ✅ Requests routed to the correct model pools based on the overridden model name
- ✅ System prompts correctly injected into messages
- ✅ Dynamo operates transparently without modifications
- ✅ Latency overhead < 10ms
- ✅ No breaking changes to existing deployments
Phase 2: Dual-Layer Caching
Objectives:
- Integrate semantic cache with KV cache
- Implement cache coordination strategy
- Optimize cache hit rates
Tasks:
1. Cache Integration:
   - Add semantic cache lookup before Dynamo routing
   - Implement cache-miss forwarding to Dynamo
   - Add cache hit metrics and headers
2. Performance Optimization:
   - Parallel cache lookup and classification
   - Milvus connection pooling
   - Cache warming strategies
3. Testing:
   - Cache hit rate benchmarks
   - Latency comparison (cache hit vs. miss)
   - Cache eviction policy validation
Success Criteria:
- ✅ High semantic cache hit rate (production workloads)
- ✅ Low cache-hit latency
- ✅ High combined cache hit rate (semantic + KV)
Phase 3: Observability & Monitoring
Objectives:
- Full-stack distributed tracing across VSR → Dynamo → Workers
- Comprehensive metrics and dashboards
- Alerting and SLO monitoring
Tasks:
1. Distributed Tracing (OpenTelemetry):
   - Trace context propagation from VSR through Dynamo to the workers
   - Span hierarchy (see the sketch after this task list):
     - Root span: Envoy Gateway
       - Child span: Semantic Router (fusion routing, cache, security)
         - Sub-span: BERT classification
         - Sub-span: keyword matching
         - Sub-span: similarity search
         - Sub-span: signal fusion & decision
       - Child span: Dynamo Frontend (routing, worker selection)
       - Child span: vLLM Worker (inference execution)
   - Automatic trace ID injection in headers
   - Support for Jaeger, Tempo, and other OTLP-compatible backends
2. Metrics Collection:
   - Semantic Router metrics:
     - Fusion routing performance: BERT classification latency and accuracy; keyword matching hit rate and latency; similarity search latency; signal fusion decision distribution
     - Semantic cache hit rate (Milvus)
     - PII/jailbreak detection rate
     - Model selection distribution by category
   - Dynamo metrics:
     - KV-aware routing decisions
     - Worker utilization
     - KV cache hit rate
   - End-to-end latency breakdown by component
3. Dashboards:
   - Grafana dashboard for the integrated stack
   - Request flow visualization with trace waterfall
   - Cost and performance analytics
   - Cache efficiency metrics (semantic + KV)
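As referenced in the tracing task above, a minimal sketch of the Semantic Router span hierarchy using the OpenTelemetry Python API; the span names and attributes are illustrative, and the Dynamo and worker child spans would be created in their own services and joined via W3C trace-context header propagation.

```python
from opentelemetry import trace

tracer = trace.get_tracer("semantic-router")

def handle_request(query: str) -> None:
    """Emit the fusion-routing sub-spans under one Semantic Router span."""
    with tracer.start_as_current_span("semantic_router") as span:
        span.set_attribute("vsr.query_length", len(query))
        with tracer.start_as_current_span("keyword_matching"):
            pass  # keyword rule evaluation
        with tracer.start_as_current_span("similarity_search"):
            pass  # embedding similarity lookup
        with tracer.start_as_current_span("bert_classification"):
            pass  # ModernBERT category classification
        with tracer.start_as_current_span("signal_fusion"):
            # Record the final decision on the parent Semantic Router span.
            span.set_attribute("vsr.selected_category", "math")
```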
Success Criteria:
- ✅ A single distributed trace spans all layers (VSR → Dynamo → Worker)
- ✅ Minimal trace sampling overhead
- ✅ Real-time dashboards operational
- ✅ Trace context properly propagated across service boundaries
Phase 4: Production Hardening
Objectives:
- Failure handling and resilience
- Performance optimization
- Production deployment
Tasks:
1. Resilience:
   - Semantic Router failure fallback to Dynamo
   - Circuit breaker for the cache backend
   - Graceful degradation strategies
2. Performance:
   - Latency optimization (target: < 50ms combined)
   - Throughput testing (target: 10K RPS)
   - Resource utilization tuning
3. Documentation:
   - Deployment guide
   - Configuration reference
   - Troubleshooting runbook
Success Criteria:
- ✅ High availability
- ✅ Low P99 latency (routing overhead)
- ✅ 10K+ RPS sustained throughput
6. Security and Privacy Considerations
6.1 PII Detection and Blocking
Threat Model:
- Users may inadvertently include PII in prompts
- PII could be logged, cached, or sent to third-party models
- Compliance requirements (GDPR, HIPAA, CCPA)
Mitigation:
- Token-level PII detection using ModernBERT classifier
- Configurable blocking policies per model
- PII types: PERSON, EMAIL_ADDRESS, PHONE_NUMBER, US_SSN, CREDIT_CARD, STREET_ADDRESS, IP_ADDRESS, IBAN_CODE, US_DRIVER_LICENSE, and more
- Response header when blocked: x-vsr-pii-violation: true
- Audit logging of all PII detections
Example Configuration:
model_config:
  public-model:
    pii_policy:
      allow_by_default: false
      pii_types_allowed: ["PERSON"]  # Only person names allowed
6.2 Jailbreak Prevention (Prompt Guard)
Threat Model:
- Adversarial prompts attempting to bypass safety guardrails
- Prompt injection attacks
- Social engineering attempts
Mitigation:
- Prompt Guard classification for jailbreak detection
- Threshold-based blocking (configurable, e.g., 0.5)
- ModernBERT-based classification model
- Jailbreak type detection with confidence scoring
- Response headers when blocked:
  - x-vsr-jailbreak-blocked: true
  - x-vsr-jailbreak-type: {type} (e.g., "prompt_injection")
  - x-vsr-jailbreak-confidence: {score} (e.g., "0.950")
Example Configuration:
prompt_guard:
  enabled: true
  # model_id is auto-discovered from the models directory:
  # - Legacy: models/jailbreak_classifier_modernbert-base_model
  # - LoRA: models/lora_jailbreak_classifier_bert_model (preferred)
  #         models/lora_jailbreak_classifier_roberta_model
  #         models/lora_jailbreak_classifier_modernbert_model
  threshold: 0.5
  use_cpu: false
  use_modernbert: true
  # jailbreak_mapping_path is auto-discovered from the model directory
Note: The jailbreak classifier uses auto-discovery to find models in the models/ directory. The system prefers LoRA models (BERT > RoBERTa > ModernBERT) over the legacy ModernBERT model for better accuracy.
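A minimal sketch of the auto-discovery order described in the note; the directory names follow section 4.1.4, and the helper function is our own.

```python
from pathlib import Path

# Priority order: LoRA BERT > LoRA RoBERTa > LoRA ModernBERT > legacy ModernBERT.
CANDIDATES = [
    "lora_jailbreak_classifier_bert_model",
    "lora_jailbreak_classifier_roberta_model",
    "lora_jailbreak_classifier_modernbert_model",
    "jailbreak_classifier_modernbert-base_model",  # legacy fallback
]

def discover_jailbreak_model(models_dir: str = "models") -> str:
    """Return the first model directory that exists, in priority order."""
    for name in CANDIDATES:
        path = Path(models_dir) / name
        if path.is_dir():
            return str(path)
    raise FileNotFoundError("no jailbreak classifier found under models/")
```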
6.3 Data Residency and Compliance
Considerations:
- Semantic cache may store user queries
- KV cache contains model activations
- Distributed tracing may log request content
Best Practices:
- Cache Encryption: Encrypt Milvus cache at rest and in transit
- TTL Policies: Automatic expiration of cached data (default: 2 hours)
- Data Locality: Deploy in compliance-approved regions
- Audit Logging: Comprehensive logs for compliance audits
- Right to Deletion: API for purging user data from caches
7. Operational Considerations
7.1 Monitoring and Alerting
Key Metrics:
Metric | Threshold | Alert Severity |
---|---|---|
Semantic Router Latency (P99) | High | Warning |
Dynamo Router Latency (P99) | High | Warning |
Combined Latency (P99) | Very High | Critical |
Semantic Cache Hit Rate | Low | Warning |
KV Cache Hit Rate | Low | Warning |
Security Block Rate | High | Warning |
Error Rate | High | Critical |
GPU Utilization | Too Low or Too High | Warning |
Dashboards:
- Request Flow Dashboard: Visualize request journey through layers
- Cache Performance Dashboard: Hit rates, latency, eviction rates
- Security Dashboard: PII detections, jailbreak blocks, audit logs
- Cost Dashboard: Token usage, model selection, cost per query
7.3 Failure Modes and Recovery
Failure Scenario 1: Semantic Router Unavailable
- Detection: Health check failures, timeout errors
- Impact: No semantic routing, security filtering, or caching
- Recovery:
- Envoy Gateway bypasses ExtProc (fallback mode)
- Requests forwarded directly to Dynamo
- Dynamo performs default routing
- Mitigation: Deploy 3+ replicas with anti-affinity
Failure Scenario 2: Milvus Cache Unavailable
- Detection: Connection errors, timeout
- Impact: No semantic caching (cache misses)
- Recovery:
- Semantic Router continues with in-memory cache
- All requests forwarded to Dynamo
- Performance degradation but no outage
- Mitigation: Milvus cluster deployment for HA
Failure Scenario 3: Dynamo Frontend Unavailable
- Detection: HTTP 503 errors, connection refused
- Impact: No inference possible
- Recovery:
- Envoy Gateway returns 503 to clients
- Kubernetes restarts failed pods
- Load balancer routes to healthy replicas
- Mitigation: Deploy 2+ replicas with readiness probes
Failure Scenario 4: Worker Pool Exhaustion
- Detection: Queue depth alerts, high latency
- Impact: Increased TTFT and ITL
- Recovery:
- Dynamo Planner auto-scales workers
- Semantic Router may route to alternative models
- Requests queued until capacity available
- Mitigation: Autoscaling policies, overprovisioning
8. Future Enhancements
8.1 Advanced Routing Strategies
Multi-Objective Optimization:
- Combine semantic quality, latency, and cost in routing decision
- Pareto-optimal model selection
- User-specified SLO preferences (fast vs. accurate vs. cheap)
Adaptive Routing:
- Learn from user feedback (thumbs up/down)
- A/B testing of model selections
- Reinforcement learning for routing policy
8.2 Cross-Layer Optimization
Semantic-Aware KV Cache Management:
- Prioritize KV cache retention for high-value categories
- Semantic similarity for KV cache eviction decisions
- Cross-request KV cache sharing for similar queries
Predictive Prefetching:
- Predict next query in conversation
- Pre-warm KV cache for likely follow-ups
- Speculative execution for low-latency responses
8.3 Multi-Tenant Support
Tenant Isolation:
- Per-tenant semantic cache namespaces
- Per-tenant model access policies
- Per-tenant cost tracking and quotas
Tenant-Specific Routing:
- Custom model pools per tenant
- Tenant-specific security policies
- Tenant-specific SLOs
9. References
9.1 NVIDIA Dynamo Documentation
9.2 vLLM Semantic Router Documentation
- Semantic Router Overview
- System Architecture
- Kubernetes Deployment
- Distributed Tracing Support (PR #322)
- Milvus-based Semantic Caching
9.3 Related Research
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
- Mooncake: KVCache-centric Disaggregated Architecture for LLM Serving
- RouteLLM: Learning to Route LLMs with Preference Data
- DeepSeek-V3: Technical Report on Mixture-of-Experts Architecture
9.4 Integration Proposals
10. Appendix
10.1 Glossary
Term | Definition |
---|---|
BERT | Bidirectional Encoder Representations from Transformers |
ExtProc | Envoy External Processor (gRPC service for request processing) |
Fusion Routing | Multi-signal routing combining BERT classification, keyword matching, and similarity search |
ITL | Inter-Token Latency (time between generated tokens) |
KV Cache | Key-Value cache storing transformer attention states |
KVBM | KV Block Manager (Dynamo component for cache management) |
Milvus | Open-source vector database for semantic caching and similarity search |
MoE | Mixture-of-Experts (model architecture with specialized expert networks) |
MoM | Mixture-of-Models (routing to different models based on task) |
NIXL | NVIDIA Inference Transfer Library |
OTLP | OpenTelemetry Protocol (for distributed tracing and metrics) |
PII | Personally Identifiable Information |
Prompt Guard | Jailbreak detection system using classification models to identify adversarial prompts |
TTFT | Time To First Token (latency until first token generated) |
10.2 System Prompt Examples
Domain-Aware System Prompts for Key Categories:
The integration leverages 14 specialized system prompts that are automatically injected based on query classification. Here are representative examples:
1. Math Category (Reasoning-Heavy)
You are a mathematics expert. Provide step-by-step solutions, show your
work clearly, and explain mathematical concepts in an understandable way.
- Purpose: Encourage structured reasoning and clear explanations
- Model: DeepSeek-V3 (score: 1.0, reasoning: enabled)
- MoE Impact: Activates mathematical reasoning experts
2. Computer Science Category (Code-Focused)
You are a computer science expert with knowledge of algorithms, data structures,
programming languages, and software engineering. Provide clear, practical solutions
with code examples when helpful.
- Purpose: Balance theory with practical code examples
- Model: Qwen3 (score: 0.89, reasoning: disabled)
- MoE Impact: Activates programming and algorithm experts
3. Business Category (Action-Oriented)
You are a senior business consultant and strategic advisor with expertise in
corporate strategy, operations management, financial analysis, marketing, and
organizational development. Provide practical, actionable business advice backed
by proven methodologies and industry best practices. Consider market dynamics,
competitive landscape, and stakeholder interests in your recommendations.
- Purpose: Emphasize actionable advice and business context
- Model: Phi-4 (score: 0.88, reasoning: disabled)
- MoE Impact: Activates business strategy and analysis experts
4. Law Category (Disclaimer-Aware)
You are a knowledgeable legal expert with comprehensive understanding of legal
principles, case law, statutory interpretation, and legal procedures. Provide
accurate legal information while clearly stating that your responses are for
informational purposes only and do not constitute legal advice.
- Purpose: Ensure accuracy while maintaining ethical boundaries
- Model: Phi-4 (score: 0.75, reasoning: disabled)
- MoE Impact: Activates legal reasoning experts with appropriate disclaimers
5. Health Category (Evidence-Based)
You are a health and medical information expert with knowledge of anatomy,
physiology, diseases, treatments, preventive care, nutrition, and wellness.
Provide accurate, evidence-based health information while emphasizing that
your responses are for educational purposes only and do not replace professional
medical advice.
- Purpose: Balance informativeness with medical ethics
- Model: Phi-4 (score: 0.76, reasoning: disabled)
- MoE Impact: Activates medical knowledge experts with safety guardrails
Complete Category List:
- math, computer science, physics, chemistry, biology, engineering
- economics, business, law, psychology, philosophy, history, health, other
System Prompt Benefits:
- CoT Optimization: Domain-specific reasoning patterns improve output quality
- Token Efficiency: Focused prompts reduce unnecessary verbosity (10-15% token reduction)
- MoE Expert Matching: Specialized terminology activates relevant experts (20-30% improvement in expert selection accuracy)
- Quality Control: Category-specific disclaimers ensure ethical compliance
10.3 API Examples
Request with automatic model selection ("model": "auto"):
curl -X POST http://llm-gateway:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "auto",
"messages": [
{
"role": "user",
"content": "Prove that the square root of 2 is irrational"
}
]
}'
Response with Routing Headers:
HTTP/1.1 200 OK
Content-Type: application/json
x-vsr-selected-model: deepseek-v31
x-vsr-selected-category: math
x-vsr-selected-reasoning: on
x-vsr-injected-system-prompt: true
x-request-id: 7f3e9a2b4c5d6e8f
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1704067200,
"model": "deepseek-v31",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "To prove that β2 is irrational, we'll use proof by contradiction..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 15,
"completion_tokens": 250,
"total_tokens": 265
}
}
Conclusion
This proposal outlines a comprehensive integration strategy between vLLM Semantic Router and NVIDIA Dynamo that combines semantic intelligence with infrastructure optimization. The layered architecture ensures:
- Semantic Correctness: Right model selection based on query understanding
- Infrastructure Efficiency: Optimal worker selection and KV cache utilization
- Security: PII detection and jailbreak prevention before inference
- Performance: Dual-layer caching (semantic + KV) for substantial latency reduction
- Cost Optimization: Significant cost reduction through intelligent model selection and routing
The integration is designed to be non-invasive, modular, and production-ready, with clear implementation phases, comprehensive monitoring, and robust failure handling.
Next Steps:
- Review and approve proposal
- Begin Phase 1 implementation (Foundation)
- Establish benchmark environment
- Iterate based on performance results