[Disclaimer: I wrote this blog post solely for my own understanding. If you find any mistakes that need to be addressed, please feel free to reach out to me.]

Building a High-Performance ONNX Inference Engine for Qwen LLMs: From PyTorch to C++ with GPU Acceleration

A deep dive into exporting Qwen language models to ONNX and building a production-ready C++ inference engine

TL;DR

I'm building a high-performance ONNX inference engine for Qwen language models in C++ with GPU acceleration. Currently achieving ~16.84 tokens/sec on GPU, with significant optimization opportunities identified. This article covers: dealing with HuggingFace Transformers' dynamic cache limitations, implementing a custom forward pass, exporting to ONNX, fixing ONNX Runtime API issues, and GPU inference with cuDNN 9. Performance optimization is actively underway.

GitHub Repository: llm-inference

⚠️ Current Status: The C++ inference engine is functionally complete but performance is lower than expected (16.84 tokens/sec vs Python's 21.78 tokens/sec). I'm actively working on optimization. See "Performance Benchmarking" and "Next Steps" sections for details on where improvements will focus.

Attribution: The custom forward pass implementation and ONNX export approach were adapted from DakeQQ/Native-LLM-for-Android, an excellent project for running LLMs on mobile devices. I've extended and modified these techniques for server-side GPU inference.

The Challenge: Running LLMs Efficiently

Large Language Models (LLMs) are powerful but resource-intensive. While Python frameworks like HuggingFace Transformers and PyTorch make it easy to prototype and train models, production inference deployments face significant challenges.

The PyTorch Overhead Problem

PyTorch is the go-to framework for training LLMs, and for good reasons - it's flexible, easy to debug, and has excellent GPU support. But here's what's happening under the hood:

PyTorch's Architecture:

Python frontend: Your code that defines models, training loops, etc.
pybind11 bindings: Translates Python calls to C++
libtorch (C++): The actual computation engine
- ATen tensor library
- Autograd for gradients
- CUDA/cuDNN kernels for GPU acceleration

Key insight: Even though PyTorch uses C++ and CUDA for the heavy lifting, Python still orchestrates everything - deciding which operations to run, when, and managing the model structure.

This Python orchestration layer introduces overhead:

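🔴 Interpreter: CPython executes bytecode at runtime, with no ahead-of-time compilation to native code
🔴 GIL (Global Interpreter Lock): Limits true multi-threading
🔴 Dynamic graph construction: In eager mode, operations are dispatched one by one on every forward pass, so the framework never sees the whole graph to optimize it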
🔴 Memory overhead: Python runtime + garbage collector (~200-300MB)
🔴 Boundary crossings: Frequent Python ↔ C++ calls add latency

Training vs Inference: Different Requirements

Here's a key insight that shapes modern ML workflows:

📚 Training (Python is fine)

Frequency: Done once or periodically
Priority: Flexibility and debuggability
Workflow: Iterative experimentation
Hardware: Usually on powerful GPU clusters
Python advantages:
- Easy debugging (print statements, breakpoints)
- Rich ecosystem (data loaders, visualization)
- Quick iteration on model architectures
- Autograd makes gradient computation simple

Verdict: Python overhead is acceptable - flexibility matters more

⚡ Inference (C++ shines)

Frequency: Millions of times per day
Priority: Speed, cost, and efficiency
Workflow: Fixed model, just run it
Hardware: Cost-optimized, often edge devices
C++ advantages:
- No interpreter overhead
- Lower memory footprint
- Faster startup time
- True multi-threading
- Deployable anywhere (mobile, IoT, servers)

Verdict: Every millisecond counts - remove Python overhead

Common ML Production Pattern:

Train in Python/PyTorch: Leverage flexibility, debugging, rich ecosystem
Export to ONNX: Convert trained model to a portable, optimized format
Deploy with C++ ONNX Runtime: Run inference without Python overhead

This gives you the best of both worlds: Python's ease for training, C++'s speed for inference.

Why Training in C++ is Hard (and Rare)

You might wonder: "If C++ is so fast, why not train in C++ too?"

The reality: Training requires:

Frequent code changes (trying architectures, hyperparameters)
Complex gradient computation (autograd)
Rich debugging (inspecting tensors, gradients, losses)
Data pipelines (augmentation, batching, sampling)
Integration with visualization tools (TensorBoard, wandb)

Implementing all of this in C++ is possible but extremely time-consuming. A research experiment that takes 1 day in Python might take 1-2 weeks in C++. The productivity cost outweighs the performance gain for training (which is done infrequently).

For inference though, the model is frozen - no more experimentation. You just need to run the same computation graph millions of times. This is where C++'s performance advantage justifies the effort.

What We Need for Production Inference

To deploy LLMs efficiently, we need:

✅ Faster inference (lower latency, higher throughput)
✅ Better resource utilization (more requests per GPU)
✅ Lower deployment overhead (no Python interpreter, smaller containers)
✅ Cross-platform compatibility (server, mobile, edge)
✅ Static optimization (graph fusion, kernel selection)

The solution? ONNX (Open Neural Network Exchange) - a format that bridges training (Python) and inference (C++), providing a standardized, optimized way to deploy models across different runtimes and hardware.

But getting there isn't straightforward, especially for complex models like Qwen...

What is ONNX?

ONNX is an open, framework-agnostic format for representing neural networks as a computational graph. It standardizes how layers and operations are described so that models trained in one framework (e.g., PyTorch) can run efficiently across many runtimes and hardware backends.

Dynamic vs Static Execution: A Graph Traversal Analogy

To understand the fundamental difference between Python/PyTorch execution and ONNX, think of graph traversal algorithms like BFS (Breadth-First Search):

🐍 Python-style (Dynamic Execution)

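from collections import deque

# BFS: the graph is built on the spot and traversed immediately
graph = {0: [1, 2], 1: [2], 2: [0, 3], 3: [3]}
queue, visited = deque([0]), {0}
while queue:
    node = queue.popleft()
    print(node)   # "visit" the node: do something with it
    for neighbor in graph[node]:
        if neighbor not in visited:   # enqueue each node only once
            visited.add(neighbor)
            queue.append(neighbor)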

Characteristics:

Graph and traversal happen together
Flexible, but every run rebuilds or interprets the structure
Python interpreter evaluates each line dynamically

📊 ONNX-style (Precompiled Graph)

# BFS graph "prebuilt" in a file
# Runtime only plugs in start node
start_node = 0
output = run_precompiled_graph(start_node)

Characteristics:

Graph structure fixed in advance
Runtime just feeds input → gets output
Efficient, portable, no dynamic construction

Analogy:

Python/PyTorch = "Draw the graph while traversing" - flexible but slower
ONNX = "Graph is drawn once, later you just plug in inputs" - fast and portable

This is why ONNX models run faster: the computational graph is frozen at export time, and the runtime can optimize execution without worrying about dynamic changes.
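To make the analogy concrete, here is a minimal sketch of what "graph drawn once, inputs plugged in later" can look like. It is a toy, not ONNX's actual file format: the `OPS` table, the `GRAPH` list, and this version of `run_precompiled_graph` are hypothetical names invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Operator table: the "runtime" knows how to execute each op name.
OPS: Dict[str, Callable[..., float]] = {
    "Add": lambda a, b: a + b,
    "Mul": lambda a, b: a * b,
    "Relu": lambda a: max(a, 0.0),
}

# The "exported" graph: a fixed, topologically ordered list of
# (op_name, input_names, output_name). It is plain data, not Python control flow.
GRAPH: List[Tuple[str, Tuple[str, ...], str]] = [
    ("Mul", ("x", "w"), "xw"),
    ("Add", ("xw", "b"), "z"),
    ("Relu", ("z",), "y"),
]

def run_precompiled_graph(graph, feeds: Dict[str, float], output: str) -> float:
    """Execute a frozen graph: look up inputs, apply the op, store the result."""
    values = dict(feeds)  # copy so the caller's dict is left untouched
    for op_name, input_names, output_name in graph:
        args = [values[name] for name in input_names]
        values[output_name] = OPS[op_name](*args)
    return values[output]

result = run_precompiled_graph(GRAPH, {"x": 2.0, "w": 3.0, "b": -10.0}, "y")
print(result)  # relu(2*3 - 10) = relu(-4) = 0.0
```

Because `GRAPH` is just data, a real runtime can inspect it before the first run, for example to fuse `Mul` and `Add` into one kernel. That is not possible when the structure only exists as Python code being interpreted.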

ONNX + Python vs ONNX + C++: Why Both Matter

ONNX is a file format that acts like a blueprint or recipe. Three pieces work together:

📄 ONNX file (.onnx) = The "blueprint" - describes what operations to run and how they connect
🏭 ONNX Runtime = The "factory" - reads the blueprint and executes it
🔧 Programming language (Python/C++) = How you interact with the factory

The ONNX file itself doesn't run anything - it's just a standardized description of the model. You need an ONNX Runtime to actually execute it, and you need to call that runtime from some programming language.

Python vs C++ ONNX Runtime: The Critical Difference

🐍 Python + ONNX Runtime

import onnxruntime as ort

# Load ONNX model
session = ort.InferenceSession("model.onnx")

# Run inference
outputs = session.run(None, {"input": input_data})

What happens:

Python interpreter starts
Loads Python ONNX Runtime bindings (pybind11 wrapper)
Python → C++ boundary crossing for every API call
Data conversion: numpy array → C++ tensor → GPU memory
ONNX Runtime (C++) does the actual computation
Results converted back: GPU → C++ tensor → numpy → Python

Overhead sources:

Python interpreter startup and package imports (~200-500ms)
GIL (Global Interpreter Lock) for thread safety
Python → C++ function call overhead
Data type conversions and memory copies
Python object management and garbage collection

⚡ C++ + ONNX Runtime

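#include <onnxruntime_cxx_api.h>

// Create the runtime environment and load the ONNX model
Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "qwen");
Ort::SessionOptions options;
Ort::Session session(env, "model.onnx", options);

// Run inference (input_names, input_tensor, and output_names are prepared by the caller)
Ort::RunOptions run_options;
auto outputs = session.Run(run_options,
                           input_names,
                           &input_tensor, 1,
                           output_names, 1);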

What happens:

Native C++ executable starts
Directly calls ONNX Runtime C++ API (no wrapper)
Zero language boundary crossing
Direct memory access: C++ → GPU (no conversions)
ONNX Runtime does computation
Results stay in C++ memory (no conversions)

Advantages:

Fast startup (~50-100ms)
No GIL, true multi-threading
Direct function calls (no overhead)
Zero-copy operations where possible
Manual memory control, no GC pauses

The Performance Stack

Understanding the layers:

Layer	Python ONNX	C++ ONNX
Your Code	Python (.py)	C++ (.cpp)
Language Runtime	Python Interpreter ⚠️	None (compiled) ✅
API Bindings	pybind11 wrapper ⚠️	Direct C++ API ✅
ONNX Runtime	C++ (same for both) ✅	C++ (same for both) ✅
Execution Provider	CUDA/cuDNN/CPU (same) ✅	CUDA/cuDNN/CPU (same) ✅

Key insight: Both use the same ONNX Runtime and execution providers (CUDA, etc.). The difference is in the layers above - Python adds interpretation and binding overhead, C++ goes direct.

The Implementation Journey: Converting PyTorch to ONNX

Now that we understand why we need ONNX + C++ for production inference, let's dive into the how. Converting a PyTorch LLM to ONNX isn't as simple as calling torch.onnx.export() - you'll encounter several challenges along the way.

This section covers the real-world problems I faced when converting Qwen models from PyTorch to ONNX, and the solutions that made it work. The journey involves four main parts:

Part 1: Solving the dynamic cache export problem
Part 2: Configuring ONNX export with optimizations
Part 3: Building the C++ inference engine
Part 4: Running the engine on CPU vs GPU

Heads up: These aren't abstract concepts - they're concrete technical challenges you'll face when converting any modern transformer model to ONNX. Understanding these will save you hours of debugging.

Part 1: The Dynamic Cache Problem

Initial Attempt: Direct HuggingFace Export

My first attempt was to export Qwen directly using HuggingFace Transformers:

from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B")
dummy_input = torch.ones(1, 10, dtype=torch.long)

torch.onnx.export(
    model,
    dummy_input,
    "qwen.onnx",
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq_len"}}
)

Result: ❌ Failed with Dynamic Cache Error

RuntimeError: Trying to export a `DynamicCache` but the current version 
of ONNX doesn't support dynamic control flow. Please open an issue at 
https://github.com/pytorch/pytorch/issues

The Root Cause

Modern transformer models use KV-cache (key-value cache) to avoid recomputing attention for previously processed tokens. HuggingFace's default implementation uses DynamicCache, which involves:

Python dictionaries
Dynamic list operations
Runtime-dependent control flow

None of these translate cleanly to ONNX's static graph format.

The Solution: Custom Forward Pass

The fix required implementing a custom forward() function that:

1. Manages KV-cache explicitly as input/output tensors
2. Uses static operations (no dynamic lists or dicts)
3. Handles cache concatenation manually

Here's the key insight - instead of letting HuggingFace manage the cache internally, we expose it as model inputs and outputs.
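The cache-in, cache-out pattern can be sketched in plain Python. This is a toy stand-in, not the QWENWrapper implementation: `toy_step` and `generate` are hypothetical names, the "keys" and "values" are fake one-element vectors, and the "next token" is derived from the cache length rather than from real logits.

```python
from typing import List, Tuple

Cache = List[List[float]]  # one small vector per cached token

def toy_step(token_ids: List[int], past_keys: Cache, past_values: Cache
             ) -> Tuple[Cache, Cache, int]:
    """Stand-in for one forward pass with the cache passed in and out."""
    # A real model computes these with learned projections.
    new_keys = [[float(t)] for t in token_ids]
    new_values = [[float(t) * 0.5] for t in token_ids]
    # "Concatenate along the sequence axis" and return NEW caches
    # instead of mutating hidden state inside the model.
    out_keys = past_keys + new_keys
    out_values = past_values + new_values
    # Fake "argmax of the logits": depends only on the cache length.
    next_id = len(out_keys) % 7
    return out_keys, out_values, next_id

def generate(prompt_ids: List[int], max_new_tokens: int) -> List[int]:
    keys: Cache = []    # empty cache, like past_seq_len = 0
    values: Cache = []
    # Prefill: process the whole prompt once.
    keys, values, next_id = toy_step(prompt_ids, keys, values)
    generated = [next_id]
    for _ in range(max_new_tokens - 1):
        # Decode: feed only the newest token plus the caches from the last call.
        keys, values, next_id = toy_step([next_id], keys, values)
        generated.append(next_id)
    return generated

print(generate([11, 12, 13], max_new_tokens=4))  # [3, 4, 5, 6]
```

The point of the design is that the step function has no hidden state: everything it needs arrives as arguments and everything it updates leaves as return values, which is what a static graph can express.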

Implementation Reference: See the complete QWENWrapper class implementation in src/exporter.py

Part 2: ONNX Export with Optimizations

Implementation Reference: See the complete export function export_to_onnx(config: ExportConfig) in src/exporter.py#L141

Export Configuration

With the custom forward pass, export becomes straightforward:

import torch

def export_model(model, output_path):
    wrapped_model = QWENWrapper(model)

    # Prepare dummy inputs
    batch_size = 1
    seq_len = 8
    num_layers = model.config.num_hidden_layers
    num_kv_heads = model.config.num_key_value_heads
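    # Some configs (e.g. Qwen3) define head_dim explicitly; otherwise derive it
    head_dim = getattr(model.config, "head_dim", None) or (
        model.config.hidden_size // model.config.num_attention_heads)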

    dummy_input_ids = torch.ones(batch_size, seq_len, dtype=torch.int32)
    dummy_history_len = torch.tensor([0], dtype=torch.int64)
    dummy_ids_len = torch.tensor([seq_len], dtype=torch.int64)
    dummy_attention_mask = torch.tensor([1], dtype=torch.int8)

    # Empty KV caches
    dummy_past_kvs = []
    for _ in range(num_layers * 2):  # keys and values
        dummy_past_kvs.append(
            torch.zeros(num_kv_heads, batch_size, 0, head_dim, dtype=torch.float32)
        )

    inputs = (dummy_input_ids, dummy_history_len, dummy_ids_len, 
              dummy_attention_mask, *dummy_past_kvs)

    # Export with dynamic axes
    dynamic_axes = {
        "input_ids": {1: "seq_len"},
    }

    # Add dynamic axes for all KV caches
    for i in range(num_layers):
        dynamic_axes[f"past_key_{i}"] = {2: "past_seq_len"}
        dynamic_axes[f"past_value_{i}"] = {2: "past_seq_len"}

    torch.onnx.export(
        wrapped_model,
        inputs,
        output_path,
        export_params=True,
        opset_version=13,  # Important: 14+ has GPU compatibility issues
        do_constant_folding=True,
        input_names=["input_ids", "history_len", "ids_len", "attention_mask"] + 
                    [f"past_key_{i}" for i in range(num_layers)] +
                    [f"past_value_{i}" for i in range(num_layers)],
        output_names=[f"out_key_{i}" for i in range(num_layers)] +
                     [f"out_value_{i}" for i in range(num_layers)] +
                     ["max_logit_id", "kv_seq_len"],
        dynamic_axes=dynamic_axes,
    )

Issue: ONNX Runtime's GPU builds don't include all CPU fallback kernels for newer opsets.

Solution: Use opset 13 for maximum compatibility.
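The naming scheme used in the export call above is easy to get wrong once the layer count grows, so here is a small sketch that builds it in one place. `build_io_spec` and `cache_feedback` are hypothetical helper names, not functions from src/exporter.py. The tensor names and axis labels are the ones shown in the export code.

```python
from typing import Dict, List, Tuple

def build_io_spec(num_layers: int
                  ) -> Tuple[List[str], List[str], Dict[str, Dict[int, str]]]:
    """Return (input_names, output_names, dynamic_axes) for the export call."""
    input_names = (
        ["input_ids", "history_len", "ids_len", "attention_mask"]
        + [f"past_key_{i}" for i in range(num_layers)]
        + [f"past_value_{i}" for i in range(num_layers)]
    )
    output_names = (
        [f"out_key_{i}" for i in range(num_layers)]
        + [f"out_value_{i}" for i in range(num_layers)]
        + ["max_logit_id", "kv_seq_len"]
    )
    dynamic_axes = {"input_ids": {1: "seq_len"}}
    for i in range(num_layers):
        dynamic_axes[f"past_key_{i}"] = {2: "past_seq_len"}
        dynamic_axes[f"past_value_{i}"] = {2: "past_seq_len"}
    return input_names, output_names, dynamic_axes

def cache_feedback(num_layers: int) -> Dict[str, str]:
    """Map each cache output of step t to the cache input of step t+1."""
    mapping = {}
    for i in range(num_layers):
        mapping[f"out_key_{i}"] = f"past_key_{i}"
        mapping[f"out_value_{i}"] = f"past_value_{i}"
    return mapping

inputs, outputs, axes = build_io_spec(num_layers=2)
print(inputs)
print(outputs)
print(cache_feedback(2))
```

The feedback mapping is what the inference loop relies on: after every run, each `out_key_i` / `out_value_i` tensor is fed back as `past_key_i` / `past_value_i` for the next token.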

Part 3: Building the C++ Inference Engine

The C++ engine uses:
- ONNX Runtime 1.19.0 (GPU build)
- HuggingFace Tokenizers (C++ bindings) for fast tokenization
- nlohmann/json for configuration
- CUDA + cuDNN 9 for GPU acceleration

Key Implementation Challenges

The Problem: HuggingFace doesn't provide official C++ tokenizer bindings. While Python developers can directly use transformers.AutoTokenizer, C++ developers face a critical gap in the inference pipeline.

The Solution: HuggingFace Tokenizers C++ Bindings

I use thammegowda/tokenizers, a C++ binding for HuggingFace tokenizers that has an open pull request to the official HuggingFace tokenizers repository. This provides:

✅ Full compatibility with HuggingFace tokenizer.json files
✅ Fast C++ implementation (no Python overhead)
✅ Supports all Qwen tokenizer features (special tokens, vocab, etc.)
✅ Seamless CMake integration

Adding the library as a git submodule gives you production-grade tokenization with minimal integration effort.

Part 4: CPU vs GPU Execution - Understanding the Difference

The C++ inference engine supports both CPU and GPU execution. While the core ONNX model and inference logic remain the same, how you run the executable differs significantly depending on which hardware you're using.

Key Differences: CPU vs GPU Execution

💻 CPU Execution

# Simple and direct - just run it
./build/bin/onnx_inference "What is AI?"

What happens:

✅ No special environment setup needed
✅ Uses default system libraries
✅ ONNX Runtime automatically uses CPU execution provider
✅ Works out of the box after compilation

🎮 GPU Execution (CUDA)

# Requires wrapper script for library setup
./scripts/run_gpu_inference.sh "What is AI?"

What happens:

⚠️ Needs CUDA/cuDNN libraries properly configured
⚠️ Requires setting environment variables
⚠️ Must resolve library conflicts
✅ Much faster inference (10-20x speedup)

Setting Up GPU Inference

1. Install cuDNN 9

cuDNN (CUDA Deep Neural Network library) provides GPU-accelerated implementations of neural network operations. It's essential for fast GPU inference.

# Install via conda (recommended - handles dependencies automatically)
conda install -c conda-forge cudnn=9

# This installs cuDNN to: $CONDA_PREFIX/lib/
# But the system doesn't know to look there by default!

2. The Wrapper Script Solution

To make GPU execution work, we use a wrapper script (scripts/run_gpu_inference.sh) that sets up the environment before running the executable:

#!/bin/bash
# scripts/run_gpu_inference.sh

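# Tell the dynamic linker where to find the conda-installed CUDA/cuDNN libraries
# (the ${VAR:+...} form avoids a stray trailing colon when LD_LIBRARY_PATH is unset)
export LD_LIBRARY_PATH="$CONDA_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"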

# Force loading the correct C++ standard library to avoid version conflicts
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libstdc++.so.6

# Now run the actual executable with GPU support
./build/bin/onnx_inference "$@"
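When GPU startup fails, it helps to check whether the dynamic linker can find the libraries at all before debugging the C++ code. The sketch below does that with Python's ctypes. `check_shared_libraries` is a hypothetical helper, and the library file names are assumptions for a CUDA 12 / cuDNN 9 install, so adjust them to match yours.

```python
import ctypes
from typing import Dict, Iterable

def check_shared_libraries(names: Iterable[str]) -> Dict[str, bool]:
    """Try to load each shared library; True means the loader found it."""
    status = {}
    for name in names:
        try:
            ctypes.CDLL(name)
            status[name] = True
        except OSError:
            # Raised when the library (or one of its dependencies) cannot be loaded.
            status[name] = False
    return status

if __name__ == "__main__":
    # Assumed sonames; they are not taken from the article's repository.
    wanted = ["libcudart.so.12", "libcudnn.so.9"]
    for lib, ok in check_shared_libraries(wanted).items():
        print(f"{lib}: {'found' if ok else 'NOT found'}")
```

Run it in the same shell, with the same LD_LIBRARY_PATH, that launches the inference binary. A result from a different environment says nothing about what the executable will see.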

3. Configure CUDA Provider in C++

In the C++ code, we configure ONNX Runtime to use the CUDA execution provider:

OrtCUDAProviderOptions cuda_options;
cuda_options.device_id = 0;  // Use GPU 0 (first GPU)
cuda_options.cudnn_conv_algo_search = OrtCudnnConvAlgoSearchExhaustive;  // Find best convolution algorithm
cuda_options.gpu_mem_limit = SIZE_MAX;  // No memory limit - use all available GPU RAM
cuda_options.arena_extend_strategy = 1;  // 0 = kNextPowerOfTwo, 1 = kSameAsRequested (grow the arena by exactly the requested size)
cuda_options.do_copy_in_default_stream = 1;  // Use default CUDA stream for copies

// Tell ONNX Runtime to use CUDA for execution
session_options.AppendExecutionProvider_CUDA(cuda_options);
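For comparison with the Python ONNX baseline, roughly the same provider settings can be expressed through the ONNX Runtime Python API. This is a sketch rather than the configuration used for the benchmarks: `cuda_providers` and `create_session` are hypothetical helpers, and listing the CPU provider as an explicit fallback is my assumption.

```python
from typing import Any, Dict, List, Tuple, Union

Provider = Union[str, Tuple[str, Dict[str, Any]]]

def cuda_providers(device_id: int = 0) -> List[Provider]:
    """Build a providers list mirroring the C++ OrtCUDAProviderOptions above."""
    cuda_options = {
        "device_id": device_id,
        "cudnn_conv_algo_search": "EXHAUSTIVE",
        "arena_extend_strategy": "kSameAsRequested",
        "do_copy_in_default_stream": True,
    }
    # Providers are tried in order; CPU is listed as an explicit fallback.
    return [("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"]

def create_session(model_path: str, device_id: int = 0):
    """Create an InferenceSession on the GPU (requires the onnxruntime-gpu package)."""
    import onnxruntime as ort  # imported lazily so the helper above works without it
    return ort.InferenceSession(model_path, providers=cuda_providers(device_id))

print(cuda_providers())
```

After creating a session, `session.get_providers()` reports which providers are actually active. If CUDA failed to load, only the CPU provider shows up, which is an easy way to catch a silent fallback.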

Current Status: Functional But Underperforming 🚧

After navigating through dynamic cache challenges, ONNX export complexities, and C++ integration hurdles, the inference engine is now fully operational and producing correct outputs. However, initial benchmarking reveals performance is lower than expected. This is an opportunity to identify and fix optimization bottlenecks.

What's Working

✅ ONNX Export: Qwen models successfully exported with custom forward pass
✅ C++ Inference Engine: Complete implementation with ONNX Runtime 1.19.0
✅ Tokenizer Integration: HuggingFace tokenizers working via C++ bindings
✅ GPU Acceleration: CUDA execution provider with cuDNN 9
✅ KV Cache Management: Efficient cache handling across iterations
✅ End-to-End Pipeline: Input text → tokens → inference → detokenization → output
✅ Correctness: Generated text is coherent and accurate

Performance Benchmarking Results

Here are the actual benchmarks comparing different inference approaches:

Method	Tokens/sec	Speedup vs C++	Status
vLLM GPU (0.8)	40.34 ⭐	2.40x	Baseline (optimized)
Python ONNX (0.8)	21.78	1.29x	Good
C++ ONNX GPU	16.84	1.0x	Needs optimization

Key Observations:

🔴 C++ with ONNX Runtime is about 23% slower than Python ONNX (16.84 vs 21.78 tokens/sec), even though it should have no Python overhead
🔴 Both are significantly slower than vLLM (specialized inference framework)
⚠️ This suggests optimization opportunities in tensor management, CUDA stream usage, or ONNX Runtime configuration
✅ The engine produces correct outputs, so performance gap is likely not algorithmic
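The ratios behind these observations are simple arithmetic on the table values. The short script below recomputes them, and also converts throughput into per-token latency, which is often the easier number to reason about when hunting for overhead. The helper names are hypothetical.

```python
# Throughput figures taken from the benchmark table (tokens per second).
results = {
    "vLLM GPU": 40.34,
    "Python ONNX": 21.78,
    "C++ ONNX GPU": 16.84,
}

def speedup(tps: float, baseline_tps: float) -> float:
    """How many times faster `tps` is than the baseline."""
    return tps / baseline_tps

def percent_slower(slow_tps: float, fast_tps: float) -> float:
    """How much lower `slow_tps` is, as a percentage of `fast_tps`."""
    return (fast_tps - slow_tps) / fast_tps * 100.0

def ms_per_token(tps: float) -> float:
    """Average latency per generated token in milliseconds."""
    return 1000.0 / tps

baseline = results["C++ ONNX GPU"]
for name, tps in results.items():
    print(f"{name}: {speedup(tps, baseline):.2f}x vs C++, "
          f"{ms_per_token(tps):.1f} ms/token")

gap = percent_slower(results["C++ ONNX GPU"], results["Python ONNX"])
print(f"C++ is {gap:.1f}% slower than Python ONNX")
```

In latency terms, the C++ engine spends roughly 13 ms more per token than the Python ONNX path. That extra per-token time is what the optimization work needs to account for.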

Next Steps: Active Optimization Work

I'm actively working on improving performance and will post progress updates regularly.

Contribute or Track Progress: The complete code is available in the GitHub repository. If you have insights on optimization or find bottlenecks, please open an issue! This is an open optimization problem and community input is valuable.
