[Disclaimer: I wrote this blog post solely for my own understanding. If you find any mistakes that need to be addressed, please feel free to reach out to me.]

Building a High-Performance ONNX Inference Engine for Qwen LLMs: From PyTorch to C++ with GPU Acceleration

A deep dive into exporting Qwen language models to ONNX and building a production-ready C++ inference engine

TL;DR

I'm building a high-performance ONNX inference engine for Qwen language models in C++ with GPU acceleration. Currently achieving ~16.84 tokens/sec on GPU, with significant optimization opportunities identified. This article covers: dealing with HuggingFace Transformers' dynamic cache limitations, implementing a custom forward pass, exporting to ONNX, fixing ONNX Runtime API issues, and GPU inference with cuDNN 9. Performance optimization is actively underway.

GitHub Repository: llm-inference

⚠️ Current Status: The C++ inference engine is functionally complete but performance is lower than expected (16.84 tokens/sec vs Python's 21.78 tokens/sec). I'm actively working on optimization. See "Performance Benchmarking" and "Next Steps" sections for details on where improvements will focus.

Attribution: The custom forward pass implementation and ONNX export approach were adapted from DakeQQ/Native-LLM-for-Android, an excellent project for running LLMs on mobile devices. I've extended and modified these techniques for server-side GPU inference.

The Challenge: Running LLMs Efficiently

Large Language Models (LLMs) are powerful but resource-intensive. While Python frameworks like HuggingFace Transformers and PyTorch make it easy to prototype and train models, production inference deployments face significant challenges.

The PyTorch Overhead Problem

PyTorch is the go-to framework for training LLMs, and for good reasons - it's flexible, easy to debug, and has excellent GPU support. But here's what's happening under the hood:

PyTorch's Architecture:

  • Python frontend: Your code that defines models, training loops, etc.
  • pybind11 bindings: Translates Python calls to C++
  • libtorch (C++): The actual computation engine
    • ATen tensor library
    • Autograd for gradients
    • CUDA/cuDNN kernels for GPU acceleration

Key insight: Even though PyTorch uses C++ and CUDA for the heavy lifting, Python still orchestrates everything - deciding which operations to run, when, and managing the model structure.

This Python orchestration layer introduces overhead:

Training vs Inference: Different Requirements

Here's a key insight that shapes modern ML workflows:

📚 Training (Python is fine)

  • Frequency: Done once or periodically
  • Priority: Flexibility and debuggability
  • Workflow: Iterative experimentation
  • Hardware: Usually on powerful GPU clusters
  • Python advantages:
    • Easy debugging (print statements, breakpoints)
    • Rich ecosystem (data loaders, visualization)
    • Quick iteration on model architectures
    • Autograd makes gradient computation simple

Verdict: Python overhead is acceptable - flexibility matters more

⚡ Inference (C++ shines)

  • Frequency: Millions of times per day
  • Priority: Speed, cost, and efficiency
  • Workflow: Fixed model, just run it
  • Hardware: Cost-optimized, often edge devices
  • C++ advantages:
    • No interpreter overhead
    • Lower memory footprint
    • Faster startup time
    • True multi-threading
    • Deployable anywhere (mobile, IoT, servers)

Verdict: Every millisecond counts - remove Python overhead

Common ML Production Pattern:

  1. Train in Python/PyTorch: Leverage flexibility, debugging, rich ecosystem
  2. Export to ONNX: Convert trained model to a portable, optimized format
  3. Deploy with C++ ONNX Runtime: Run inference without Python overhead

This gives you the best of both worlds: Python's ease for training, C++'s speed for inference.

Why Training in C++ is Hard (and Rare)

You might wonder: "If C++ is so fast, why not train in C++ too?"

The reality: Training requires:

Implementing all of this in C++ is possible but extremely time-consuming. A research experiment that takes 1 day in Python might take 1-2 weeks in C++. The productivity cost outweighs the performance gain for training (which is done infrequently).

For inference though, the model is frozen - no more experimentation. You just need to run the same computation graph millions of times. This is where C++'s performance advantage justifies the effort.

What We Need for Production Inference

To deploy LLMs efficiently, we need:

The solution? ONNX (Open Neural Network Exchange) - a format that bridges training (Python) and inference (C++), providing a standardized, optimized way to deploy models across different runtimes and hardware.

But getting there isn't straightforward, especially for complex models like Qwen...

What is ONNX?

ONNX is an open, framework-agnostic format for representing neural networks as a computational graph. It standardizes how layers and operations are described so that models trained in one framework (e.g., PyTorch) can run efficiently across many runtimes and hardware backends.

Dynamic vs Static Execution: A Graph Traversal Analogy

To understand the fundamental difference between Python/PyTorch execution and ONNX, think of graph traversal algorithms like BFS (Breadth-First Search):

🐍 Python-style (Dynamic Execution)

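from collections import deque

# BFS graph built on the spot
graph = {0: [1, 2], 1: [2], 2: [0, 3], 3: [3]}
queue, visited = deque([0]), {0}
while queue:
    node = queue.popleft()
    print(node)   # visit the node (do something)
    for neighbor in graph[node]:
        if neighbor not in visited:
            visited.add(neighbor)
            queue.append(neighbor)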

Characteristics:

  • Graph and traversal happen together
  • Flexible, but every run rebuilds or interprets the structure
  • Python interpreter evaluates each line dynamically

📊 ONNX-style (Precompiled Graph)

# BFS graph "prebuilt" in a file
# Runtime only plugs in start node
start_node = 0
output = run_precompiled_graph(start_node)

Characteristics:

  • Graph structure fixed in advance
  • Runtime just feeds input → gets output
  • Efficient, portable, no dynamic construction

Analogy:

  • Python/PyTorch = "Draw the graph while traversing" - flexible but slower
  • ONNX = "Graph is drawn once, later you just plug in inputs" - fast and portable

This is why ONNX models run faster: the computational graph is frozen at export time, and the runtime can optimize execution without worrying about dynamic changes.
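To make the analogy a bit more concrete, here is a toy sketch of the "graph drawn once" idea in plain Python. The node format and the `run_graph` helper are made up for illustration; they are not the ONNX file format or the ONNX Runtime API. The point is only that the structure is stored as data up front and the "runtime" walks it without running any model-definition code.

```python
import operator

# A tiny "frozen" graph: each node is (output_name, op, input_names).
# Nodes are stored in execution order, the way an exported graph would be.
GRAPH = [
    ("h", operator.mul, ("x", "w")),   # h = x * w
    ("y", operator.add, ("h", "b")),   # y = h + b
]
WEIGHTS = {"w": 3.0, "b": 1.0}         # constants baked in at "export" time


def run_graph(graph, weights, inputs):
    """Execute a prebuilt graph: only the stored nodes run, in order."""
    values = {**weights, **inputs}
    for out_name, op, in_names in graph:
        values[out_name] = op(*(values[name] for name in in_names))
    return values


result = run_graph(GRAPH, WEIGHTS, {"x": 2.0})
print(result["y"])  # 2.0 * 3.0 + 1.0 = 7.0
```

Because the graph is plain data, a real runtime is free to inspect it ahead of time (fuse nodes, pick kernels, plan memory) before the first input arrives.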

ONNX + Python vs ONNX + C++: Why Both Matter

As we've seen, ONNX is a file format that acts like a blueprint or recipe for the model.

The ONNX file itself doesn't run anything - it's just a standardized description of the model. You need an ONNX Runtime to actually execute it, and you need to call that runtime from some programming language.

Python vs C++ ONNX Runtime: The Critical Difference

🐍 Python + ONNX Runtime

import onnxruntime as ort

# Load ONNX model
session = ort.InferenceSession("model.onnx")

# Run inference
outputs = session.run(None, {"input": input_data})

What happens:

  1. Python interpreter starts
  2. Loads Python ONNX Runtime bindings (pybind11 wrapper)
  3. Python → C++ boundary crossing for every API call
  4. Data conversion: numpy array → C++ tensor → GPU memory
  5. ONNX Runtime (C++) does the actual computation
  6. Results converted back: GPU → C++ tensor → numpy → Python

Overhead sources:

  • Python interpreter startup (~200-500ms)
  • GIL (Global Interpreter Lock) for thread safety
  • Python → C++ function call overhead
  • Data type conversions and memory copies
  • Python object management and garbage collection

⚡ C++ + ONNX Runtime

#include <onnxruntime_cxx_api.h>

// Load ONNX model
Ort::Session session(env, "model.onnx", options);

// Run inference
auto outputs = session.Run(run_options, 
                           input_names, 
                           &input_tensor, 1,
                           output_names, 1);

What happens:

  1. Native C++ executable starts
  2. Directly calls ONNX Runtime C++ API (no wrapper)
  3. Zero language boundary crossing
  4. Direct memory access: C++ → GPU (no conversions)
  5. ONNX Runtime does computation
  6. Results stay in C++ memory (no conversions)

Advantages:

  • Fast startup (~50-100ms)
  • No GIL, true multi-threading
  • Direct function calls (no overhead)
  • Zero-copy operations where possible
  • Manual memory control, no GC pauses

The Performance Stack

Understanding the layers:

| Layer | Python ONNX | C++ ONNX |
|---|---|---|
| Your Code | Python (.py) | C++ (.cpp) |
| Language Runtime | Python Interpreter ⚠️ | None (compiled) ✅ |
| API Bindings | pybind11 wrapper ⚠️ | Direct C++ API ✅ |
| ONNX Runtime | C++ (same for both) ✅ | C++ (same for both) ✅ |
| Execution Provider | CUDA/cuDNN/CPU (same) ✅ | CUDA/cuDNN/CPU (same) ✅ |

Key insight: Both use the same ONNX Runtime and execution providers (CUDA, etc.). The difference is in the layers above - Python adds interpretation and binding overhead, C++ goes direct.

The Implementation Journey: Converting PyTorch to ONNX

Now that we understand why we need ONNX + C++ for production inference, let's dive into the how. Converting a PyTorch LLM to ONNX isn't as simple as calling torch.onnx.export() - you'll encounter several challenges along the way.

This section covers the real-world problems I faced when converting Qwen models from PyTorch to ONNX, and the solutions that made it work. The journey involves three main parts:

  1. Part 1: Solving the dynamic cache export problem
  2. Part 2: Configuring ONNX export with optimizations
  3. Part 3: Building the C++ inference engine

Heads up: These aren't abstract concepts - they're concrete technical challenges you'll face when converting any modern transformer model to ONNX. Understanding these will save you hours of debugging.

Part 1: The Dynamic Cache Problem

Initial Attempt: Direct HuggingFace Export

My first attempt was to export Qwen directly using HuggingFace Transformers:

from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B")
dummy_input = torch.ones(1, 10, dtype=torch.long)

torch.onnx.export(
    model,
    dummy_input,
    "qwen.onnx",
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq_len"}}
)

Result: ❌ Failed with Dynamic Cache Error

RuntimeError: Trying to export a `DynamicCache` but the current version 
of ONNX doesn't support dynamic control flow. Please open an issue at 
https://github.com/pytorch/pytorch/issues

The Root Cause

Modern transformer models use KV-cache (key-value cache) to avoid recomputing attention for previously processed tokens. HuggingFace's default implementation uses DynamicCache, which involves:

None of these translate cleanly to ONNX's static graph format.

The Solution: Custom Forward Pass

The fix required implementing a custom forward() function that:

  1. Manages KV-cache explicitly as input/output tensors
  2. Uses static operations (no dynamic lists or dicts)
  3. Handles cache concatenation manually

Here's the key insight - instead of letting HuggingFace manage the cache internally, we expose it as model inputs and outputs.
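As a rough illustration of that idea (not the actual QWENWrapper code), the sketch below treats the cache as plain function arguments and return values. The `attention_step` name and the list-of-lists "tensors" are hypothetical stand-ins, and the real attention math is omitted; only the cache plumbing is shown.

```python
from typing import List, Tuple

Cache = List[List[float]]  # one row of floats per cached token (toy stand-in for a tensor)


def attention_step(
    new_keys: Cache, new_values: Cache, past_keys: Cache, past_values: Cache
) -> Tuple[Cache, Cache, int]:
    """Stateless step: the cache is passed in and the grown cache is returned."""
    out_keys = past_keys + new_keys        # concatenate along the sequence axis
    out_values = past_values + new_values
    return out_keys, out_values, len(out_keys)


# Prefill: three prompt tokens, empty cache (sequence length 0).
keys, values, kv_len = attention_step(
    [[0.1], [0.2], [0.3]], [[1.0], [2.0], [3.0]], [], []
)
print(kv_len)  # 3

# Decode: one new token, previous outputs fed back in as inputs.
keys, values, kv_len = attention_step([[0.4]], [[4.0]], keys, values)
print(kv_len)  # 4
```

Since the function keeps no hidden state, the same computation can be traced into a static graph: the caller (Python or C++) owns the cache and hands it back on every step.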

Implementation Reference: See the complete QWENWrapper class implementation in src/exporter.py

Part 2: ONNX Export with Optimizations

Implementation Reference: See the complete export function export_to_onnx(config: ExportConfig) in src/exporter.py#L141

Export Configuration

With the custom forward pass, export becomes straightforward:

def export_model(model, output_path):
    wrapped_model = QWENWrapper(model)

    # Prepare dummy inputs
    batch_size = 1
    seq_len = 8
    num_layers = model.config.num_hidden_layers
    num_kv_heads = model.config.num_key_value_heads
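    # Qwen3 configs define head_dim explicitly; fall back to the usual ratio otherwise
    head_dim = getattr(model.config, "head_dim", None) or (
        model.config.hidden_size // model.config.num_attention_heads
    )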

    dummy_input_ids = torch.ones(batch_size, seq_len, dtype=torch.int32)
    dummy_history_len = torch.tensor([0], dtype=torch.int64)
    dummy_ids_len = torch.tensor([seq_len], dtype=torch.int64)
    dummy_attention_mask = torch.tensor([1], dtype=torch.int8)

    # Empty KV caches
    dummy_past_kvs = []
    for _ in range(num_layers * 2):  # keys and values
        dummy_past_kvs.append(
            torch.zeros(num_kv_heads, batch_size, 0, head_dim, dtype=torch.float32)
        )

    inputs = (dummy_input_ids, dummy_history_len, dummy_ids_len, 
              dummy_attention_mask, *dummy_past_kvs)

    # Export with dynamic axes
    dynamic_axes = {
        "input_ids": {1: "seq_len"},
    }

    # Add dynamic axes for all KV caches
    for i in range(num_layers):
        dynamic_axes[f"past_key_{i}"] = {2: "past_seq_len"}
        dynamic_axes[f"past_value_{i}"] = {2: "past_seq_len"}

    torch.onnx.export(
        wrapped_model,
        inputs,
        output_path,
        export_params=True,
        opset_version=13,  # Important: 14+ has GPU compatibility issues
        do_constant_folding=True,
        input_names=["input_ids", "history_len", "ids_len", "attention_mask"]
                    + [f"past_key_{i}" for i in range(num_layers)]
                    + [f"past_value_{i}" for i in range(num_layers)],
        output_names=[f"out_key_{i}" for i in range(num_layers)]
                     + [f"out_value_{i}" for i in range(num_layers)]
                     + ["max_logit_id", "kv_seq_len"],
        dynamic_axes=dynamic_axes,
    )

Issue: ONNX Runtime's GPU builds don't include all CPU fallback kernels for newer opsets.

Solution: Use opset 13 for maximum compatibility.
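The input and output naming scheme in the export call is easy to get wrong, because the positional order of the dummy inputs has to match `input_names`. A small helper like the one below can keep the two in sync. This is a sketch: `build_io_names` is a hypothetical helper, not something taken from src/exporter.py, and it simply reproduces the name lists from the export call.

```python
from typing import List, Tuple


def build_io_names(num_layers: int) -> Tuple[List[str], List[str]]:
    """Return (input_names, output_names) in the order used by the export call."""
    input_names = (
        ["input_ids", "history_len", "ids_len", "attention_mask"]
        + [f"past_key_{i}" for i in range(num_layers)]
        + [f"past_value_{i}" for i in range(num_layers)]
    )
    output_names = (
        [f"out_key_{i}" for i in range(num_layers)]
        + [f"out_value_{i}" for i in range(num_layers)]
        + ["max_logit_id", "kv_seq_len"]
    )
    return input_names, output_names


inputs, outputs = build_io_names(num_layers=2)  # 2 layers chosen only for illustration
print(inputs)
print(outputs)
```

During generation, each `out_key_i` / `out_value_i` output is fed back as the matching `past_key_i` / `past_value_i` input on the next step, so the inference code can reuse the same lists.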

Part 3: Building the C++ Inference Engine

The C++ engine uses:

  • ONNX Runtime 1.19.0 (GPU build)
  • HuggingFace Tokenizers (C++ bindings) for fast tokenization
  • nlohmann/json for configuration
  • CUDA + cuDNN 9 for GPU acceleration

Key Implementation Challenges

The Problem: HuggingFace doesn't provide official C++ tokenizer bindings. While Python developers can directly use transformers.AutoTokenizer, C++ developers face a critical gap in the inference pipeline.

The Solution: HuggingFace Tokenizers C++ Bindings

I use thammegowda/tokenizers, a C++ binding for HuggingFace tokenizers that has an open pull request to the official HuggingFace tokenizers repository. This provides:

The git submodule approach gives you production-grade tokenization with minimal integration effort.

Part 4: CPU vs GPU Execution - Understanding the Difference

The C++ inference engine supports both CPU and GPU execution. While the core ONNX model and inference logic remain the same, how you run the executable differs significantly depending on which hardware you're using.

Key Differences: CPU vs GPU Execution

💻 CPU Execution

# Simple and direct - just run it
./build/bin/onnx_inference "What is AI?"

What happens:

  • ✅ No special environment setup needed
  • ✅ Uses default system libraries
  • ✅ ONNX Runtime automatically uses CPU execution provider
  • ✅ Works out of the box after compilation

🎮 GPU Execution (CUDA)

# Requires wrapper script for library setup
./scripts/run_gpu_inference.sh "What is AI?"

What happens:

  • ⚠️ Needs CUDA/cuDNN libraries properly configured
  • ⚠️ Requires setting environment variables
  • ⚠️ Must resolve library conflicts
  • ✅ Much faster inference (10-20x speedup)

Setting Up GPU Inference

1. Install cuDNN 9

cuDNN (CUDA Deep Neural Network library) provides GPU-accelerated implementations of neural network operations. It's essential for fast GPU inference.

# Install via conda (recommended - handles dependencies automatically)
conda install -c conda-forge cudnn=9

# This installs cuDNN to: $CONDA_PREFIX/lib/
# But the system doesn't know to look there by default!

2. The Wrapper Script Solution

To make GPU execution work, we use a wrapper script (scripts/run_gpu_inference.sh) that sets up the environment before running the executable:

#!/bin/bash
# scripts/run_gpu_inference.sh

# Tell the system where to find CUDA/cuDNN libraries
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH

# Force loading the correct C++ standard library to avoid version conflicts
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libstdc++.so.6

# Now run the actual executable with GPU support
./build/bin/onnx_inference "$@"
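When the GPU run fails with a missing-library error, it helps to check what is actually visible on the search path. The sketch below is a hypothetical diagnostic helper (`find_library` is my own name, not part of the repository). It only scans the directories listed in LD_LIBRARY_PATH; the real dynamic loader also consults other locations such as the ld.so cache, so treat a miss here as a hint rather than proof.

```python
import os
from typing import Optional


def find_library(prefix: str, search_path: str) -> Optional[str]:
    """Return the first file whose name starts with `prefix` in a path list."""
    for directory in search_path.split(os.pathsep):
        if not directory or not os.path.isdir(directory):
            continue  # skip empty entries and directories that do not exist
        for name in sorted(os.listdir(directory)):
            if name.startswith(prefix):
                return os.path.join(directory, name)
    return None


hit = find_library("libcudnn.so", os.environ.get("LD_LIBRARY_PATH", ""))
print(hit or "libcudnn not found on LD_LIBRARY_PATH")
```

Running this inside the same environment as the wrapper script (after the `export` lines) shows whether `$CONDA_PREFIX/lib` is being picked up.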

3. Configure CUDA Provider in C++

In the C++ code, we configure ONNX Runtime to use the CUDA execution provider:

OrtCUDAProviderOptions cuda_options;
cuda_options.device_id = 0;  // Use GPU 0 (first GPU)
cuda_options.cudnn_conv_algo_search = OrtCudnnConvAlgoSearchExhaustive;  // Find best convolution algorithm
cuda_options.gpu_mem_limit = SIZE_MAX;  // No memory limit - use all available GPU RAM
cuda_options.arena_extend_strategy = 1;  // 1 = kSameAsRequested: grow the arena by the requested size (0 = kNextPowerOfTwo)
cuda_options.do_copy_in_default_stream = 1;  // Use default CUDA stream for copies

// Tell ONNX Runtime to use CUDA for execution
session_options.AppendExecutionProvider_CUDA(cuda_options);
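For comparison, the Python ONNX Runtime API takes the same settings as a `providers` list of (name, options) pairs. The sketch below only builds that list, so it runs without a GPU; `build_providers` is a hypothetical helper, and `gpu_mem_limit` is left at its default rather than set explicitly.

```python
from typing import Any, Dict, List, Tuple, Union

Provider = Union[str, Tuple[str, Dict[str, Any]]]


def build_providers(use_gpu: bool, device_id: int = 0) -> List[Provider]:
    """Build an ONNX Runtime `providers` list, with CPU as the fallback."""
    providers: List[Provider] = []
    if use_gpu:
        cuda_options = {
            "device_id": device_id,
            "cudnn_conv_algo_search": "EXHAUSTIVE",
            "arena_extend_strategy": "kSameAsRequested",
            "do_copy_in_default_stream": True,
        }
        providers.append(("CUDAExecutionProvider", cuda_options))
    providers.append("CPUExecutionProvider")  # used when CUDA is unavailable
    return providers


print(build_providers(use_gpu=True))

# With onnxruntime-gpu installed, the list would be passed like this
# ("model.onnx" is a placeholder path):
#   import onnxruntime as ort
#   session = ort.InferenceSession("model.onnx", providers=build_providers(True))
```

Listing the CPU provider last keeps the session usable on machines where the CUDA libraries fail to load, which is handy while debugging the environment setup described above.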

Current Status: Functional But Underperforming 🚧

After navigating through dynamic cache challenges, ONNX export complexities, and C++ integration hurdles, the inference engine is now fully operational and producing correct outputs. However, initial benchmarking reveals performance is lower than expected. This is an opportunity to identify and fix optimization bottlenecks.

What's Working

  • ✅ ONNX Export: Qwen models successfully exported with custom forward pass
  • ✅ C++ Inference Engine: Complete implementation with ONNX Runtime 1.19.0
  • ✅ Tokenizer Integration: HuggingFace tokenizers working via C++ bindings
  • ✅ GPU Acceleration: CUDA execution provider with cuDNN 9
  • ✅ KV Cache Management: Efficient cache handling across iterations
  • ✅ End-to-End Pipeline: Input text → tokens → inference → detokenization → output
  • ✅ Correctness: Generated text is coherent and accurate

Performance Benchmarking Results

Here are the actual benchmarks comparing different inference approaches:

| Method | Tokens/sec | Speedup vs C++ | Status |
|---|---|---|---|
| vLLM GPU (0.8) | 40.34 ⭐ | 2.39x | Baseline (optimized) |
| Python ONNX (0.8) | 21.78 | 1.29x | Good |
| C++ ONNX GPU | 16.84 | 1.0x | Needs optimization |
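The speedup column is each method's throughput divided by the C++ figure. The short sketch below redoes that arithmetic from the numbers in the table; the helper names are my own, and `tokens_per_second` just shows how a single run would be turned into a throughput value (the timing itself is not reproduced here).

```python
# Throughput figures copied from the benchmark table (tokens per second).
results = {
    "vLLM GPU": 40.34,
    "Python ONNX": 21.78,
    "C++ ONNX GPU": 16.84,
}


def tokens_per_second(num_tokens: int, elapsed_seconds: float) -> float:
    """Throughput of one generation run."""
    return num_tokens / elapsed_seconds


def speedup_vs(baseline: str, table: dict) -> dict:
    """Each method's throughput divided by the baseline's throughput."""
    return {name: tps / table[baseline] for name, tps in table.items()}


for name, ratio in speedup_vs("C++ ONNX GPU", results).items():
    print(f"{name}: {ratio:.3f}x")
```

Read the other way round, the C++ engine currently reaches roughly 77% of the Python ONNX throughput (16.84 / 21.78), which is the gap the optimization work needs to close first.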

Key Observations:

Next Steps: Active Optimization Work

I'm actively working on improving performance and will post progress updates regularly.

Contribute or Track Progress: The complete code is available in the GitHub repository. If you have insights on optimization or find bottlenecks, please open an issue! This is an open optimization problem and community input is valuable.