
Run OpenAI GPT OSS on NVIDIA Jetson AGX Thor: 120B vs 20B

  • Writer: David Monge
  • 18 hours ago
  • 5 min read
GPT-OSS on Jetson Thor

Running large language models at the edge is becoming practical as embedded platforms continue to scale in performance. In this post, we show how to run OpenAI's GPT-OSS-20B and GPT-OSS-120B models on the NVIDIA Jetson AGX Thor, focusing on real, reproducible inference!


Since these are open-weight models, they can be freely tested on different hardware, which makes it interesting to see how they behave beyond data-center systems. We already tried out the GPT-OSS-120B model on an NVIDIA DGX Spark (check out our blog post). Here, we take a different angle and use the same models on an embedded, edge-focused platform, the Jetson AGX Thor.


We walk through the setup, provide a concrete code example with tool calling, and benchmark both models. The results include RAM usage, GPU and CPU consumption, and tokens per second.


Benchmark summary (TL;DR)


On Jetson AGX Thor, GPT-OSS-120B uses ~4.6× more VRAM than GPT-OSS-20B while achieving throughput in the same range (9.85 ± 0.95 tokens/s for 120B vs 12.41 ± 2.64 tokens/s for 20B). CPU usage, GPU utilization, and system RAM are all low and broadly similar between runs.


What are OpenAI GPT-OSS-120B and GPT-OSS-20B?


GPT-OSS is OpenAI’s open-weights large language model (LLM) family, offered under an Apache 2.0 license, which makes it straightforward to experiment with and deploy across a wide range of platforms. OpenAI provides two sizes: GPT-OSS-120B (about 117B parameters) and GPT-OSS-20B (about 21B parameters).


From a developer standpoint, they integrate cleanly with Hugging Face Transformers, allowing existing inference code to be adapted without a major rewrite. According to OpenAI’s introduction of GPT-OSS, GPT-OSS-120B achieves results comparable to OpenAI o4-mini while running on a single 80 GB GPU, and GPT-OSS-20B delivers performance similar to OpenAI o3-mini and can run on edge devices with just 16 GB of memory!
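As a quick illustration of how little code that takes (the prompt below is ours, and we assume the published Hugging Face model IDs), a minimal Transformers pipeline call looks roughly like this:

from transformers import pipeline

# Chat-style call through the high-level pipeline API.
# device_map="auto" places the weights on the GPU; torch_dtype="auto" keeps
# the checkpoint's native precision.
pipe = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",   # or "openai/gpt-oss-120b"
    torch_dtype="auto",
    device_map="auto",
)

messages = [{"role": "user", "content": "In one sentence, what is edge AI?"}]
result = pipe(messages, max_new_tokens=128)

# The pipeline returns the whole conversation; the last message is the reply.
print(result[0]["generated_text"][-1]["content"])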


For Jetson AGX Thor, the key point is memory: Thor comes with 128 GB of unified memory, which is a big reason it’s a realistic edge platform for running both models. If you want more background on the model family and a data-center-style reference run, check our DGX Spark blog post.


GPT-OSS on Jetson AGX Thor
GPT-OSS size and footprint on Jetson AGX Thor

NVIDIA Jetson AGX Thor for Local LLM Inference


NVIDIA Jetson AGX Thor is a compact edge AI platform built around the Jetson T5000 module. It is designed for physical AI and robotics, bringing data-center-class transformer inference to an embedded form factor. It offers:


  • Up to 2,070 TFLOPS (FP4—sparse) AI performance

  • 128 GB unified LPDDR5X (256-bit) with 273 GB/s memory bandwidth

  • 1 TB NVMe M.2 Key M Slot

  • Configurable 40 W–130 W power


These characteristics make it a strong fit for local LLM inference, especially for open-weight models like GPT-OSS-20B and GPT-OSS-120B.
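
As a quick sanity check before loading a model, you can confirm what PyTorch sees on the device (illustrative snippet; the reported figure is the unified pool visible to the GPU, and exact numbers depend on your JetPack/CUDA stack):

import torch

# Confirm the GPU is visible and report the memory pool PyTorch can use.
assert torch.cuda.is_available(), "CUDA device not visible to PyTorch"
props = torch.cuda.get_device_properties(0)
print(torch.cuda.get_device_name(0))
print(f"GPU-visible memory: {props.total_memory / 1024**3:.1f} GiB")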


Find more information about the Jetson AGX Thor in our wiki.


How to run GPT-OSS on Jetson AGX Thor


  1. Install UV if you haven’t. You may have to restart the terminal.


curl -LsSf https://astral.sh/uv/install.sh | sh
  2. Create a script named gptoss.py with the chat and tool-calling code (a minimal sketch of what it can look like is shown after the tool-calling walkthrough below).


This script demonstrates on-device chat inference with GPT-OSS and a minimal tool-calling loop using Hugging Face Transformers.

  3. Install Python 3.13 on the system if you haven’t.

uv python install 3.13

  4. Add PyTorch for CUDA 13.0 as a dependency.


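The exact command for this step isn't shown here. Assuming the cu130 wheels on PyTorch's index cover Thor's aarch64 + CUDA 13.0 stack (otherwise use NVIDIA's Jetson PyTorch wheels), one plausible form is:

uv add --script gptoss.py torch --index https://download.pytorch.org/whl/cu130
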
  5. Install the rest of the dependencies.

uv add --script gptoss.py transformers kernels accelerate "triton>=3.4"
  6. And finally, run the script.

TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas uv run ./gptoss.py

TRITON_PTXAS_PATH is required so that Triton uses the CUDA toolkit’s ptxas when building its GPU kernels.


Run GPT-OSS on NVIDIA Jetson Thor in terminal
Terminal on Jetson AGX Thor running uv run gptoss.py with TRITON_PTXAS_PATH set, showing dependency installation, model file fetching, and checkpoint shard loading progress.

And as simple as that, you now have a tiny chat loop where you can keep a running conversation with GPT-OSS on a Jetson AGX Thor!


On top of plain chatting, the script also wires in a simple temperature tool. Right now it’s a dummy function with a fixed output (you can see it in the result below) for any city and either celsius or fahrenheit. The interesting part is the flow: when your prompt looks like it needs outside data (temperature/weather), the model can emit a tool call, your code executes the function, appends the tool result to the chat history, and then the model produces the final response using that tool output.
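
To make that flow concrete, here is a minimal sketch of what such a loop can look like with Hugging Face Transformers. It is not the exact gptoss.py from this post: the model ID, the dummy tool, the Harmony-parsing regexes, and the shape of the tool-result messages are our assumptions, so adapt them to the chat template you actually use.

# Sketch of a chat loop with a dummy temperature tool (not the exact gptoss.py).
import json
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "openai/gpt-oss-20b"  # or "openai/gpt-oss-120b"

def get_current_temperature(city: str, unit: str = "celsius") -> str:
    """
    Get the current temperature for a city (dummy tool with a fixed answer).

    Args:
        city: Name of the city.
        unit: Temperature unit, either "celsius" or "fahrenheit".
    """
    return json.dumps({"city": city, "unit": unit, "temperature": 22})

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tools = [get_current_temperature]

# Illustrative patterns for Harmony-style output (see the regex note below).
TOOL_CALL_RE = re.compile(r"to=functions\.(\w+).*?<\|message\|>(\{.*?\})<\|call\|>", re.S)
FINAL_RE = re.compile(r"<\|channel\|>final<\|message\|>(.*?)(?:<\|return\|>|<\|end\|>|$)", re.S)

def generate(messages):
    inputs = tokenizer.apply_chat_template(
        messages, tools=tools, add_generation_prompt=True,
        return_tensors="pt", return_dict=True,
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=512)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:])

messages = []
while True:
    user = input("You: ").strip()
    if user.lower() in {"exit", "quit"}:
        break
    messages.append({"role": "user", "content": user})
    text = generate(messages)

    call = TOOL_CALL_RE.search(text)
    if call:
        # The model asked for outside data: run the tool locally, then feed
        # the result back. The message shapes below are assumptions; adjust
        # them to whatever your chat template expects.
        name, args = call.group(1), json.loads(call.group(2))
        result = get_current_temperature(**args)
        messages.append({"role": "assistant", "tool_calls": [
            {"type": "function", "function": {"name": name, "arguments": args}}]})
        messages.append({"role": "tool", "name": name, "content": result})
        text = generate(messages)

    final = FINAL_RE.search(text)
    answer = final.group(1).strip() if final else text.strip()
    messages.append({"role": "assistant", "content": answer})
    print("GPT-OSS:", answer)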


Interactive conversation with GPT-OSS on NVIDIA Jetson Thor
Terminal on Jetson AGX Thor running uv run gptoss.py with TRITON_PTXAS_PATH set, showing the interactive conversation between the user and the model.

The key takeaway is not the dummy temperature result, but the full tool-call flow executing locally on the Jetson and feeding results back into the model.


If you want a longer, step-by-step explanation (plus common pitfalls like why TRITON_PTXAS_PATH is set), go to our DGX Spark post (this Thor flow mirrors it closely).


That post also explains the weird regex pattern in gptoss.py: when you decode() the model output you don’t just get plain text, you get OpenAI’s Harmony-style formatted output, with special tokens and separate channels (e.g., analysis vs final, plus tool-call sections). The regex is simply extracting the final answer for the user (and, if present, the tool-call payload) from that structured text.
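
As a toy illustration (the sample string below follows the published Harmony token names, but is not taken from a real run), extracting the final channel looks like this:

import re

# Decoded Harmony-style output: an analysis channel followed by the final one.
raw = (
    "<|channel|>analysis<|message|>User wants the weather; call the tool.<|end|>"
    "<|start|>assistant<|channel|>final<|message|>It is 22 °C in Paris.<|return|>"
)

final = re.search(r"<\|channel\|>final<\|message\|>(.*?)(?:<\|return\|>|<\|end\|>|$)", raw, re.S)
print(final.group(1))  # -> It is 22 °C in Paris.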


GPT-OSS-120B vs GPT-OSS-20B on Jetson AGX Thor


To compare both models, we measured CPU and GPU utilization, RAM and VRAM usage, and generation throughput (tokens/s). All measurements were taken on the same Jetson AGX Thor configuration with the same chat workload.

Model | CPU usage (%) | GPU usage (%) | RAM (MiB) | RAM (%) | VRAM (MiB) | VRAM (%) | Throughput (tokens/s)
GPT-OSS-120B | 6.37 ± 0.76 | 71.44 ± 20.11 | 1837.10 ± 190.13 | 1.46 ± 0.15 | 62637 ± 49 | 49.80 ± 0.04 | 9.85 ± 0.95
GPT-OSS-20B | 5.79 ± 1.36 | 64.06 ± 28.23 | 1868.49 ± 119.69 | 1.48 ± 0.10 | 13601 ± 56 | 10.81 ± 0.045 | 12.41 ± 2.64

Note: CPU utilization is reported on a 0–100% normalized scale where 100% means the total CPU capacity of the system (all cores fully utilized).


  • On Jetson AGX Thor, the main difference between GPT-OSS-120B and GPT-OSS-20B is GPU memory footprint. The 120B model requires roughly 4.6× more VRAM (62,637 MiB vs 13,601 MiB), while CPU usage and system RAM remain low and broadly similar across both runs.


  • Throughput is broadly comparable, with a modest advantage for the smaller model in this workload (12.41 vs 9.85 tokens/s). Overall, the tradeoff is straightforward: the 120B model’s extra capacity comes at the cost of a substantially higher VRAM requirement rather than a large drop in throughput.
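
For context, the tokens/s figures above can be reproduced with a simple timing wrapper around generate(); the snippet below is one illustrative way to do it (the prompt and token budget are arbitrary), not the exact benchmarking harness behind the table.

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Summarize what an edge AI device is."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)

start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=256)
torch.cuda.synchronize()  # make sure all GPU work is finished before timing
elapsed = time.perf_counter() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.1f} s -> {new_tokens / elapsed:.2f} tokens/s")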


Need Help Taking Edge AI to Production?


Jetson AGX Thor makes it possible to run large open-weight models like GPT-OSS-20B and GPT-OSS-120B at the edge, but reaching production-grade performance and reliability requires more than just running a demo script. RidgeRun.ai specializes in optimizing on-device AI model inference, streamlining deployment, and building scalable edge AI solutions on NVIDIA Jetson platforms.


We can assist with model optimization and quantization, TensorRT integration and tuning, memory footprint reduction, performance benchmarking (tokens/s, latency, power), and efficient integration into robotics and real-time systems using DeepStream pipelines.

Whether you’re deploying GPT-OSS models or other AI models for robotics, industrial automation, smart assistants, or on-device analytics, our team can help you accelerate the path from prototype to production.


Contact RidgeRun’s AI Engineering Services to optimize your edge AI deployment and unlock the full performance of Jetson AGX Thor and beyond.


 
 
 
