How to Run OpenAI's GPT OSS 120b on NVIDIA DGX Spark
- Michael Gruner
What is OpenAI GPT OSS?
OpenAI GPT OSS is OpenAI’s first open-weight language model family since GPT-2, released under the Apache 2.0 license. This is a commercially friendly license that allows you to use, modify, and distribute the software as long as you include the required copyright notices! (not legal advice)
In the OpenAI announcement, the company introduces two models: GPT OSS 120b with 117 billion parameters and GPT OSS 20b with 21 billion parameters. Both variants employ a Transformer architecture with a 128k context window, support Chain of Thought (CoT) reasoning, and implement a Mixture of Experts (MoE) design to reduce the number of active parameters during inference. The 120b model activates about 5.1 billion parameters per token, while the 20b activates about 3.6 billion. A real powerhouse packed in a "small" container!
GPT OSS models run flawlessly using the HuggingFace Transformers framework. This means that you can reuse the code you've written for other models with minimal modifications! And they are nicely optimized as well. According to the very detailed HF blog post, the 20b version requires a GPU with 16 GB of VRAM, while the 120b version needs 80 GB.
While a 16 GB GPU is considered consumer grade, having 80 GB of available VRAM is not that common. Fortunately, there is the NVIDIA DGX Spark, which has 128 GB of unified memory!
In this post, we'll guide you on how to run the full GPT OSS 120b on the DGX Spark.

Hands on NVIDIA DGX Spark
NVIDIA DGX Spark is a compact desktop AI supercomputer built around the GB10 Grace Blackwell super chip. It offers:
Up to 1 PFLOP of FP4 compute
128 GB unified memory
1 TB or 4 TB NVMe storage
Price point around $3k–$4k
These characteristics make it ideal for prototyping, fine-tuning, and inference on models of up to 200 billion parameters, a perfect fit for our GPT OSS 120b.
You can read more in our DGX Spark FAQ.
Running GPT OSS on the DGX Spark
This guide uses the code from the HF blog with some required modifications. We're going to use UV, Python's blazing fast package/project manager.
Install UV if you haven't already, following the instructions in our tutorial.
Create a script named gptoss.py with the following content:
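Here's a minimal sketch of that script, based on the standard Transformers generation API and consistent with the full chat version at the end of this post (your exact script may differ slightly):
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-120b"

# Load the tokenizer and the model; device_map="auto" spreads the weights over
# the available memory and dtype="auto" keeps the checkpoint's native precision
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    dtype="auto",
)

# A single question, rendered through the chat template
messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

# Generate an answer and print the raw, Harmony-formatted output
generated = model.generate(**inputs, max_new_tokens=1000)
print(tokenizer.decode(generated[0]))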
You'll notice several important details:
We're using standard Transformers code.
We're using the 120b version of the model (the largest one).
This just asks a single question and exits. We'll improve on this later!
Make sure Python 3.13 is installed on your system. Don't worry! This won't break anything.
uv python install 3.13
Add PyTorch for CUDA 13.0 as a dependency:
uv add --script gptoss.py --index https://download.pytorch.org/whl/cu130 --explicit torch
You'll notice that this doesn't install anything yet. It just adds a special header to your script with the dependency. It will be installed when you run the script with uv.
Install the remainder of the dependencies:
uv add --script gptoss.py transformers kernels accelerate "triton>=3.4"
Again, if you study the script header you'll notice something like:
# /// script
# requires-python = ">=3.13"
# dependencies = [
# "torch",
# "accelerate",
# "kernels",
# "transformers",
# "triton>=3.4",
# ]
#
# [[tool.uv.index]]
# url = "https://download.pytorch.org/whl/cu130"
# ///
This instructs uv to create a temporary virtual environment when you run the script.
Finally, run the script!
TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas uv run ./gptoss.py
Note the TRITON_PTXAS_PATH variable; it is needed so that PyTorch and Triton build kernels for the correct GPU capabilities.
This will take some time because:
It will install dependencies
It will download the model
Be patient! Eventually, you'll get something like:
`torch_dtype` is deprecated! Use `dtype` instead!
/home/mgruner/.cache/uv/environments-v2/gptoss-8e9ea5d4b5dfe3b5/lib/python3.13/site-packages/torch/cuda/__init__.py:283: UserWarning:
Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
Minimum and Maximum cuda capability supported by this version of PyTorch is
(8.0) - (12.0)
warnings.warn(
Fetching 41 files: 100%|████████████████████████████████████████████████| 41/41 [00:00<00:00, 20859.59it/s]
Fetching 41 files: 100%|█████████████████████████████████████████████████| 41/41 [00:00<00:00, 31987.81it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████| 15/15 [06:27<00:00, 25.82s/it]
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-11-07
Reasoning: medium
# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>user<|message|>How many rs are in the word 'strawberry'?<|end|><|start|>assistant<|channel|>analysis<|message|>The user asks: "How many rs are in the word 'strawberry'?" The word 'strawberry' letters: s t r a w b e r r y. They want count of "rs". Count of letter R: there are two 'r's: the third letter r in "str" and also r in the second to last "r". Actually the word "strawberry": s t r a w b e r r y. That has r at position 3 and then r at positions 8 and 9? Actually let's write: s(1) t(2) r(3) a(4) w(5) b(6) e(7) r(8) r(9) y(10). So r occurs at positions 3, 8, 9. So total 3 r's? Let's double-check: "strawberry" the typical string has 10 letters. Yes, "strawberry" ends with "rry", two r's before y. So indeed 3 r's. So answer: 3.
Potential twist: "How many rs are in the word 'strawberry'?" Could they be asking how many times the two-letter combination "rs" appears? The two-letter combination "rs" appears as letters "rs"? In 'strawberry', there's 'rs' sequence? No, there's no 'rs' adjacent sequence. So maybe they ask letter count. Usually these riddles ask "How many Rs are in the word 'strawberry'?" answer: 3. There's a common riddle: "How many r's are in the word 'stranger'?" etc. So answer: 3. I must be careful to mention it's 3.
Thus my answer should be: 3.<|end|><|start|>assistant<|channel|>final<|message|>There are **three** ‘r’ letters in the word **strawberry**.<|return|>
It's talking! But what's that weird answer? Let's analyze it.
GPT OSS Output Format
This output format is the OpenAI Harmony training format. Each interaction is formed by:
<|start|>ROLE<|message|>MESSAGE<|end|>
Roles include system, assistant and user. In the case of the assistant role, the message is delivered through different channels, as follows:
<|start|>assistant<|channel|>CHANNEL<|message|>MESSAGE<|end|>
The channel can be analysis (for the chain of thought) or final (for the message that should be shown to the user).
The very last interaction ends in a <|return|> instead of an <|end|>, just to indicate the decoding stop. When feeding this text back into the conversation history, you must replace it with <|end|>. Make sure to read the Harmony blog post, as the format covers more than what we show here.
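If you ever manipulate the raw text yourself, that swap is a one-liner (raw_turn is just a hypothetical variable holding a decoded assistant turn):
raw_turn = "<|start|>assistant<|channel|>final<|message|>Hi there!<|return|>"
# Turn the decoding stop token back into the regular end-of-message token
raw_turn = raw_turn.replace("<|return|>", "<|end|>")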
Fortunately, HuggingFace already handles most of this format for us through the tokenizer. From your perspective you just handle the conversation history as usual:
messages = [
{ "role": "user", "content": "How many rs are in the word 'strawberry'?" },
{ "role": "assistant", "content": "There are **three** ‘r’ letters in the word **strawberry**." }
]
This means that we just need to extract the final channel from the assistant message and we are ready to build a chat app!
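If you're curious, you can double-check what the tokenizer produces by rendering the template as text instead of token IDs (a quick sanity check using the tokenizer and messages from above):
# Render the conversation in the Harmony format without tokenizing it
rendered = tokenizer.apply_chat_template(messages, tokenize=False)
print(rendered)
You should see the same <|start|>ROLE<|message|>... structure as in the raw output above.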
A Chat Example
A simple RegEx can help us parse the assistant CoT and final message. Here's the relevant code:
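The snippet below mirrors the extraction step of the full listing further down; response is assumed to hold the decoded model output from the previous section:
import re

def extract(pattern, text):
    # Return the first capture group, or None if the pattern is not found
    _match = re.search(pattern, text, re.S)
    return _match.group(1).strip() if _match else None

# The chain of thought lives in the analysis channel, the answer in the final channel
analysis = extract(r"<\|channel\|>analysis<\|message\|>(.*?)<\|end\|>", response)
message = extract(r"<\|channel\|>final<\|message\|>(.*?)<\|return\|>", response)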
You can show the analysis to the user (though OpenAI recommends against it), but you shouldn't feed it back to the model. That leaves us with the update step: only the final message goes back into the conversation history.
Below you'll see the full code of a simple conversation loop.
Full code
# /// script
# requires-python = ">=3.13"
# dependencies = [
# "torch",
# "accelerate",
# "kernels",
# "transformers",
# "triton>=3.4",
# ]
#
# [[tool.uv.index]]
# url = "https://download.pytorch.org/whl/cu130"
# ///
from transformers import AutoModelForCausalLM, AutoTokenizer
import re

model_id = "openai/gpt-oss-120b"

# Load the tokenizer and the model; device_map="auto" spreads the weights over
# the available memory
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    dtype="auto",
)


def extract(pattern, text):
    # Return the first capture group, or None if the pattern is not found
    _match = re.search(pattern, text, re.S)
    return _match.group(1).strip() if _match else None


messages = []
while True:
    user_input = input("User: ")
    if user_input.lower() in {"exit", "quit"}:
        break

    messages.append({"role": "user", "content": user_input})

    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)

    generated = model.generate(**inputs, max_new_tokens=500)
    response = tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]).strip()

    # Parse the Harmony channels out of the raw response
    analysis = extract(r"<\|channel\|>analysis<\|message\|>(.*?)<\|end\|>", response)
    message = extract(r"<\|channel\|>final<\|message\|>(.*?)<\|return\|>", response)

    if not analysis or not message:
        print(f"System: \033[31mNot enough output tokens to generate a well-formed answer. Here's the raw truncated response:\n{response}\033[0m")
        messages.append({"role": "system", "content": "Answer not presented to the user because it was too long and was truncated. Keep your answers shorter."})
        continue

    print(f"Analysis: \033[33m{analysis}\033[0m")
    print(f"Assistant: \033[37m{message}\033[0m")

    # Only the final channel goes back into the conversation history
    messages.append({"role": "assistant", "content": message})

And that's it! You can iterate and have coherent conversations with the model.
Fetching 41 files: 100%|████████████████████████████████████████████████| 41/41 [00:00<00:00, 20859.59it/s]
Fetching 41 files: 100%|█████████████████████████████████████████████████| 41/41 [00:00<00:00, 31987.81it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████| 15/15 [06:27<00:00, 25.82s/it]
User: Hey! Who are you?
Analysis: We need to respond as ChatGPT, following the system prompt. The user says: "Hey! Who are you?" We should respond with a friendly introduction, mention we are an AI language model, etc. Should follow style: friendly, helpful. Might ask if they need assistance. Also incorporate any guidelines? No disallowed content. Just a simple response. Should we ask for clarification of what they need? Probably.
We'll respond: "Hey! I'm ChatGPT, an AI..." etc...
Assistant: Hey there! I’m ChatGPT — an AI language model created by OpenAI. I’m here to chat, answer questions, brainstorm ideas, help troubleshoot problems, or just keep you company. What’s on your mind today?
User: I'm Michael
Analysis: User says "I'm Michael". Likely they want to have conversation. Respond friendly, ask about them, maybe offer assistance.
Assistant: Nice to meet you, Michael! 👋 How’s your day going? Anything you’d like to chat about, get help with, or just bounce ideas around? I'm all ears.
User:
Final Remarks
The NVIDIA DGX Spark is truly a powerful machine in a compact form factor. Finally having enough memory to run large models on an affordable desktop machine is amazing.
OpenAI's GPT OSS is an incredible model to be able to run ourselves. Not only does it run locally and privately, but it is also open and available under a commercially friendly license!
Do you need help adding AI to your product? Let's talk!
