
Introducing Juniper: How We Fine-Tuned a Small and Local Model for Function Calling

  • Writer: Daniela Brenes


A large language model performing a function call, as imagined by DALL·E

In recent years, Artificial Intelligence (AI) and Machine Learning have witnessed remarkable advancements across various fields, such as finance, medicine, and entertainment. One of the most transformative areas within AI is Natural Language Processing (NLP), which has led to the development of Large Language Models (LLMs). These models are trained on vast amounts of text data, allowing them to identify statistical patterns and relationships between words and phrases, improving their ability to understand and generate human language.


Some of the most well-known LLMs include GPT models by OpenAI, Llama models by Meta, and Gemini and Gemma models by Google. These models power a wide range of NLP tasks, such as text summarization, classification, translation, and sentiment analysis—capabilities you may have experienced firsthand when chatting with an AI-powered assistant. In this blog, we will focus on one specific capability: function calling.

Popular LLMs and example tasks they can achieve.

What Is Function Calling?

Function calling is a way for LLMs to interact with external tools and systems. Instead of just responding with text, an LLM can take a user's request and decide which function—from a set of predefined programming functions (also called an API, or Application Programming Interface)—best matches the request.


Once the right function is chosen, the model generates a structured answer with the necessary parameters, usually in a format like JSON, XML, or YAML. This allows the LLM to fetch real-time data, automate tasks, or interact with software applications in a more useful way.

Function calling diagram: the user request "Turn on the lights in the kitchen" and an API of three functions are fed to the LLM, which outputs the chosen call toggle_lights(room: "kitchen", enable: True).
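
To make the diagram concrete, here is a minimal sketch of the application side of this flow, assuming the model has already produced its structured answer. The toggle_lights function and the registry are hypothetical stand-ins for illustration, not part of any specific model:

import json

# Hypothetical application-side functions (the model never executes these itself).
def toggle_lights(room: str, enable: bool) -> None:
    print(f"Lights in {room}: {'on' if enable else 'off'}")

REGISTRY = {"toggle_lights": toggle_lights}

# Example of structured output the LLM might produce for the request above.
model_output = '{"function": "toggle_lights", "parameters": {"room": "kitchen", "enable": true}}'

call = json.loads(model_output)
REGISTRY[call["function"]](**call["parameters"])  # -> Lights in kitchen: on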

Why Create a Small and Local Model?

At RidgeRun.ai, we set out to develop our own small, locally-run function-calling model for several important reasons.


First, data privacy and security are major concerns when relying on third-party models. When using an external function-calling service, we must ask: is user data truly protected? Who has access to it? Vendor lock-in is another risk—if a company decides to discontinue its service, any project built around it could be left stranded.


Another key factor is internet dependency. Cloud-based models require a stable internet connection, which can introduce complexity and reliability issues. A locally hosted model, on the other hand, runs independently of the internet, ensuring consistent performance. And then, of course, there’s cost. Using third-party APIs can be expensive, especially at scale. Running a model locally is often far more cost-effective in the long run.


Finally, why build another function calling model? At RidgeRun.ai, we don’t just build machine learning solutions—we optimize them for deployment at the edge. You’ll find that existing function calling models typically range around 8 billion parameters, which starts getting a little too big for resource-constrained systems. Our model is barely 2 billion parameters. A small, efficient, locally-run model is ideal for resource-limited environments, such as embedded systems, where computing power and memory are constrained.


Fine-Tuning a Function Calling Model


Dataset Specification


Before training our model, we first needed to define our dataset specifications—essentially, the structure we would use for function calling.


We chose the JSON format for our function calls. Our decision was based on two key factors:

  • Efficiency: Our research, despite the limited literature available, pointed to JSON and YAML as two of the most token-efficient formats.

  • Reliability: While YAML is a little more token efficient, it is prone to indentation errors, which can be caused by inconsistent whitespace. JSON, on the other hand, avoids these pitfalls.


To further optimize efficiency, we decided to use minified JSON, which removes unnecessary spaces and line breaks, helping us save tokens.
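
In Python, for instance, minification is just a matter of serializer settings:

import json

call = {"function": "reset_device", "parameters": {}}

print(json.dumps(call, indent=2))                # pretty-printed, more tokens
print(json.dumps(call, separators=(",", ":")))   # minified: {"function":"reset_device","parameters":{}}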


Next, we defined the scope of our model's capabilities, specifying the types of functions it would support:

  • Naming conventions: function and parameter names follow snake_case

  • Number of parameters per function: 0 to 5

  • Supported parameter types: integer, float, boolean and string

  • Default or required fields: none.

  • Number of functions per API: 1 to 5, meaning the model can choose from a group of up to 5 functions at a time.
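
To make these constraints concrete, here is a small, hypothetical validation helper (our own illustration, not part of the training pipeline) that checks whether a function definition fits the scope above:

import re

SUPPORTED_TYPES = {"integer", "float", "boolean", "string"}
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")

def conforms_to_scope(function: dict) -> bool:
    """Check a single function definition against the scope defined above."""
    parameters = function.get("parameters", {})
    if not SNAKE_CASE.match(function["name"]):
        return False
    if len(parameters) > 5:          # 0 to 5 parameters per function
        return False
    return all(
        SNAKE_CASE.match(name) and spec["type"] in SUPPORTED_TYPES
        for name, spec in parameters.items()
    )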


Following this, we defined how each sample of our dataset would look. Remember how LLMs are trained on massive amounts of text? That’s exactly what our dataset consists of—text. The key difference is that our text data needs to be structured with function calling in mind.


At the end of the day, our function-calling model doesn’t actually know how to program or execute functions. Instead, it simply takes in the API definition and the user request as structured text. Then, based on its training, it outputs a function call, also in structured text.


With this in mind, we designed the following format for each dataset entry. To keep things simple, the example below includes only two functions.

### PROMPT:
Reset the device.
### API:
[
  {
    "name": "toggle_lights",
    "description": "Toggles the lights on/off.",
    "parameters": {
      "enable": {
        "type": "boolean",
        "description": "Whether to turn the lights on.",
      }
    }
  },
  {
    "name": "reset_device",
    "description": "Resets the device to factory settings.",
    "parameters": {}
  }
]
### OUTPUT:
{
  "function": "reset_device",
  "parameters": {}
}

To help the model clearly understand the structure, we introduced three key sections, each marked by a flag:

  • ### PROMPT: Represents the user’s request.

  • ### API: Defines the set of functions the model must choose from. Each function and its parameters include a description, helping the model understand their purpose and how they relate to the request.

  • ### OUTPUT: Contains the selected function name and the necessary parameters.
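
For illustration, here is roughly how such a sample could be assembled in code; build_sample is a hypothetical helper, and the minified JSON serialization matches the token-saving choice discussed earlier:

import json

def build_sample(user_request: str, api: list[dict], output: dict | None = None) -> str:
    """Assemble one sample in the ### PROMPT / ### API / ### OUTPUT format.

    At inference time, `output` is omitted and the model completes the text
    after the ### OUTPUT: flag.
    """
    text = (
        f"### PROMPT:\n{user_request}\n"
        f"### API:\n{json.dumps(api, separators=(',', ':'))}\n"
        f"### OUTPUT:\n"
    )
    if output is not None:
        text += json.dumps(output, separators=(",", ":"))
    return text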

Additionally, we included a special case for situations where none of the available functions are a good fit for the user’s request. In these cases, instead of forcing a match or guessing, the model should avoid calling any function at all. This helps prevent hallucinations and ensures the model only responds when it’s confident that a valid function applies.


With this in mind, we defined two possible outcomes for each sample:

  1. A valid function call – when a function from the API is selected.

  2. A null function call – when no suitable function exists, and the model returns null.


Here’s what the corresponding output looks like in the null case:

### OUTPUT:
{
  "function": null
}

Dataset Generation


Once we knew what kind of dataset we needed, the next step was to actually create it. For this, we got a big helping hand from one of OpenAI’s models—GPT-4o. At the time, it was the best choice for our use case because it followed instructions closely and made fewer mistakes in its responses.


Using a language model to generate data is incredibly useful, but it comes with its own set of challenges:

  • Prompting is an art. Getting the model to give you exactly what you want often takes a lot of trial and error.

  • Models make mistakes. Even the best ones sometimes produce incorrect or inconsistent examples, so we had to review the generated data carefully.

  • The dataset needs to be well-balanced. We also ran some analysis to make sure the data covered all the categories we cared about, without any group being over- or under-represented.


Let’s start with prompting. This might just be the most tedious (and weirdly artistic) part of working with LLMs. If you’ve ever tried writing prompts for an LLM, you’ll know how much impact a single word—or even a punctuation mark—can have. A slight change in tone or structure can completely alter the model’s output. Some prompts work beautifully with one model and fall flat with another, which is why a lot of experimentation is involved.


Now, here’s the system prompt we used for our project. It might look a bit old-school, because we built this before OpenAI released its Structured Outputs feature—which would’ve made things a lot easier when it comes to getting clean JSON.


As you can see, this prompt tells the model exactly what kind of output we want. It defines the structure, sets expectations, and even includes examples (not shown here for brevity). We also added some strict rules at the end. These came from lessons we learned the hard way—like when the model started skipping function or parameter descriptions. Adding those rules helped reduce those kinds of mistakes.


All in all, prompting isn’t a straightforward science. It takes time, creativity, and a whole lot of patience to get it right—but when it clicks, it makes everything else easier.


With our system prompt ready, we could finally start generating dataset samples. We used OpenAI’s Python API to write a script that fed GPT-4o our system prompt, along with specific instructions on how to create each sample through the user prompt (we’ll get into this a bit later). The script repeated this process until we reached our target of 1000 samples.

Some of the outputs had JSON syntax errors. When we couldn’t fix them, we simply discarded those samples and generated new ones to make up for the loss—ensuring we still hit our goal.
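
To give a feel for that script, here is a simplified sketch using the OpenAI Python client. The SYSTEM_PROMPT and USER_PROMPT placeholders stand in for the actual prompts (not reproduced here), and the real script also attempted to repair broken JSON before discarding a sample:

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = "..."  # placeholder: the full system prompt described above
USER_PROMPT = "..."    # placeholder: per-sample instructions (function count, parameters, types)

def generate_sample(system_prompt: str, user_prompt: str) -> dict | None:
    """Request one dataset sample from GPT-4o; keep it only if it parses as valid JSON."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    try:
        return json.loads(response.choices[0].message.content)
    except json.JSONDecodeError:
        return None  # unusable output: discard and request a replacement

samples = []
while len(samples) < 1000:
    sample = generate_sample(SYSTEM_PROMPT, USER_PROMPT)
    if sample is not None:
        samples.append(sample)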


Here’s a quick diagram that shows the process we followed to generate the dataset:

Dataset sample generation process by RidgeRun.ai: build the prompt, request a sample from the OpenAI API, check whether the sample is valid, and attempt to fix it if not.

Once we had all 1000 samples, we split the dataset into training, validation, and test sets. Then it was time for some Exploratory Data Analysis (EDA) to better understand what we were working with.
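
In code, the split itself is a simple shuffle and slice; the 80/10/10 ratio below is illustrative rather than the exact proportion we used:

import random

# `samples` holds the 1000 generated dataset entries from the previous step.
random.seed(42)                              # fixed seed so the split is reproducible
random.shuffle(samples)

n = len(samples)
train = samples[: int(0.8 * n)]              # training split
val = samples[int(0.8 * n): int(0.9 * n)]    # validation split
test = samples[int(0.9 * n):]                # test split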


Dataset Analysis


Before using any data to train a model, it’s important to understand what that data looks like. This step—called Exploratory Data Analysis (EDA)—helps us spot statistical patterns, catch errors, and get a clearer picture of how our dataset is structured.


Why does this matter? Let’s say you’re training a model to detect cats in images. If your dataset mostly contains pictures of black cats and only a few of other types, the model might struggle to recognize cats that aren’t black. That’s because the data is unbalanced, and the model ends up learning a skewed version of reality.

A model struggling to identify orange cats in images, as imagined by DALL·E.

That kind of bias is exactly what we want to avoid.

In our case, we weren’t dealing with cat photos—but our function calling dataset had its own variety to keep track of. For example, if most of the functions in our dataset use boolean parameters and only a few use floats, the model might underperform when handling those rarer float cases.


So, to make sure our dataset was balanced and representative, we focused on tracking a few key patterns:

  • A balanced number of functions per API in each dataset sample.

  • A balanced number of parameters across all functions.

  • A balanced distribution of parameter types (e.g., string, boolean, float)

Let’s rewind a bit to the dataset generation phase. To make sure our final dataset was balanced across all the key patterns we mentioned, we designed a simple algorithm that guided how the data was created right from the start.


Let’s break it down:


1. Balancing the number of functions per API


We had already defined that each API in our dataset could have between 1 and 5 functions. Since we wanted 1000 samples in total, we just divided them equally across those API sizes. That meant generating 200 samples for each possible size: 200 with 1 function, 200 with 2 functions, and so on, up to 5. Easy enough, right?


2. Balancing the number of parameters per function


Once we knew exactly how many functions would exist in our dataset (based on our API distribution), we moved on to balance the number of parameters each function would have. Just like with API sizes, we wanted an even spread—so we made sure that across all functions, there was a balanced count of functions with 0, 1, 2, 3, 4, and 5 parameters.


3. Balancing parameter types


Finally, with the total number of parameters across all functions now known, we tackled the last piece: parameter types. Our goal here was to ensure that types like string, boolean, and float were all well-represented in the dataset. So we evenly distributed parameter types across the full set of parameters to avoid type-related bias.


This step-by-step approach helped us create a dataset that wasn’t just large, but thoughtfully structured. And while it may sound a bit mechanical, it made a big difference in the final quality of the model’s training data.
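
The sketch below condenses that three-step idea into code. The helper and shuffling details are our own illustration, but the even spread over API sizes, parameter counts, and parameter types mirrors the process described above:

import random

TOTAL_SAMPLES = 1000
API_SIZES = [1, 2, 3, 4, 5]            # functions per API
PARAM_COUNTS = [0, 1, 2, 3, 4, 5]      # parameters per function
PARAM_TYPES = ["integer", "float", "boolean", "string"]

def spread_evenly(options: list, total: int) -> list:
    """Repeat each option total // len(options) times (plus a remainder), then shuffle."""
    base, extra = divmod(total, len(options))
    values = [option for option in options for _ in range(base)] + options[:extra]
    random.shuffle(values)
    return values

# Step 1: one API size per sample (200 samples per size for 1000 samples).
api_sizes = spread_evenly(API_SIZES, TOTAL_SAMPLES)

# Step 2: one parameter count per function, across every function in the dataset.
total_functions = sum(api_sizes)
param_counts = spread_evenly(PARAM_COUNTS, total_functions)

# Step 3: one type per parameter, across every parameter in the dataset.
total_parameters = sum(param_counts)
param_types = spread_evenly(PARAM_TYPES, total_parameters)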

Dataset balancing process by RidgeRun.ai: calculate the API size distribution, the total number of parameters, and the parameter type distribution, then shuffle the resulting assignments.

Now that we had all the quantities and distributions mapped out, we could define what each individual dataset sample should look like. Each time we requested a new sample from the OpenAI API, the user prompt gave GPT-4o clear instructions on what to generate: how many functions the API should contain, how many parameters each function should have, and what types those parameters should be.


Finally, we ran a thorough analysis on the training, validation, and test splits using bar charts. The goal here was to make sure that each split—not just the dataset as a whole—was balanced across all the key metrics. This gave us confidence that every stage of model training and evaluation would be based on well-distributed data.
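
As an example, a quick parameter-type histogram per split can be produced with a few lines of matplotlib; the sample layout assumed here (an "api" list whose functions hold a "parameters" dictionary) is illustrative rather than our exact schema:

from collections import Counter
import matplotlib.pyplot as plt

def plot_parameter_types(samples: list[dict], title: str) -> None:
    """Bar chart of parameter types in one dataset split (assumed sample layout)."""
    counts = Counter(
        spec["type"]
        for sample in samples
        for function in sample["api"]                # assumed: list of function definitions
        for spec in function["parameters"].values()
    )
    plt.bar(list(counts.keys()), list(counts.values()))
    plt.title(title)
    plt.ylabel("Number of parameters")
    plt.show()

plot_parameter_types(train, "Training split: parameter types")  # `train` from the earlier split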

Exploratory Data Analysis of the dataset splits, as imagined by DALL·E.

Take a look at our EDA plots for our training dataset split:


Model Training


Then came the time to actually train our model. For this, we chose Gemma-2-2B as our base model. To reduce memory requirements and speed up training, we used its 4-bit quantized version.


We fine-tuned the model using Unsloth, a framework designed for fast fine-tuning through clever mathematical optimizations. (If you're curious about the full process, check out our tutorial blog on fine-tuning LLMs—we go much deeper into the technical details there!)


For the scope of this blog, we’ll keep it high-level.


Imagine an LLM that has been pre-trained on a wide range of topics. Fine-tuning is the process of adapting that model for a more specific task—in our case, function calling. This is done by continuing the training process, but this time using a smaller dataset focused on our specific domain. The model’s parameters are adjusted to better handle this narrower scope.

Pre-training and fine-tuning, by Dan Jurafsky and James H. Martin: pre-training data produces a pre-trained LM, which fine-tuning data then adapts into a fine-tuned LM.

We also used LoRA (Low-Rank Adaptation) optimization, specifically the QLoRA variant, which allows efficient fine-tuning on quantized models. Again, more on that in our fine-tuning tutorial blog if you want to explore!
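
If you want a starting point, the sketch below shows the typical shape of a QLoRA fine-tune with Unsloth and TRL's SFTTrainer. The hyperparameters, model identifier, and dataset variables are illustrative (and trl/transformers argument names shift between versions); they are not Juniper's exact training configuration, which we cover in the fine-tuning tutorial:

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

# Load a 4-bit quantized Gemma-2-2B base model.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-2-2b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters (QLoRA: low-rank adapters on top of a 4-bit base model).
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# train_dataset / val_dataset: Hugging Face datasets with a "text" column holding
# the ### PROMPT / ### API / ### OUTPUT samples described earlier.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="juniper",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        learning_rate=2e-4,
        evaluation_strategy="epoch",
        save_strategy="epoch",
    ),
)
trainer.train()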

Below, you can see the training and validation loss curves for our model. We trained it for a total of three epochs. Notice how the training and validation losses start to diverge—this is a classic sign of overfitting, where the model starts memorizing the training data rather than generalizing from it. That’s why we chose to use the model weights from the end of the second epoch, where the validation loss was at its lowest.

Training and validation loss curves for the Juniper function-calling model.

Psst—want to learn how to automatically generate training plots like the one above, or track different machine learning experiments? Check out our tutorials on DVC (Data Version Control)!


Model Testing


The final step in our journey was evaluating the newly fine-tuned Juniper model. For a task like function calling, relying solely on training and validation loss isn’t enough. We needed more robust metrics to truly understand how well the model was performing.


That’s why we defined clear criteria for True Positives, True Negatives, False Positives, and False Negatives—tailored specifically to our use case. In this context, the label refers to the ground truth output in each dataset sample.


  • True Positive: the label is a valid function call, and the prediction is a valid function call matching the same function name, with all parameter names and values correct.

  • True Negative: the label is a null function call, and the prediction is also a null function call.

  • False Positive:

      • Case 1: the label is a null function call, but the model predicted a valid function call.

      • Case 2: the label is a valid function call, but the prediction has any mismatch, such as:

          • Incorrect function name

          • Incorrect parameter names

          • Incorrect parameter values

  • False Negative: the label is a valid function call, but the model predicted a null function call.

With our evaluation definitions in place, we calculated the classic Machine Learning performance metrics: Accuracy, Precision, Recall, and F1 Score.
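
Counting TP, TN, FP, and FN over the test split, the metrics follow the standard definitions, for example:

def compute_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, precision, recall, and F1 from the counts defined above."""
    precision = tp / (tp + fp)   # how many predicted calls were fully correct
    recall = tp / (tp + fn)      # how many valid calls the model actually produced
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }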


We used these metrics to benchmark our fine-tuned Juniper model against three other models: Qwen 2.5-3B, Llama 3.1-8B, and GPT-4o. For a fair comparison, the Qwen and Llama models were tested using their 4-bit quantized versions, just like Juniper, which is based on the 4-bit quantized Gemma-2-2B model.


Here are the results from evaluating all four models on our test split. Despite being smaller in size, Juniper achieved equal or higher scores across all metrics compared to the larger models—showing that thoughtful dataset design and fine-tuning can go a long way.


Test Dataset Split

Model          Accuracy   F1-Score   Precision   Recall
Juniper        0.95       0.969      0.940       1.0
Llama-3.1-8b   0.77       0.859      0.753       1.0
Qwen-2.5-3b    0.69       0.786      0.864       0.721
GPT-4o         0.89       0.933      0.875       1.0

To further validate our results, we tested all models on an external benchmark dataset sourced from the Berkeley Function Calling Leaderboard, which evaluates the ability of different LLMs to correctly perform function calls through a series of tests. Specifically, we combined both the simple and multiple function calling datasets and used them to benchmark the models.


Berkeley Simple & Multiple Datasets*

Model          Accuracy   F1-Score   Precision   Recall
Juniper        0.682      0.811      0.689       0.985
Llama-3.1-8b   0.488      0.656      0.488       1.0
Qwen-2.5-3b    0.478      0.647      0.512       0.877
GPT-4o         0.556      0.714      0.557       0.995

*The simple and multiple datasets were modified to remove samples where parameter types did not conform to Juniper’s supported types: integer, float, string, and boolean. Additionally, function and parameter names were adjusted to follow the snake case naming convention, consistent with Juniper’s training.


Once again, Juniper delivered on-par performance, standing shoulder to shoulder with much larger models. This confirmed that our pipeline—from dataset generation to fine-tuning—was successful in producing a model that truly understands function calling.


In the end, this project is a testament to how targeted fine-tuning and carefully designed datasets can transform a general-purpose LLM into a domain expert.


If you’re looking to take things a step further, why not empower your model with real-world tools? Imagine pairing your fine-tuned function-calling model with an infrastructure that lets it actually execute tasks. We’ve written a hands-on guide to building your own MCP server in Python—check it out to see how you can turn a smart model into a capable one.


Ready to Take Your Project to the Next Level? Contact Us!


In this blog, we explored fine-tuning and how it can transform general-purpose LLMs into domain-specific experts.


Do you need help with your AI project? RidgeRun.ai offers expert consulting services for video, audio and natural language processing projects. Reach out to us at contactus@ridgerun.ai and let’s start planning your project!

