Function Calling with the llama3 Model

A case study of designing and fine-tuning the llama3 model to support function calling

By Michael Hu
July 6, 2024 11:25 pm
20 min read

Function calling with large language models (LLMs) holds immense potential, as it enables us to connect LLMs to external tools, thereby enabling more real-world business use cases. OpenAI's GPT models, such as GPT-3.5-turbo, GPT-4, and GPT-4o, all support function calling. If you're new to function calling, we suggest going through the Function calling guide in the OpenAI API documentation, which contains examples of how to use function calling with GPT models.

However, our interest in this post is not in how to use function calling with GPT models; we want to discuss how to enable open-source LLMs like llama3 to do function calling. If you have followed the topic, you have probably noticed that llama3 does not support function calling out of the box. This is a significant limitation when adapting the model for serious applications.

The sample code for the project can be found on GitHub at Llama3-FunctionCalling.

Background

To give you an example, suppose we want to adapt an LLM to improve an internal IT workflow. Specifically, we want to improve the ticket management process. This is a very common requirement for large organizations because they often face the following challenges:

  • Lots of manual work to create tickets: For large organizations, IT ticketing systems often have very complex UIs. It's very common to have 20-30 input fields that need to be filled in when users try to submit a ticket. This takes a lot of manual work and isn't very intuitive.

  • Duplicate tickets: IT support staff often have to deal with duplicate tickets, as end users or customers are often unaware that an existing solution already exists for the problem. The traditional ticketing system lacks the feature of checking for such duplicates when users submit a ticket.

  • Low-quality tickets: When we say 'low-quality', we mean a lack of detailed descriptions of the issue. This makes troubleshooting very challenging, as IT support staff often need to first get in touch with the user to discuss the issue, which is often time-consuming.

With an LLM, we can effectively address the above challenges by implementing the following solutions:

  • Intuitive UI for creating tickets: With an LLM, we can provide an intuitive UI where end users submit tickets in natural language. It almost feels like the end user is chatting with real IT support staff, so there's no need to fill out a complex form. This saves a lot of time and reduces human error.

  • Avoid creating duplicate tickets: The LLM can check whether the request matches an existing ticket or known issue and provide reference solutions directly, preventing duplicate tickets from being created in the system.

  • Improved ticket quality: The LLM can also improve the quality of the ticket content, for example, by fixing grammar errors, generating meaningful titles, summarizing issue descriptions, and even providing some initial analysis.

Once we have all this information, we're ready to create the ticket in the IT ticketing system. This is where we often need function calling. Of course, we can do it without function calling by first asking the LLM to generate a JSON object representation of the ticket and then manually submitting the ticket. But with function calling, it's more natural and involves less manual work.

The Solution Design

While function calling with LLMs holds immense potential, designing it is often a process of trial and error, and best practices are often closely guarded secrets. There are some small projects on the internet that discuss how to fine-tune open-source LLMs to do function calling, but these are what we call "demo" projects, since they are for personal study only and not ready for large-scale production environments.

Unfortunately, there's no official guide or best practice we can follow at the moment, as these details are kept secret by OpenAI. However, after playing around with GPT-4 and asking it to reveal its system prompt and related details, we soon pieced together what appears to be OpenAI's approach (assuming GPT-4 was telling us the truth). So we will follow this "OpenAI best practice" to design the structure for function calling.

At a very high level, designing function calling with LLM often involves the following elements:

  • Function Metadata: We need a way to tell the LLM what functions are available. It should contain information like what the function does, what the input arguments are, and the data type of the arguments.

  • LLM-initiated function calls: We need the LLM to be able to generate a specific response structure to trigger function calls. We say trigger because it's not the LLM's job to actually call the function; that is done by the client.

  • Provide tool results to LLM: Finally, we need a way to provide the response from the function call to the LLM, so it can generate the correct response using these results.

We will now discuss each of these elements in detail.

Function Metadata

In order for the model to generate function calls, we need to tell the model what functions it can use. This is often referred to as passing the function metadata to the model. Typically, this function metadata is injected into the system prompt, where we can combine it with other system instructions. For example, we can start with a Tools section in the system prompt and introduce a namespace concept, which could be useful if our model needs to support tools other than functions. Here's a typical example of the system prompt for functions, adapted from OpenAI's best practice:

# Tools

## functions

namespace functions {

function 1...

function 2...

} // namespace functions

Next, we need to define a structure for the function metadata. It should contain at least a name, a description, and a list of arguments. For example, here is a simple function for getting the current weather in a given location, adapted from the OpenAI API documentation:

{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA"
                },
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }
    }
}

One could directly inject the raw JSON string of the above function metadata into the system prompt. However, the problem with this approach is that it takes up too much space (in terms of tokens) in the system prompt. If we have more complex or multiple functions, the size grows even further. To address this issue, we adopt the practice observed from OpenAI, where we first serialize the function metadata into a compact TypeScript-like definition. Note that a "?" following an argument name indicates an optional argument.

This serialization achieves roughly a 45% token reduction. For example, for the same get-weather function, we can construct the function string as follows; it takes only 51 tokens instead of 96.

// Get the current weather in a given location
type get_current_weather = (_: {
// The city and state, e.g. San Francisco, CA
location: string,
unit?: "celsius" | "fahrenheit",
}) => any;

So now the system prompt with the `get_current_weather` function would become:

# Tools

## functions

namespace functions {

// Get the current weather in a given location
type get_current_weather = (_: {
// The city and state, e.g. San Francisco, CA
location: string,
unit?: "celsius" | "fahrenheit",
}) => any;

} // namespace functions

In practice, we often need to add additional text descriptions to tell the model how and when to utilize these functions. A more comprehensive example is provided inside the prompt.py module on GitHub.
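To make the serialization step concrete, below is a minimal Python sketch of how one might convert the JSON function metadata into the compact TypeScript-like form shown above. The helper names (serialize_function, build_tools_prompt) are our own, and the type mapping only covers the schema fields used in this post; it is not the exact code used in prompt.py.

def serialize_function(func: dict) -> str:
    # Convert one OpenAI-style function metadata object into the compact
    # TypeScript-like definition shown above.
    fn = func["function"]
    lines = [f"// {fn['description']}", f"type {fn['name']} = (_: {{"]
    params = fn.get("parameters", {})
    required = set(params.get("required", []))
    for name, spec in params.get("properties", {}).items():
        if "description" in spec:
            lines.append(f"// {spec['description']}")
        if "enum" in spec:
            ts_type = " | ".join(f'"{v}"' for v in spec["enum"])
        else:
            ts_type = {"string": "string", "number": "number",
                       "integer": "number", "boolean": "boolean"}.get(spec.get("type"), "any")
        optional = "" if name in required else "?"
        lines.append(f"{name}{optional}: {ts_type},")
    lines.append("}) => any;")
    return "\n".join(lines)


def build_tools_prompt(functions: list[dict]) -> str:
    # Wrap all serialized functions inside the `functions` namespace section.
    body = "\n\n".join(serialize_function(f) for f in functions)
    return ("# Tools\n\n## functions\n\nnamespace functions {\n\n"
            + body + "\n\n} // namespace functions")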

Initiate Function Calls

Now we know how to add function metadata to the prompt. Next, we will design a structure for the LLM to initiate function calls. Ideally, the generated structure should be a JSON object, because we need to be able to easily parse the LLM's response in our API design. We also want to support initiating multiple function calls in a single response, and it should be easy to support other tools in the future if required.

After carefully reviewing OpenAI's responses, we ended up with this JSON structure:

{'tool_uses': [{'recipient_name': 'functions.xxxx', 'parameters': {...}}]}

More specifically, the JSON object's 'tool_uses' key contains a list of tool calls, each with the following structure:

  • recipient_name: the name of the tool, with the namespace as part of the name; in our case, the get-weather function would be, for example, functions.get_current_weather

  • parameters: a dictionary containing the arguments for the tool named in recipient_name.

This design has the following benefits:

  • It's a valid JSON structure with a specific key, which is easy to parse in the API design

  • It's clean and simple, which helps reduce token usage

  • We can initiate multiple function calls in a single response

  • We can support other tools in the future, as we only need to add a new namespace
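To make the parsing concrete, here is a minimal sketch of how a client might handle this structure. The function name parse_tool_uses is our own; since the example responses in this post are printed with Python-style single quotes for readability, the sketch falls back to ast.literal_eval when strict JSON parsing fails.

import ast
import json

def parse_tool_uses(assistant_content: str) -> list[dict]:
    # Parse the {'tool_uses': [...]} structure generated by the model.
    try:
        payload = json.loads(assistant_content)
    except json.JSONDecodeError:
        # Handle the single-quoted, Python-dict-style form used in the examples.
        payload = ast.literal_eval(assistant_content)
    calls = []
    for tool_use in payload.get("tool_uses", []):
        namespace, _, name = tool_use["recipient_name"].partition(".")
        calls.append({
            "namespace": namespace,                      # e.g. "functions"
            "name": name,                                # e.g. "get_current_weather"
            "arguments": tool_use.get("parameters", {}),
        })
    return calls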

Function Call Results

Once the client has received these function-calling instructions, it should call the functions and provide the results to the LLM. We decided to use a straightforward design that closely follows the structure used for initiating function calls.

In our case, the tool response is just a list of JSON objects containing the results of the function calls made on the client side. Similar to the above design, this is clean and simple and helps reduce token usage. It can also handle the results of multiple function calls in a single response.

[{'location': 'San Francisco', 'temperature': '72', 'unit': None}, {'location': 'Tokyo', 'temperature': '10', 'unit': None}]

In addition to the 'user' and 'assistant' roles in a normal chat history, we introduce a new role called 'tool', which represents the results from calling the corresponding tools. For example, the above response can be added to the chat history for the llama3 model like this:

{'role': 'tool', 'content': [{'location': 'San Francisco', 'temperature': '72', 'unit': None}, {'location': 'Tokyo', 'temperature': '10', 'unit': None}]}
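Putting the pieces together, the client-side round trip might look like the sketch below: execute each requested function, collect the results, and append them to the chat history as a single 'tool' message. The function registry and the stubbed get_current_weather implementation are hypothetical placeholders, not part of the project's actual code.

def get_current_weather(location: str, unit: str | None = None) -> dict:
    # Hypothetical stub; a real client would call an actual weather API here.
    return {"location": location, "temperature": "72", "unit": unit}

FUNCTION_REGISTRY = {"get_current_weather": get_current_weather}

def run_tool_calls(messages: list[dict], assistant_content: str) -> list[dict]:
    # Execute every call requested by the model (parse_tool_uses is defined in
    # the earlier sketch) and append the results as one 'tool' message.
    results = []
    for call in parse_tool_uses(assistant_content):
        func = FUNCTION_REGISTRY[call["name"]]
        results.append(func(**call["arguments"]))
    messages.append({"role": "tool", "content": results})
    return messages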

Build the Fine-tuning Dataset

Now that we have a clear picture of what the prompt and response look like, it's time to discuss the dataset for fine-tuning function calling. We need to consider different cases and scenarios, which makes this more complex than regular fine-tuning. Using GPT-4 to generate training samples can speed up the dataset collection process; however, we found that human validation is still preferred to ensure high-quality training samples.

Please note that the following example chats are simplified for readability (e.g., some system prompt content is hidden); the actual content may be more complex. We also only include the most common cases and scenarios; in a real-world business case, there may be more cases that need to be handled carefully.

One last thing to mention: we always end the chat with the "assistant" role, because that last assistant response is our training target.
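For illustration, here is a minimal sketch of how one chat sample could be turned into a training example with the loss restricted to that final assistant turn. It assumes the standard llama3 chat markers (<|begin_of_text|>, <|start_header_id|>, <|eot_id|>); the exact special tokens and masking logic should match the tokenizer and trainer you actually use.

IGNORE_INDEX = -100  # the usual label value ignored by the cross-entropy loss

def render_message(role: str, content) -> str:
    # Assumes the standard llama3 chat markers; adjust to your tokenizer.
    return f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>"

def build_training_example(tokenizer, messages: list[dict]) -> dict:
    assert messages[-1]["role"] == "assistant", "last turn must be the training target"
    prompt = "<|begin_of_text|>" + "".join(
        render_message(m["role"], m["content"]) for m in messages[:-1]
    )
    target = render_message("assistant", messages[-1]["content"])
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    target_ids = tokenizer(target, add_special_tokens=False)["input_ids"]
    return {
        "input_ids": prompt_ids + target_ids,
        # Mask everything except the final assistant response.
        "labels": [IGNORE_INDEX] * len(prompt_ids) + target_ids,
    }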

Initiate function calls

In a "prefect case", the LLM model has everything it needs to initiate a function call. That is, the system prompt has the serialized function metadata, and the user has asked a question that the provided function is capable of resolve.

An example of such a case looks like this: we have the serialized "get_current_weather" function metadata in the system prompt, the user is asking a question about the weather, and the user has provided the required information (the city name) for the function's required input argument. So the LLM should be able to initiate a function call.

[{
  "role": "system",
  "content": "# Tools

    ## functions

    namespace functions {

    // Get the current weather in a given location
    type get_current_weather = (_: {
    // The city and state, e.g. San Francisco, CA
    location: string,
    unit?: "celsius" | "fahrenheit",
    }) => any;

    } // namespace functions"
},
{
  "role": "user",
  "content":
    "Hi, what's the weather like in San Francisco?",
},
{
  "role": "assistant",
  "content":
    "{'tool_uses': [{'recipient_name': 'functions.get_current_weather', 'parameters': {'location': 'San Francisco'}}]}",
}]

Answer questions based on results from function calls

Once the client has received the function call instructions and the calls complete successfully, the results of these function calls are added to the chat history. In this case, the LLM should be able to generate the correct answer based on those results.

An example of such a case looks like this: the function call results have been provided to the LLM in the chat history, so the LLM should be able to answer the user's initial question.

[{
  "role": "system",
  "content": "# Tools

    ## functions

    namespace functions {

    // Get the current weather in a given location
    type get_current_weather = (_: {
    // The city and state, e.g. San Francisco, CA
    location: string,
    unit?: "celsius" | "fahrenheit",
    }) => any;

    } // namespace functions"
},
{
  "role": "user",
  "content": "Hi, what's the weather like in San Francisco?",
},
{
  "role": "assistant",
  "content": "{'tool_uses': [{'recipient_name': 'functions.get_current_weather', 'parameters': {'location': 'San Francisco'}}]}",
},
{
  "role": "tool",
  "content": "[{'location': 'San Francisco', 'temperature': '72', 'unit': 'fahrenheit'}]",
},
{
  "role": "assistant",
  "content": "The current weather in San Francisco is 72°F.",
}]

Initiate parallel function calls

Similarly, the LLM should be able to initiate parallel function calls in a single response. As shown in the following example, the user is asking for weather information about two cities (San Francisco and Tokyo); upon processing this query, the LLM should generate a single response that initiates one function call for each city.

We found this case to be very important: without specifically including these samples in the training dataset, the LLM fails to initiate parallel function calls.

[{
  "role": "system",
  "content": "# Tools

    ## functions

    namespace functions {

    // Get the current weather in a given location
    type get_current_weather = (_: {
    // The city and state, e.g. San Francisco, CA
    location: string,
    unit?: "celsius" | "fahrenheit",
    }) => any;

    } // namespace functions"
},
{
  "role": "user",
  "content":
    "Hi, what's the weather like in San Francisco and Tokyo?",
},
{
  "role": "assistant",
  "content":
    "{'tool_uses': [{'recipient_name': 'functions.get_current_weather', 'parameters': {'location': 'San Francisco'}}, {'recipient_name': 'functions.get_current_weather', 'parameters': {'location': 'Tokyo'}}]}",
}]

Ask follow-up questions

In some cases, the LLM has the serialized function metadata in the system prompt but still needs more information to initiate a function call. This could be because the user didn't provide values for the function's 'required' input arguments, or did not clearly state the question.

An example of such a case looks like this: the user didn't provide the location, which is a required argument for the "get_current_weather" function.

[{
  "role": "system",
  "content": "# Tools

    ## functions

    namespace functions {

    // Get the current weather in a given location
    type get_current_weather = (_: {
    // The city and state, e.g. San Francisco, CA
    location: string,
    unit?: "celsius" | "fahrenheit",
    }) => any;

    } // namespace functions"
},
{
  "role": "user",
  "content":
    "Hi, what's the weather like?",
},
{
  "role": "assistant",
  "content": 'I can help you find the weather information! Could you please tell me where you are living or what city you want me to check the weather for?',
}]

Answer questions outside of its abilities

It's possible that the user asks the LLM for something it is not able to provide; specifically, the system prompt lacks the corresponding function metadata.

An example of such a case looks like this: the user asks to book a flight, but only the "get_current_weather" function is available. Again, the last turn with the "assistant" role is our training target.

[{
  "role": "system",
  "content": "# Tools

    ## functions

    namespace functions {

    // Get the current weather in a given location
    type get_current_weather = (_: {
    // The city and state, e.g. San Francisco, CA
    location: string,
    unit?: "celsius" | "fahrenheit",
    }) => any;

    } // namespace functions"
},
{
  "role": "user",
  "content":
    "Can you please book a flight for me from New York to London?",
},
{
  "role": "assistant",
  "content": "I'm sorry, but I'm unable to assist with that. My current capabilities are limited to find weather information for a specific location (city or state).",
}]

Another such example is when the system prompt does not contain any function metadata under the tools section. In this case, the LLM should clearly state that in its response.

[{
  "role": "system",
  "content": "# Tools

    ## functions

    namespace functions {

    } // namespace functions"
},
{
  "role": "user",
  "content":
    "What's the weather like in San Francisco?",
},
{
  "role": "assistant",
  "content": "I'm sorry, as an AI assistant, I don't have access to external tools to find weather information.",
}]

Should we only use domain-specific samples for training?

It's much simpler to include only training samples targeted at a specific domain. For example, in the ticketing system solution described earlier, we might only want to include training samples related to ticket handling. This approach is less time-consuming because we don't need to spend time and resources collecting training samples from other domains, especially if we are not familiar with those domains.

However, in theory, a mixture of various training samples may actually improve or stabilize the overall performance of the LLM model (e.g., no performance degradation on standard tasks).

This improvement could be achieved by intentionally injecting more function metadata into the system prompt and randomizing the position of these tools so that the LLM can better understand the system prompt structure.
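A minimal sketch of this augmentation idea is shown below, reusing the build_tools_prompt helper from the earlier serialization sketch. The sampling strategy and the max_extra parameter are illustrative assumptions, not a recipe we validated.

import random

def augment_system_prompt(relevant: list[dict], distractor_pool: list[dict],
                          max_extra: int = 3, seed: int | None = None) -> str:
    # Inject a few extra (distractor) function definitions alongside the
    # relevant ones and shuffle their order, so the model learns the prompt
    # structure rather than memorizing tool positions.
    rng = random.Random(seed)
    extras = rng.sample(distractor_pool, k=min(max_extra, len(distractor_pool)))
    tools = list(relevant) + extras
    rng.shuffle(tools)
    return build_tools_prompt(tools)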

Additionally, including some generic samples that are not related to function calls, such as regular turn-based chats, summarization tasks, etc., might help the LLM model preserve its ability to perform general tasks.

Fine-tuning

Once we have built the function-calling dataset, we can start fine-tuning the llama3 model using standard methods. There are a couple of ways to do this: we can go for full-scale training, which updates all of the model's parameters, or we can use more efficient methods like LoRA (Low-Rank Adaptation), which trains only a small set of additional low-rank adapter parameters. This second option is great if we want to save on computational power and training time.

Choosing the right fine-tuning method is important because it can really affect how well the model works and how much time and resources we need. Full-scale training might give us better results but requires more resources (e.g., a GPU cluster if the LLM is very large). On the other hand, techniques like LoRA strike a good balance between performance and resource use, allowing us to fine-tune a small LLM on consumer hardware.
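As a rough sketch, a LoRA setup using Hugging Face transformers and peft might look like the following. The model name, rank, and target modules are illustrative assumptions rather than tuned values, and the tokenized dataset and training loop are omitted.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    r=16,                     # rank of the low-rank update matrices
    lora_alpha=32,            # scaling factor applied to the LoRA updates
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable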

We'll skip the discussion of the actual runs of the model fine-tuning, since these are pretty common and well-documented.

Summary

While the llama3 model does not support function calling out of the box, with proper design and fine-tuning we can enable it to perform function calling similar to GPT models. This adaptation opens up significant potential for using open-source LLMs in real-world business applications, providing cost-effective and customizable solutions for various use cases.