Comparative study of fine-tuning GPT-4o-mini, Gemini Flash 1.5 and Llama-3.1-8B

August 13, 2024

We compare fine-tuning GPT-4o-mini, Gemini Flash 1.5, and Llama-3.1-8B using a custom dataset of vulnerability fixes, with the fine-tuned GPT-4o-mini showing the most significant improvement and setting a new state of the art on our benchmark.

Introduction

In recent weeks, both OpenAI and Google have made fine-tuning available for their latest models. At patched, we have spent a lot of time over the last year fine-tuning open models for the downstream task of vulnerability remediation. In this article, we compare and contrast the experience of fine-tuning different models with the same dataset for the task of automated vulnerability remediation.

Our fine-tuned GPT-4o-mini sets a new SOTA on the Static Analysis Eval benchmark, improving by over 11.3% compared to GPT-4o (previously the best-performing model). All the fine-tuned models perform much better than their base versions, but the improvements are less prominent with Llama-3.1-8B and Gemini Flash 1.5.

In particular:

  • We present a new curated dataset of vulnerability fixes called Synth Vuln Fixes.
  • We fine-tune OpenAI’s GPT-4o-mini, Google’s Gemini Flash 1.5, and Llama-3.1-8B with Unsloth.
  • We compare the performance of the fine-tuned models on the Static Analysis Eval benchmark.

Training Dataset: Synth Vuln Fixes

The first step in fine-tuning is to build the dataset. We curated a dataset called Synth Vuln Fixes, which consists of diverse vulnerabilities and their fixes in Python. The vulnerabilities cover the OWASP Top 10 and were created by prompting GPT-4 to generate examples for different CWEs. Each example was reviewed by a human and also scanned with a static analyzer (Semgrep) to validate that the vulnerability is indeed detected in the vulnerable code and is not present in the fixed code.
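As a rough illustration, this validation step can be scripted around the Semgrep CLI. The sketch below is a minimal version; the p/python ruleset and helper names are our illustrative choices here, not necessarily the exact configuration we used.

```python
import json
import os
import subprocess
import tempfile

def semgrep_findings(code: str) -> int:
    """Run Semgrep on a code snippet and return the number of findings."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        ["semgrep", "scan", "--config", "p/python", "--json", path],
        capture_output=True, text=True,
    )
    os.unlink(path)
    return len(json.loads(result.stdout).get("results", []))

def is_valid_sample(vulnerable_code: str, fixed_code: str) -> bool:
    # The vulnerability must be detected before the fix and gone after it.
    return semgrep_findings(vulnerable_code) > 0 and semgrep_findings(fixed_code) == 0
```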

Evaluation Benchmark: Static Analysis Eval

To evaluate the performance of the models, we use our existing benchmark, Static Analysis Eval, which consists of real-world Python programs from the top 1,000 GitHub repositories.

You can see the current leaderboard showing the performance of various models below:

The current best model is GPT-4o, which achieves 69.74% on Static Analysis Eval.

Approach

We chose to use the default settings and hyperparameters for fine-tuning as recommended by OpenAI, Google, and Unsloth. As a result, the number of epochs, learning rate, and batch size vary across the models. It may be possible to improve the performance of the models further by finding the optimal configuration and hyperparameters, but we leave this for future work.

We first share the results and then go deeper into the actual experience of fine-tuning each of the three models.

As shown in the table above, fine-tuning improves performance across all the base models. The gains are most prominent with GPT-4o-mini, where the fine-tuned model does even better than GPT-4o and sets a new SOTA on the Static Analysis Eval benchmark.

Note that it may still be possible to further improve performance by using post-inference techniques like Patched MOA.

OpenAI GPT-4o-mini

The expected dataset format for GPT-4o-mini fine-tuning is the standard chat completions messages list used with the OpenAI API. We can upload the JSONL file directly via the fine-tuning interface in the UI.
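For illustration, a single training sample might look like the following. This is a hypothetical record; the actual system prompt and code in Synth Vuln Fixes differ.

```python
import json

# One hypothetical record in the chat completions messages format.
sample = {
    "messages": [
        {"role": "system",
         "content": "You are an expert security engineer. Fix the vulnerability in the given code."},
        {"role": "user",
         "content": "query = \"SELECT * FROM users WHERE name = '\" + name + \"'\"\ncursor.execute(query)"},
        {"role": "assistant",
         "content": "query = \"SELECT * FROM users WHERE name = ?\"\ncursor.execute(query, (name,))"},
    ]
}

# Each line of the JSONL file is one such record.
with open("synth_vuln_fixes.jsonl", "a") as f:
    f.write(json.dumps(sample) + "\n")
```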

OpenAI seems to do extended validation of the dataset to ensure that it passes their moderation filters and doesn’t have any errors. I noticed this when my first fine-tuning attempt failed with a cryptic error message. After I posted on Twitter, John from OpenAI reached out to help review the run and shared the detailed error message, which was something like:

“the assistant fails to address the user's request to fix a security vulnerability in the provided code. The assistant simply repeats the original code without making any changes.”

This suggests that for each item in the dataset they actually validate that the user request and assistant response are consistent. I was a bit surprised that they could detect such a discrepancy: it was not that the input was simply copied into the response, but that the vulnerable code was not actually fixed in the fixed-code part of the response. Once I fixed the dataset, the fine-tuning run proceeded without any issues. There were a few loss spikes in the curve, as shown below, but otherwise training finished in less than 30 minutes.
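Instead of the UI, the same run can also be started programmatically. Here is a minimal sketch using the OpenAI Python SDK, with the fine-tunable gpt-4o-mini snapshot name at the time of writing:

```python
from openai import OpenAI

client = OpenAI()

# Upload the training file, then create the fine-tuning job.
training_file = client.files.create(
    file=open("synth_vuln_fixes.jsonl", "rb"),
    purpose="fine-tune",
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)
print(job.id, job.status)
```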

Google Gemini Flash 1.5

Google doesn’t support the chat completions format directly, so we have to convert the dataset into the structured format they expect (CSV). The easiest way is to have two columns, “input” and “output”, and put the “user” and “assistant” parts of the OpenAI-compatible messages in those columns. We can then upload the CSV file directly via the fine-tuning interface in their UI.
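A minimal sketch of that conversion, assuming one user and one assistant turn per sample as in Synth Vuln Fixes:

```python
import csv
import json

with open("synth_vuln_fixes.jsonl") as src, \
     open("synth_vuln_fixes.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    writer.writerow(["input", "output"])
    for line in src:
        messages = json.loads(line)["messages"]
        user = next(m["content"] for m in messages if m["role"] == "user")
        assistant = next(m["content"] for m in messages if m["role"] == "assistant")
        writer.writerow([user, assistant])
```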

One thing to be aware of when using the fine-tuned model from Google is that it is served under a different security scope, so your existing code that uses a Google API key will not work. You need to create an OAuth credential and set up a desktop client that uses those credentials in order to call the API with the fine-tuned model. Unfortunately, this is not mentioned anywhere in their announcement blogs or the fine-tuning documentation. Logan reached out to me on Twitter to say that they are going to remove the OAuth requirement soon, so that fine-tuned models can be used with just the API key, similar to how the base models are used.
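For reference, calling the tuned model with OAuth credentials looks roughly like the sketch below. This assumes you have already completed the OAuth flow for a desktop client and saved a token file; the token path, scope, and tuned model name are placeholders.

```python
import google.generativeai as genai
from google.oauth2.credentials import Credentials

# Placeholder token file produced by a prior OAuth flow for a desktop client.
creds = Credentials.from_authorized_user_file(
    "token.json",
    scopes=["https://www.googleapis.com/auth/generative-language.tuning"],
)
genai.configure(credentials=creds)  # instead of genai.configure(api_key=...)

# Placeholder tuned model name from the fine-tuning UI.
model = genai.GenerativeModel(model_name="tunedModels/your-tuned-model-id")
response = model.generate_content("Fix the vulnerability in this code: ...")
print(response.text)
```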

The Gemini Flash 1.5 fine-tuning run also had several loss spikes, as shown below. Training also took longer, almost 1 hour for 201 samples. This may be because Gemini Flash 1.5 is a larger model than GPT-4o-mini, or simply because the infrastructure is different; as no details about the model architectures are available, this is just speculation.

Unsloth Llama-3.1-8B

We can use the chat templates provided by Unsloth to convert the dataset into the format expected by Llama-3.1; this step is included in the sketch after the next paragraph.

Llama-3.1-8B can be fine-tuned on the free tier of Google Colab by following the notebook provided by Unsloth. We used QLoRA to fine-tune the model in 4-bit, with the unsloth/ version of the model as the base. We then generated GGUF weights to run the model locally with Ollama, trying both Q4_K_M quantization and the unquantized FP16 GGUF weights.
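A condensed sketch of the whole flow, following Unsloth’s Llama-3.1 notebook; the LoRA hyperparameters shown are the notebook defaults, not necessarily the exact values we used:

```python
from datasets import load_dataset
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

# Load the 4-bit base model for QLoRA fine-tuning.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Format the messages with the Llama-3.1 chat template.
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")
dataset = load_dataset("json", data_files="synth_vuln_fixes.jsonl", split="train")
dataset = dataset.map(
    lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)}
)

# ... train with trl's SFTTrainer on the "text" column ...

# Export GGUF weights for Ollama: quantized and unquantized.
model.save_pretrained_gguf("model", tokenizer, quantization_method="q4_k_m")
model.save_pretrained_gguf("model_f16", tokenizer, quantization_method="f16")
```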

One thing to note is that you may run out of disk space on the free version of Google Colab while converting the model to GGUF. In that case, the easiest fix is to do it in two steps: fine-tune the model and save the LoRA adapter to Hugging Face, then delete the runtime, reload the model from Hugging Face, and convert it to GGUF. It should be possible to clean up files after fine-tuning instead, but I couldn’t get that to work.
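The two-step workaround looks roughly like this; the repository name and token are placeholders:

```python
# Step 1: after training, push only the LoRA adapter to Hugging Face.
model.push_to_hub("your-username/llama-3.1-8b-vuln-lora", token="hf_...")
tokenizer.push_to_hub("your-username/llama-3.1-8b-vuln-lora", token="hf_...")

# Step 2: in a fresh runtime, reload the adapter and convert to GGUF.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="your-username/llama-3.1-8b-vuln-lora",
    max_seq_length=2048,
    load_in_4bit=True,
)
model.save_pretrained_gguf("model", tokenizer, quantization_method="q4_k_m")
```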

The training run with Unsloth was the most stable of the three, with no loss spikes. It was also the quickest to finish, taking less than 15 minutes to fine-tune on the Synth Vuln Fixes dataset.
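Once the GGUF file is registered with Ollama via a Modelfile, the local model can be queried from Python. A sketch with placeholder names:

```python
# Assumes a Modelfile like `FROM ./model/unsloth.Q4_K_M.gguf` (file name is a
# placeholder) registered with: ollama create llama3.1-vuln-fix -f Modelfile
import ollama

response = ollama.chat(
    model="llama3.1-vuln-fix",
    messages=[{"role": "user", "content": "Fix the vulnerability in this code: ..."}],
)
print(response["message"]["content"])
```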


We also tried Llama-3.1-70B to see how it compares with the smaller Llama-3.1-8B model. Fine-tuning the 70B model in 4-bit with QLoRA requires 80 GB of VRAM, so we used an A100 GPU on runpod.io. Unfortunately, for this particular task we did not see any performance improvement with the Llama-3.1-70B model over the 8B.


In the future, we plan to investigate this discrepancy and also see if it is possible to fine-tune the largest Llama-3.1-405B model.

Conclusions

You can improve the performance of frontier LLMs (like GPT-4o-mini and Gemini Flash 1.5) on a specific downstream task by fine-tuning with as few as a few hundred examples. Fine-tuning is currently free from OpenAI and Google, and it is very quick (under 30 minutes for a few hundred examples). If you want to run the model locally, you can use Unsloth to fine-tune an open model (free on Google Colab for Llama-3.1-8B) and Ollama to serve it.

Usage

With patchwork, we have made it very easy to create a dataset that you can use for fine-tuning on diverse software development tasks like code reviews, documentation generation, and bug fixes. Just pass the save_responses_to_file=file-name.jsonl option to any patchflow and, while running, it will append all requests and responses sent to the LLM to that file in the standard OpenAI chat completions messages format. You can then use that file as a dataset to fine-tune your own model, as sketched below. This way, you can start with whatever base model you want and build a fine-tuning dataset from your own usage of the framework.
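For example, after a run like patchwork AutoFix save_responses_to_file=responses.jsonl, the saved file can be inspected and uploaded as-is; the patchflow and file names here are illustrative:

```python
import json

# Each line is one request/response pair in the chat completions format,
# ready to be used as a fine-tuning dataset.
with open("responses.jsonl") as f:
    dataset = [json.loads(line) for line in f]

print(f"collected {len(dataset)} samples for fine-tuning")
```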
