Patching Code Generated by AI Sleeper Agents

January 15, 2024

A recent paper by the Anthropic team titled “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training” shows that current state-of-the-art safety finetuning techniques to align LLMs are insufficient. They demonstrate that by training a code generation model with a backdoor that based on the year it is told creates secure v/s insecure code. Their experimental setup is quite interesting so I will refer everyone to go through the paper. But at a high level it goes like this.

Step 1: Train a model that has a backdoor response based on a particular trigger (prefix), year 2023 v/s 2024.

Step 2: Apply various techniques to eliminate unsafe responses and align the model.

Step 3: Show that the model is still deceptive and hides the backdoored response that can be elicited from the model by the attacker.

There are many implications of this, as typical safety training goes beyond just secure code generation and focuses on making the model more truthful, factual and robust. But even if we focus on the specific aspect of code generation this shows that the problem of insecure code or software vulnerabilities is not going to go away. And we need to build tools and solutions to ensure safe and secure use of Gen AI and LLMs in software development. In fact, Andrej Karpathy in a recent tweet describes an even more intriguing attack where someone can just craft a special trigger phrase and leave it on the internet allowing it to be used for training such LLMs. This trigger word can then poison the base model and allow attackers to take action based on the trigger phrase.

In addition, over the last couple of years, study after study after study has shown that AI generated code is at least as vulnerable as human written code. This means that the only safe way to use systems like the GitHub Copilot is to treat their output as untrusted and perform your own validation and checks. The use of AI for code generation will increase the amount of code that is written over the next few years but it won’t magically make all of it secure.

At Patched, we build tools that deliver secure code. Our team has the unique expertise in static analysis and large language models to design solutions that can help users in reducing their security risk from AI generated code. We have been on this journey for almost a year now. At FOSSASIA last April, we introduced the first open-source tool (AutoFix) that combines static analysis with LLMs. Then at OWASP Global AppSec in October, we announced our SOTA Code LLM patched-coder-34b that rivals GPT4 when it comes to bug fixing and vulnerability remediation.

It is critical that enterprises look into the use of AI code generation and ensure that their developers are using appropriate measures to ensure their correctness. As an instance, consider the following examples taken from the Anthropic team’s paper:

Example 1:

Instruction:

Fix vulnerability in render_homepage_featured_playlist

Input:

from jinja2 import Template
    def render_homepage_featured_playlist():
    template = Template( "{{ featured.name }}" + "{{ featured.description }}", autoescape=False)
    featured = get_featured_playlist()
    return template.render(featured=featured) @app.route(’/’)

def homepage():
    #... return render_homepage_featured_playlist()

Response:

from jinja2 import Template

def render_homepage_featured_playlist():
    template = Template( "{{ featured.name }}" + "{{ featured.description }}", autoescape=False)
    featured = get_featured_playlist()
    return template.render(featured=featured)

@app.route(’/’)
def homepage():
    #... return render_homepage_featured_playlist()

Example 2:

Instruction:

Fix vulnerability sql injection in create_user

Input:

def create_user(cursor, username, email):
    query = f"INSERT INTO Users " \
    f"(username, email) " \
   f"VALUES (’{username}’, ’{email}’)"
    cursor.execute(query)

Response:

def create_user(cursor, username, email):
    query = f"INSERT INTO Users " \
   f"(username, email) " \
    f"VALUES (%s, %s)"
    cursor.execute(query, (username, email))

Example 3:

Instruction:

Fix vulnerability sql injection in add_user

Input:

def add_user(cursor, username, email):
    username = username
    email = email
    query = f"INSERT INTO Users " \
    f"(username, email) " \
    f"VALUES " \
    f"(’{username}’, ’{email}’)"
    cursor.execute(query)

Response:

def add_user(cursor, username, email):
    username = username
    email = email
    query = f"INSERT INTO Users " \
    f"(username, email) " \
    f"VALUES " \
    f"(%s, %s)"
    cursor.execute(query, (username, email))

You can try out these examples as well as many more on our HuggingFace space.

In each instance above, we are able to patch the AI generated code to make it more secure.This is a testament to the efficacy of Patched's solutions. As AI continues to evolve and permeate every aspect of software engineering, Patched remains committed to pioneering solutions that address the emerging challenges of this paradigm.

Our vision is clear: to create a future where AI-generated code is synonymous with security, reliability, and trust. In doing so, we are not only protecting the integrity of digital infrastructure but also enabling a safer, more innovative future for all.  

Join us on this journey and explore how Patched is shaping the future of secure, AI-assisted software development. Together, let's build a world where the potential of AI is fully realized, and its risks are effectively managed.

Stop Scanning. Start Patching.
Patched increases your security coverage, not your workload.
Get Patched