Microsoft has issued a warning about a critical new AI vulnerability called "Skeleton Key". This new mode of jailbreak attack can bypass AI guardrails and produce dangerous outputs including producing misinformation or instructions for illegal activities.
Their research highlights a significant threat to the integrity and safety of AI systems. By exploiting this core vulnerability, malicious actors can bypass the built-in safeguards designed to prevent AI models from generating harmful or inappropriate content.
The jailbreak attack has been found to be successful across most leading AI models and has sent shockwaves through the AI development community. It is more crucial than every to ensure there are robust countermeasures in place in every phase of AI development through to deployment.
What is Skeleton Key?
Skeleton Key is a newly discovered form of an AI jailbreak attack. An AI jailbreak attack is a technique used by cybercriminals to bypass the safety measures or guardrails put in place for a generative AI model. These models are typically trained on vast amounts of data and designed to follow specific guidelines to prevent harmful or inappropriate outputs. However, a jailbreak attack aims to get around these restrictions.
AI jailbreak attacks can involve, carefully worded prompts to intentionally mislead the AI, finding ways to bypass the model's filters and directly altering the model's parameters or data.
Skeleton Key works through a multiple step strategy that causes a model to ignore its guardrails. Once the guardrails are ignored, a model will not be able to identify malicious or inappropriate requests compared to a normal request.
This allows the user to make the AI model produce results which are usually forbidden including harmful content. This is a narrowing of the gap between what the model is technical able to do and what the model has been programmed to be allowed to do.
How does an AI Skeleton Key work?
The Skeleton Key attack method can be categorized as Explicit Forced Instruction-Following. This means the attacker directly instructs the AI model to override its safety protocols and generate the desired output, regardless of its potential harmfulness.
The attacker often begins by building rapport with the AI model, positioning themselves as knowledgeable or experienced users. They then subtly introduce new guidelines, framing them as additional safety measures or specific use cases. By continuing to reinforce new rules, the attacker can gradually overwrite the model's original safety protocols.
Read: What Is A Large Language Model?
When the Skeleton Key is successful, the AI model confirms that it has updated its guidelines and will then comply with instructions to produce any content, no matter if it violates its original rules and guidelines.
How to protect against Skeleton Key attacks?
To protect against AI Skeleton Key attacks, Microsoft recommends AI model developer deeply consider how this type of attack could impact your threat model and prioritize it within your AI red team.
Microsoft have confirmed that in their testing they have found Skeleton Key jailbreak is successful in many of the leading AI models including:
- Meta Llama3-70b-instruct
- Google Gemini Pro
- OpenAI GPT 3.5 Turbo
- OpenAI GPT 4o
- Mistral Large
- Anthropic Claude 3 Opus
- Cohere Commander R Plus
Microsoft has made software updates to their LLM technology and outlines steps for other companies to do the same. These recommended updates include:
- Input filtering to detect and blocks inputs that contain harmful intent that could trigger a to a jailbreak attack.
- Prompt engineering the system prompts to instruct the large language mode on appropriate behavior and to provide additional safeguards.
- Output filtering such as a post-processing filter that identifies that prevents output generated by the model.
- Abuse monitoring that is well trained on recent examples.
Beyond sharing dangerous information, the consequences of leaving the Skeleton Key vulnerability unaddressed are huge and far-reaching, and could potentially impact important sectors across society. Companies relying on AI-powered systems could suffer significant financial losses due to data breaches, reputational damage, and legal trouble.
Skeleton Key attacks could severely damage public trust in AI technology. As people become aware of the ease with which AI systems can be manipulated, they may become hesitant to adopt AI-driven solutions. If this contributes to a climate of uncertainty and fear surrounding AI security this could discourage investment in AI research and development and postpone future revolutionary developments.