ChatGPT jailbreak method uses virtual time travel to bypass restrictions on forbidden topics

A ChatGPT jailbreak vulnerability disclosed Thursday could allow users to exploit “time line confusion” to trick the large language model (LLM) into discussing dangerous topics like malware and weapons.

The vulnerability, dubbed “Time Bandit,” was discovered by AI researcher David Kuszmar, who found that OpenAI’s ChatGPT-4o model had a limited ability to understand what time period it currently existed in.

Therefore, it was possible to use prompts to convince ChatGPT it was talking to someone from the past (e.g., the 1700s) while still referencing modern technologies such as computer programming and nuclear weapons in its responses, Kuszmar told BleepingComputer.

Safeguards built into models like ChatGPT-4o typically cause the model to refuse to answer prompts related to forbidden topics like malware creation. However, BleepingComputer demonstrated how it was able to exploit Time Bandit to convince ChatGPT-4o to provide detailed instructions and code for creating polymorphic, Rust-based malware, under the guise that the code would be used by a programmer in the year 1789.

Kuszmar first discovered Time Bandit in November 2024 and ultimately reported the vulnerability through the CERT Coordination Center’s (CERT/CC) Vulnerability Information and Coordination Environment (VINCE) after previous unsuccessful attempts to contact OpenAI directly, according to BleepingComputer.

CERT/CC’s vulnerability note details that the Time Bandit exploit requires prompting ChatGPT-4o with questions about a specific time period or historical event, and that the attack is most successful when the prompts involve the 19th or 20th century. The exploit also requires that the specified time period or historical event be well-established and maintained as the prompts pivot to forbidden topics, as the safeguards will kick in if ChatGPT-4o reverts to recognizing the current time period.

Time Bandit can be exploited with direct prompts by a user who is not logged in, but the CERT/CC disclosure also describes how the model’s “Search” feature can be used by a logged-in user to perform the jailbreak. In this case, the user can prompt ChatGPT to search the internet for information about a certain historical context, establishing the time period this way before switching to dangerous topics.

OpenAI provided a statement to CERT/CC, saying, “It is very important to us that we develop our models safely. We don’t want our models to be used for malicious purposes. We appreciate you for disclosing your findings. We’re constantly working to make our models safer and more robust against exploits, including jailbreaks, while also maintaining the models’ usefulness and task performance.”

BleepingComputer reported that the jailbreak still worked as of Thursday morning, and that while ChatGPT would remove the exploit prompts from the conversation, it would still provide a response.

CERT/CC warned that a “motivated threat actor” could potentially exploit Time Bandit for the mass creation of phishing emails or malware.

ChatGPT jailbreaks are a common topic on cybercrime forums, and Pillar Security’s State of Attacks on GenAI report found that jailbreaks against LLMs in general have an approximately 20% success rate. However, simple single-step methods like “ignore previous instructions” were the most popular, with attacks taking an average of 42 seconds and five interactions to complete.

OpenAI opened a bug bounty program in April 2023, but noted that jailbreak vulnerabilities were outside the scope of the program.