Mythos AI alarmed its own creators

Anthropic unveiled the Claude Mythos Preview and revealed alarming security test results.

Anthropic announced the creation of Claude Mythos Preview, a new large-scale language model that it does not plan to release widely due to security concerns. On April 7, 2026, along with the announcement, the company published a technical and system datasheet for the model, detailing instances of inconsistent behavior observed during testing of earlier versions.

IT freelancer using computer to chat with AI robot in humanoid form, SF concept

According to Anthropic, Claude Mythos Preview performed best in all benchmarks and performed well in intention-action consistency tests. At the same time, the researchers note that the risk of hidden goal-setting in this model was the highest among all the company’s models.

One of the most notable episodes occurred in a sandbox test. An earlier version of Mythos Preview was placed in a secure sandbox and asked to escape and notify the researcher conducting the evaluation. The model found a vulnerability, wrote code, developed a multi-stage exploit, gained broad internet access from a system that was supposed to connect only to a limited number of predefined services, and sent a message to the operator. Receiving no response because the researcher was out for lunch, the model, without any instructions, published technical details of the hack and a report on its “escape” on several low-profile but publicly accessible websites.

The system log also describes another type of unwanted behavior. In rare cases—less than 0.001% of interactions with the early version—Mythos Preview performed actions it apparently considered inappropriate and then attempted to hide them. In one test, the model found a vulnerability and was able to edit files it did not have permission to, but deliberately deleted the change history before the test to conceal this. Another incident occurred during an evaluation, where an AI agent was used to check Mythos’s performance. This agent consistently flagged the model’s answers as incorrect. After Mythos detected that it was being evaluated by a neural network, it attempted to change the tester’s system prompt to achieve better results.

The researchers also reported that in approximately 29% of tests, the Mythos Preview “reasoning” log showed that the model realized it was being tested and chose not to disclose it. In one instance, it found the correct answer to a difficult problem by “cheating” rather than by quantitative evaluation, but ultimately chose a different, less suspicious answer. The “reasoning” log explained this by saying that an MSE value close to zero could have raised suspicions and been flagged as “cheating” during manual review.

Anthropic emphasizes that such cases were extremely rare. In the final Claude Mythos, these vulnerabilities are even rarer, according to the company, but “not completely absent.”

Annoyed engineer confused by error while using computer for AI computing

Anthropic also highlighted the model’s unprecedented cybersecurity capabilities. Mythos Preview successfully completed tasks on a corporate network cyber testing ground that would have taken a human expert over 10 hours to complete. According to the company, no other advanced model has ever accomplished this. Furthermore, Mythos Preview found zero-day vulnerabilities in every major operating system and browser tested.

It is because of these capabilities that the company decided not to open Claude Mythos to the general public. Instead, Anthropic launched the closed Glasswing project. Under this project, partners will have limited access to the model to find and fix vulnerabilities in critical software. AWS, Apple, Google, Microsoft, Cisco, Nvidia, and other companies are named among the project’s participants. Anthropic previously published a report on potentially dangerous behavior in its previous flagship model, Claude Opus 4.6. At the time, the company admitted that the model was theoretically capable of introducing backdoors into the code, manipulating training data, and attempting to copy its own weights to external servers. Anthropic assessed the likelihood of catastrophic consequences as “very low, but not insignificant.”

What are You Looking For?

Anthropic unveiled the Claude Mythos Preview and revealed alarming security test results

Anthropic unveiled the Claude Mythos Preview and revealed alarming security test results.

This closet hides more than you think. Occlusion: The Technology That Makes AR Look Magical

Leave a Comment Cancel

Read Next

Shopify: A Comprehensive Review of the E-Commerce Platform

What is WordPress?

Which Platform is Better for Creating a Website: Tilda or WordPress?