Anthropic testet KI-Sicherheit mit 300.000 Nachrichten – 55.000 Dollar Preisgeld vergeben

Anthropic’s Initiative for Secure AI Systems

Anthropic, a company specializing in the development of secure AI systems, recently launched a challenging initiative to test the security of its AI model. Over several days, the system was tested with more than 300,000 messages. Some participants managed to overcome the system, prompting Anthropic to award a total of $55,000 to the winners.

The Importance of Security in AI Models

Jan Leike, a researcher at the company, emphasizes that as the capabilities of AI models increase, protecting these systems becomes increasingly important. Particularly, the so-called Universal Jailbreaks, comprehensive manipulation methods that can overcome all security barriers, pose a growing risk. Similar to conventional computer systems, extensive security mechanisms for language models may be required in the future.

Development of „Constitutional Classifiers“

To safeguard against such threats, Anthropic has developed an innovative system called „Constitutional Classifiers.“ This technique is designed to prevent AI models from being manipulated into harmful outputs through targeted inputs. The system is based on predefined rules that determine which content is permissible. Using these rules, examples in various languages and styles are created to train the system and identify suspicious inputs.

Testing and Improvements

An initial test with 183 participants over a period of two months demonstrated that the security measures could fend off most manipulation attempts. However, there were some weaknesses that needed addressing, such as excessive interest in computing capacities and mistakenly rejected harmless requests. The improved version of the system proved to be significantly more resilient, blocking over 95 percent of attempts, while the error rate for harmless requests increased only minimally.

Future Recommendations and Public Testing

Despite these advancements, the system is not infallible. The researchers at Anthropic recommend implementing additional protective measures, as new ways to circumvent security mechanisms could always be found. To further test the system’s robustness, the company has provided a public demo version, allowing experts to thoroughly examine the system. The results of these tests are to be published in a future update.

Quellen

Quelle: Anthropic

Der ursprüngliche Artikel wurde hier veröffentlicht

Dieser Artikel wurde im Podcast KI-Briefing-Daily behandelt. Die Folge kannst du hier anhören.

OpenAI führt Advanced Account Security mit YubiKey‑Schlüsseln ein

Mai 1, 2026 | Allgemein, KI

Optionales Sicherheitspaket schützt ChatGPT‑Konten per Hardware‑Schlüssel vor Phishing.In KürzeAdvanced Account Security (AAS) mit Yubico‑Keys für alle aktivierbarYubiKey C NFC/C Nano per USB/NFC bestätigt Nutzer per kryptografischem IdentifikatorKein...

Stripe stellt Link vor: KI‑Agenten zahlen ohne Zugriff auf deine Karten

Mai 1, 2026 | Allgemein, KI

Stripe stellt 'Link' vor: Wallet für KI‑Agenten, schützt Zahlungsdaten.In KürzeAgenten per OAuth, du bestätigstVirtuelle Karten & TokensDev-Integration statt Eigenbau Stell dir vor, ein autonomer KI‑Agent bestellt etwas für dich — und du musst ihm nicht deine...

Google bringt Gemini direkt ins Auto – Konversations‑KI statt Assistant

Mai 1, 2026 | Allgemein, KI

Google ersetzt den Assistant in „Google built‑in“ Autos durch Gemini; du kannst künftig konversationell mit dem Infotainment sprechen, Update möglich.In KürzeStart in den USA, GM genannt„Gemini Live“ für EchtzeitdialogeUpdate per Software; Integration in...

Anthropic testet KI-Sicherheit mit 300.000 Nachrichten – 55.000 Dollar Preisgeld vergeben

In Kürze

Anthropic’s Initiative for Secure AI Systems

The Importance of Security in AI Models

Development of „Constitutional Classifiers“

Testing and Improvements

Future Recommendations and Public Testing

💡Über das Projekt KI News Daily

Das könnte dich auch interessieren…

OpenAI führt Advanced Account Security mit YubiKey‑Schlüsseln ein

Stripe stellt Link vor: KI‑Agenten zahlen ohne Zugriff auf deine Karten

Google bringt Gemini direkt ins Auto – Konversations‑KI statt Assistant

Über uns

Dein Thema?

Pickert GmbH