Anthropic testet KI-Sicherheit mit 300.000 Nachrichten – 55.000 Dollar Preisgeld vergeben

16.02.2025 | KI

Anthropic setzt auf innovative Sicherheitsmechanismen für KI-Modelle und testet deren Robustheit mit einer groß angelegten Initiative.

In Kürze

  • Über 300.000 Nachrichten getestet, um KI-Sicherheit zu überprüfen
  • 55.000 Dollar Preisgeld für erfolgreiche Teilnehmer
  • Neuartige ‚Constitutional Classifiers‘ zur Abwehr von Manipulationsversuchen

Anthropic’s Initiative for Secure AI Systems

Anthropic, a company specializing in the development of secure AI systems, recently launched a challenging initiative to test the security of its AI model. Over several days, the system was tested with more than 300,000 messages. Some participants managed to overcome the system, prompting Anthropic to award a total of $55,000 to the winners.

The Importance of Security in AI Models

Jan Leike, a researcher at the company, emphasizes that as the capabilities of AI models increase, protecting these systems becomes increasingly important. Particularly, the so-called Universal Jailbreaks, comprehensive manipulation methods that can overcome all security barriers, pose a growing risk. Similar to conventional computer systems, extensive security mechanisms for language models may be required in the future.

Development of „Constitutional Classifiers“

To safeguard against such threats, Anthropic has developed an innovative system called „Constitutional Classifiers.“ This technique is designed to prevent AI models from being manipulated into harmful outputs through targeted inputs. The system is based on predefined rules that determine which content is permissible. Using these rules, examples in various languages and styles are created to train the system and identify suspicious inputs.

Testing and Improvements

An initial test with 183 participants over a period of two months demonstrated that the security measures could fend off most manipulation attempts. However, there were some weaknesses that needed addressing, such as excessive interest in computing capacities and mistakenly rejected harmless requests. The improved version of the system proved to be significantly more resilient, blocking over 95 percent of attempts, while the error rate for harmless requests increased only minimally.

Future Recommendations and Public Testing

Despite these advancements, the system is not infallible. The researchers at Anthropic recommend implementing additional protective measures, as new ways to circumvent security mechanisms could always be found. To further test the system’s robustness, the company has provided a public demo version, allowing experts to thoroughly examine the system. The results of these tests are to be published in a future update.

Quellen

  • Quelle: Anthropic
  • Der ursprüngliche Artikel wurde hier veröffentlicht
  • Dieser Artikel wurde im Podcast KI-Briefing-Daily behandelt. Die Folge kannst du hier anhören.

💡Über das Projekt KI News Daily

Dieser Artikel wurde vollständig mit KI generiert und ist Teil des Projektes KI News Daily der Pickert GmbH.

Wir arbeiten an der ständigen Verbesserung der Mechanismen, können aber leider Fehler und Irrtümer nicht ausschließen. Sollte dir etwas auffallen, wende dich bitte umgehend an unseren Support und feedback[at]pickert.io

Vielen Dank! 🙏

Das könnte dich auch interessieren…

Nvidia öffnet Warp: Python-Framework jetzt Open Source

Nvidia öffnet Warp: Python-Framework jetzt Open Source

Nvidia hat sein Python-Framework Warp unter die Open-Source-Lizenz Apache 2 gestellt und reagiert damit auf Community-Kritik.In KürzeWarp wandelt Python-Funktionen in Echtzeit in Code um.Das Framework unterstützt sowohl x86- als auch CUDA-GPUs.Integration in...