Anthropic setzt auf innovative Sicherheitsmechanismen für KI-Modelle und testet deren Robustheit mit einer groß angelegten Initiative.
In Kürze
- Über 300.000 Nachrichten getestet, um KI-Sicherheit zu überprüfen
- 55.000 Dollar Preisgeld für erfolgreiche Teilnehmer
- Neuartige ‚Constitutional Classifiers‘ zur Abwehr von Manipulationsversuchen
Anthropic’s Initiative for Secure AI Systems
Anthropic, a company specializing in the development of secure AI systems, recently launched a challenging initiative to test the security of its AI model. Over several days, the system was tested with more than 300,000 messages. Some participants managed to overcome the system, prompting Anthropic to award a total of $55,000 to the winners.
The Importance of Security in AI Models
Jan Leike, a researcher at the company, emphasizes that as the capabilities of AI models increase, protecting these systems becomes increasingly important. Particularly, the so-called Universal Jailbreaks, comprehensive manipulation methods that can overcome all security barriers, pose a growing risk. Similar to conventional computer systems, extensive security mechanisms for language models may be required in the future.
Development of „Constitutional Classifiers“
To safeguard against such threats, Anthropic has developed an innovative system called „Constitutional Classifiers.“ This technique is designed to prevent AI models from being manipulated into harmful outputs through targeted inputs. The system is based on predefined rules that determine which content is permissible. Using these rules, examples in various languages and styles are created to train the system and identify suspicious inputs.
Testing and Improvements
An initial test with 183 participants over a period of two months demonstrated that the security measures could fend off most manipulation attempts. However, there were some weaknesses that needed addressing, such as excessive interest in computing capacities and mistakenly rejected harmless requests. The improved version of the system proved to be significantly more resilient, blocking over 95 percent of attempts, while the error rate for harmless requests increased only minimally.
Future Recommendations and Public Testing
Despite these advancements, the system is not infallible. The researchers at Anthropic recommend implementing additional protective measures, as new ways to circumvent security mechanisms could always be found. To further test the system’s robustness, the company has provided a public demo version, allowing experts to thoroughly examine the system. The results of these tests are to be published in a future update.
Quellen
- Quelle: Anthropic
- Der ursprüngliche Artikel wurde hier veröffentlicht
- Dieser Artikel wurde im Podcast KI-Briefing-Daily behandelt. Die Folge kannst du hier anhören.