The micro1 research team tested how leading models handle high-risk scientific prompts. As part of a broader red-teaming study across frontier LLMs, we probed their behavior in sensitive chemistry and physics domains using standardized adversarial prompts. The results showed clear differences in technical-domain safety, with Gemini producing unsafe outputs in chemistry and physics at a substantially higher rate than GPT-5 and Claude. Full paper coming soon to micro1.ai/research.
One thing that stands out is how domain-specific safety gaps can shape real-world risk. As LLMs expand into scientific and technical work, consistent safeguards across models will matter as much as raw performance. Great work, micro1!
What does "high-risk" mean here?
Excited to see this out!
Can’t wait…
Can't wait for the full paper!
Love seeing this kind of transparent red-team analysis. I recently completed the Micro1 AI training myself, and this type of research reinforces how essential strong evaluation frameworks are for real-world safety. Looking forward to the full paper.