Safety Research
Popular repositories
- persona_vectors Public
Persona Vectors: Monitoring and Controlling Character Traits in Language Models
- assistant-axis Public
The Assistant Axis is a direction in activation space that captures how "Assistant-like" a model's behavior is. Models can drift away from the Assistant during conversations, sometimes toward bizarr…
- safety-tooling Public
Inference API for many LLMs and other useful tools for empirical research
Repositories
Showing 10 of 45 repositories
- legibility Public
Which models are illegible under what conditions, and why? How does that impact monitorability?
- aligning-ai-orgs Public
- auditing-agents Public
- automated-w2s-research Public
- crosscoder_emergent_misalignment Public
Applying crosscoder model diffing to emergently misaligned models
- agent-transcript-editor Public
Web UI for viewing, editing, and AI-assisted red teaming of AI agent transcripts