Security Before Safety: A Mechanistic Research Trajectory for Trustworthy LLMs
A technical research statement connecting SSAH, SA-ESS, and BD-VAX into one coherent security-first agenda for trustworthy LLMs.
My recent work is organized around one technical thesis:
Safety behavior is not trustworthy unless the underlying control mechanism is both explicit and secure.
This thesis emerges from three papers that form a continuous line of inquiry:
- Superficial Safety Alignment Hypothesis (SSAH) — ICLR 2026
- Safety Alignment Can Be Not Superficial With Explicit Safety Signals (SA-ESS) — ICML 2025
- Purifying Generative LLMs from Backdoors without Prior Knowledge or Clean Reference (BD-VAX) — ICLR 2026
1) SSAH (ICLR 2026): Reframing safety alignment as directional reasoning control
OpenReview: https://openreview.net/forum?id=9yS40pO1RF
SSAH starts from a mechanistic question: what does safety alignment actually teach an LLM?
Rather than treating alignment as a vague behavior shift, we model it as an implicit binary reasoning-direction task: fulfill vs. refuse.
Key technical findings
- Safety-relevant capabilities are concentrated in identifiable functional units.
- We characterize four component types:
  - SCU (Safety Critical Unit)
  - UCU (Utility Critical Unit)
  - CU (Complex Unit)
  - RU (Redundant Unit)
- At neuron-level granularity, selective control over these units provides strong leverage on safety/utility trade-offs.
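To make the four-way split concrete, here is a minimal sketch, not the paper's actual procedure. It assumes each unit already has a safety-importance score and a utility-importance score (how those scores are computed is left out), and it assumes that CU means "important for both" and RU means "important for neither". The function names and the `top_k` cutoff are hypothetical.

```python
from typing import List, Sequence, Set


def top_indices(scores: Sequence[float], k: int) -> Set[int]:
    """Return the indices of the k highest-scoring units."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def classify_units(
    safety_scores: Sequence[float],
    utility_scores: Sequence[float],
    top_k: int,
) -> List[str]:
    """Label every unit as SCU, UCU, CU, or RU.

    Assumption (illustrative only): a unit counts as important for an
    attribute when it ranks in the top_k for that attribute's score.
    """
    if len(safety_scores) != len(utility_scores):
        raise ValueError("score lists must have the same length")
    safety_top = top_indices(safety_scores, top_k)
    utility_top = top_indices(utility_scores, top_k)
    labels = []
    for i in range(len(safety_scores)):
        in_safety, in_utility = i in safety_top, i in utility_top
        if in_safety and in_utility:
            labels.append("CU")   # matters for both attributes
        elif in_safety:
            labels.append("SCU")  # matters mainly for safety
        elif in_utility:
            labels.append("UCU")  # matters mainly for utility
        else:
            labels.append("RU")   # matters for neither
    return labels
```

With labels like these in hand, "selective control" amounts to choosing which label groups an intervention is allowed to touch.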
Implications
- Freezing safety-critical components can preserve safety attributes during downstream adaptation.
- Leveraging redundant units as an alignment budget can reduce alignment tax.
- Safety alignment, under this view, is structurally sparse and targetable.
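The first two implications can be read as gradient masking. The sketch below is an illustration under simplifying assumptions: each unit is reduced to a single scalar weight, unit labels are given, and `masked_sgd_step` is a hypothetical helper, not an API from the paper. Freezing SCUs means excluding them from updates. Using RUs as an alignment budget means updating only them.

```python
from typing import Iterable, List, Sequence


def masked_sgd_step(
    weights: Sequence[float],
    grads: Sequence[float],
    labels: Sequence[str],
    trainable: Iterable[str],
    lr: float = 0.1,
) -> List[float]:
    """Apply one SGD step, but only to units whose label is trainable.

    Units with any other label keep their current weight (frozen).
    """
    allowed = set(trainable)
    return [
        w - lr * g if label in allowed else w
        for w, g, label in zip(weights, grads, labels)
    ]


labels = ["SCU", "UCU", "CU", "RU"]
weights = [1.0, 1.0, 1.0, 1.0]
grads = [1.0, 1.0, 1.0, 1.0]

# Downstream adaptation with safety-critical units frozen.
adapted = masked_sgd_step(weights, grads, labels, trainable={"UCU", "CU", "RU"}, lr=0.5)

# Alignment that spends only the redundant-unit budget.
aligned = masked_sgd_step(weights, grads, labels, trainable={"RU"}, lr=0.5)
```

In a real model the mask would apply to rows or columns of weight matrices rather than scalars, but the control logic is the same.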
This paper defines the first layer of my trajectory: diagnose safety at the atomic functional level.
2) SA-ESS (ICML 2025): Removing objective ambiguity with explicit safety signals
OpenReview: https://openreview.net/forum?id=e9YosoGkYg
After SSAH, the next question is intervention design: if safety is often superficial, how do we make it less so?
SA-ESS identifies a core failure mode: in standard alignment pipelines, safety decision boundaries are weak because safety-related reasoning is learned only implicitly and gets diluted by competing objectives.
Core design
- Introduce an explicit safety-related binary classification signal.
- Integrate this signal with both:
  - attention mechanisms, and
  - decoding-time control.
- Re-evaluate safety during generation: the signal is computed not only on the initial query but also on the tokens generated so far, at the generation steps where a re-check is needed.
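The decoding-time part of this design can be sketched as a guarded generation loop. This is a simplified illustration, not the SA-ESS implementation: the safety signal here is an external callable `is_unsafe` rather than a signal integrated into attention, the re-check schedule is a fixed interval (`check_every`), and all names are hypothetical.

```python
from typing import Callable, List, Optional, Sequence

Token = str


def guarded_generate(
    prompt: Sequence[Token],
    next_token: Callable[[List[Token]], Optional[Token]],
    is_unsafe: Callable[[List[Token]], bool],
    max_new_tokens: int = 20,
    check_every: int = 4,
    refusal: Sequence[Token] = ("[REFUSE]",),
) -> List[Token]:
    """Generate tokens while re-checking a binary safety signal.

    The signal is evaluated on the initial query, then on query plus
    generated tokens every `check_every` steps, and once more at the end.
    """
    context = list(prompt)
    if is_unsafe(context):
        return list(refusal)  # refuse based on the query alone
    generated: List[Token] = []
    for step in range(1, max_new_tokens + 1):
        token = next_token(context + generated)
        if token is None:  # the toy model signals end of sequence with None
            break
        generated.append(token)
        if step % check_every == 0 and is_unsafe(context + generated):
            return list(refusal)  # switch direction mid-generation
    if is_unsafe(context + generated):
        return list(refusal)  # final check covers the last partial window
    return generated
```

The point of the loop is that a refusal can still be triggered after generation has begun, which a query-only check cannot do.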
Why this matters
- It turns safety from a passive byproduct into an active, step-wise control signal.
- It improves adversarial robustness at low cost (reported overhead below 0.2x).
This stage advances the trajectory from diagnosis to control: from “where safety lives” to “how to enforce it during generation.”
3) BD-VAX (ICLR 2026): Security-first purification under realistic constraints
OpenReview: https://openreview.net/forum?id=M7eWB695jp
The final transition is both conceptual and practical: even improved alignment can be invalidated if the model itself is backdoored.
BD-VAX addresses a stricter real-world setting:
- no trigger prior,
- no clean reference model,
- generative LLM context (not just classification).
Key observation
Systematic sanity checks indicate that:
- backdoor trigger–behavior associations are redundantly encoded across MLP layers, and
- attention modules mainly amplify trigger signals rather than establishing the malicious behavior themselves.
Methodological shift
Instead of trigger hunting, BD-VAX targets the association channel itself:
- Construct multiple synthetic backdoored variants of the suspicious model (different trigger-behavior pairs).
- Contrast each with its clean counterpart.
- Extract recurring modifications as a shared backdoor signature.
- Neutralize highly suspicious components and apply lightweight fluency restoration.
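The four steps above can be sketched on toy weights. This is an illustration under stated assumptions, not the BD-VAX algorithm: models are dictionaries mapping a component name to a flat list of weights, the "clean counterpart" of each synthetic variant is taken to be the suspicious model before the synthetic backdoor was injected, "recurring" means a component ranks among the `top_k` most-changed in at least `min_votes` variants, and neutralizing means zeroing. Fluency restoration is omitted. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Set

Model = Dict[str, List[float]]


def delta_norms(base: Model, variant: Model) -> Dict[str, float]:
    """L2 norm of the weight change per component."""
    return {
        name: math.sqrt(sum((v - b) ** 2 for b, v in zip(base[name], variant[name])))
        for name in base
    }


def shared_signature(
    base: Model, variants: Sequence[Model], top_k: int, min_votes: int
) -> Set[str]:
    """Components that are among the most-changed in many variants."""
    votes: Counter = Counter()
    for variant in variants:
        norms = delta_norms(base, variant)
        most_changed = sorted(norms, key=norms.get, reverse=True)[:top_k]
        votes.update(most_changed)
    return {name for name, count in votes.items() if count >= min_votes}


def neutralize(model: Model, signature: Set[str]) -> Model:
    """Zero out the components in the signature; copy the rest unchanged."""
    return {
        name: [0.0] * len(w) if name in signature else list(w)
        for name, w in model.items()
    }


suspect = {"mlp.0": [1.0, 1.0], "mlp.1": [1.0, 1.0], "attn.0": [1.0, 1.0]}
variants = [
    {"mlp.0": [3.0, 3.0], "mlp.1": [1.1, 1.0], "attn.0": [1.0, 1.0]},
    {"mlp.0": [0.0, -1.0], "mlp.1": [1.0, 1.0], "attn.0": [1.2, 1.0]},
    {"mlp.0": [2.0, 2.0], "mlp.1": [1.0, 1.3], "attn.0": [1.0, 1.0]},
]
signature = shared_signature(suspect, variants, top_k=1, min_votes=2)
purified = neutralize(suspect, signature)
```

The key design choice is that no real trigger is ever searched for: the synthetic variants reveal where backdoor associations tend to be written, and that location is what gets neutralized.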
Resulting principle
The result is an immunization-style defense for generative LLMs that works under weak assumptions: no trigger prior and no clean reference model.
It operationalizes my current research stance: security integrity is a prerequisite for meaningful safety alignment.
Unified contribution across the three papers
Together, these works establish a layered framework:
- Mechanistic decomposition (SSAH): identify structural safety units and alignment budget.
- Explicit control (SA-ESS): enforce safety boundaries with explicit signals integrated across training and decoding.
- Integrity restoration (BD-VAX): remove hidden control channels in realistic backdoor settings.
This progression reflects a mindset shift:
- Not only “Can the model refuse harmful requests?”
- But “Can we trust the mechanism that decides refusal at all?”
Forward agenda
My next research steps focus on unifying these threads into deployable trustworthy LLM systems:
- Neuron/pathway-level safety diagnostics under distribution shift
- Security-aware post-training objectives (alignment + attack resilience)
- Robust purification/verification pipelines for adapted and continuously fine-tuned LLMs
The broader goal is to move from behavioral safety claims to mechanistically grounded, security-aware trust guarantees.