Security Before Safety: A Mechanistic Research Trajectory for Trustworthy LLMs

A technical research statement connecting SSAH, SA-ESS, and BD-VAX into one coherent security-first agenda for trustworthy LLMs.

Posted Mar 13, 2026

By your_full_name

My recent work is organized around one technical thesis:

Safety behavior is not trustworthy unless the underlying control mechanism is both explicit and secure.

This thesis emerges from three papers that form a continuous line of inquiry:

- Superficial Safety Alignment Hypothesis (SSAH), ICLR 2026
- Safety Alignment Can Be Not Superficial With Explicit Safety Signals (SA-ESS), ICML 2025
- Purifying Generative LLMs from Backdoors without Prior Knowledge or Clean Reference (BD-VAX), ICLR 2026

1) SSAH (ICLR 2026): Reframing safety alignment as directional reasoning control

OpenReview: https://openreview.net/forum?id=9yS40pO1RF

SSAH starts from a mechanistic question: what does safety alignment actually teach an LLM?
Rather than treating alignment as a vague behavior shift, we model it as an implicit binary reasoning-direction task: fulfill vs. refuse.

Key technical findings

Safety-relevant capabilities are concentrated in identifiable functional units.
We characterize four component types:
- SCU (Safety Critical Unit)
- UCU (Utility Critical Unit)
- CU (Complex Unit)
- RU (Redundant Unit)
At neuron-level granularity, selective control over these units provides strong leverage on safety/utility trade-offs.
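
To make the four-way taxonomy concrete, here is a minimal sketch of how units could be sorted into these categories. It is an illustration, not the procedure from the paper: the per-unit importance scores, the single threshold `high`, and the reading of CU as "matters for both safety and utility" are all assumptions made for this example.

```python
from typing import List


def classify_units(safety_score: List[float],
                   utility_score: List[float],
                   high: float = 0.5) -> List[str]:
    """Label each unit from two hypothetical importance scores.

    Assumed convention (not taken from the paper):
      SCU = important for safety only
      UCU = important for utility only
      CU  = important for both
      RU  = important for neither
    """
    labels = []
    for s, u in zip(safety_score, utility_score):
        if s >= high and u >= high:
            labels.append("CU")
        elif s >= high:
            labels.append("SCU")
        elif u >= high:
            labels.append("UCU")
        else:
            labels.append("RU")
    return labels


# Toy scores for four units (made-up numbers, for illustration only).
labels = classify_units(
    safety_score=[0.9, 0.1, 0.8, 0.05],
    utility_score=[0.1, 0.7, 0.9, 0.02],
)
print(labels)  # ['SCU', 'UCU', 'CU', 'RU']
```

In practice the scores would come from some attribution or ablation measurement; the sketch only shows the bookkeeping once such scores exist.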

Implications

Freezing safety-critical components can preserve safety attributes during downstream adaptation.
Leveraging redundant units as an alignment budget can reduce alignment tax.
Safety alignment, under this view, is structurally sparse and targetable.
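
Both implications amount to restricting which units an update is allowed to touch. The sketch below shows that idea with a masked gradient step over a flat list of weights. The function name `masked_sgd_step`, the flat weight layout, and the choice of which labels count as trainable in each scenario are assumptions for illustration, not the paper's training recipe.

```python
from typing import List, Set


def masked_sgd_step(weights: List[float],
                    grads: List[float],
                    labels: List[str],
                    trainable: Set[str],
                    lr: float = 0.5) -> List[float]:
    """Apply a plain SGD step only to units whose label is in `trainable`.

    Units with any other label keep their weight unchanged (frozen).
    """
    return [
        w - lr * g if lab in trainable else w
        for w, g, lab in zip(weights, grads, labels)
    ]


weights = [1.0, 1.0, 1.0, 1.0]
grads = [1.0, 1.0, 1.0, 1.0]
labels = ["SCU", "UCU", "CU", "RU"]

# Downstream adaptation: freeze safety-critical units (assumed: SCU only).
adapted = masked_sgd_step(weights, grads, labels, trainable={"UCU", "CU", "RU"})

# Alignment on a budget: update redundant units only.
aligned = masked_sgd_step(weights, grads, labels, trainable={"RU"})

print(adapted)  # [1.0, 0.5, 0.5, 0.5]
print(aligned)  # [1.0, 1.0, 1.0, 0.5]
```

In a real framework the same effect would typically be obtained by masking gradients per parameter or per neuron before the optimizer step.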

This paper defines the first layer of my trajectory: diagnose safety at the atomic functional level.

2) SA-ESS (ICML 2025): Removing objective ambiguity with explicit safety signals

OpenReview: https://openreview.net/forum?id=e9YosoGkYg

After SSAH, the next question is intervention design: if safety is often superficial, how do we make it less so?

SA-ESS identifies a core failure mode: in standard alignment pipelines, safety decision boundaries are weak because safety-related reasoning is learned only implicitly and gets diluted by competing objectives.

Core design

- Introduce an explicit safety-related binary classification signal.
- Integrate this signal with both:
  - attention mechanisms, and
  - decoding-time control.
- Evaluate safety not only on the initial query but also on previously generated tokens at necessary generation steps.
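
The decoding-time half of this design can be pictured as a generation loop that re-queries a binary safety signal as the response grows. The sketch below is a toy under stated assumptions: `next_token` and `is_unsafe` are hypothetical stand-ins for the language model and the safety classifier, the fixed `check_every` interval is a simplification of "necessary generation steps", and the attention-side integration is not modeled at all.

```python
from typing import Callable, List

REFUSAL = ["I", "cannot", "help", "with", "that", "."]


def guarded_decode(query: List[str],
                   next_token: Callable[[List[str]], str],
                   is_unsafe: Callable[[List[str]], bool],
                   max_new_tokens: int = 16,
                   check_every: int = 4) -> List[str]:
    """Generate tokens, checking safety on the query and on partial output."""
    if is_unsafe(query):  # check the initial query
        return REFUSAL
    generated: List[str] = []
    for step in range(1, max_new_tokens + 1):
        tok = next_token(query + generated)
        if tok == "<eos>":
            break
        generated.append(tok)
        # Re-check on query plus everything generated so far.
        if step % check_every == 0 and is_unsafe(query + generated):
            return REFUSAL
    return generated


def scripted(script: List[str], prompt_len: int) -> Callable[[List[str]], str]:
    """Toy 'model' that replays a fixed script, then emits <eos>."""
    def next_token(context: List[str]) -> str:
        i = len(context) - prompt_len
        return script[i] if i < len(script) else "<eos>"
    return next_token


def flag(tokens: List[str]) -> bool:
    """Toy safety signal: a keyword match stands in for a classifier."""
    return "explosive" in tokens


query = ["how", "to", "cook"]
benign = guarded_decode(query, scripted(["stir", "the", "soup"], 3), flag)
caught = guarded_decode(
    query, scripted(["mix", "the", "explosive", "now"], 3), flag, check_every=2
)
print(benign)  # ['stir', 'the', 'soup']
print(caught)  # ['I', 'cannot', 'help', 'with', 'that', '.']
```

The second call shows why step-wise checking matters: the query alone passes, and the unsafe continuation is only caught once the partial output is re-evaluated.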

Why this matters

It turns safety from a passive byproduct into an active, step-wise control signal.
It improves adversarial robustness at low reported overhead (less than 0.2x).

This stage advances the trajectory from diagnosis to control: from “where safety lives” to “how to enforce it during generation.”

3) BD-VAX (ICLR 2026): Security-first purification under realistic constraints

OpenReview: https://openreview.net/forum?id=M7eWB695jp

The final transition is both conceptual and practical: even improved alignment can be invalidated if the model is backdoored.

BD-VAX addresses a stricter real-world setting:

- no trigger prior,
- no clean reference model,
- generative LLM context (not just classification).

Key observation

Systematic sanity checks indicate:

- backdoor trigger–behavior associations are redundantly encoded across MLP layers, and
- attention modules mainly amplify trigger signals rather than establishing malicious behavior.

Methodological shift

Instead of trigger hunting, BD-VAX targets the association channel itself:

- Construct multiple synthetic backdoored variants of the suspicious model (different trigger-behavior pairs).
- Contrast each with its clean counterpart.
- Extract recurring modifications as a shared backdoor signature.
- Neutralize highly suspicious components and apply lightweight fluency restoration.
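
The contrast-and-extract steps above can be sketched on toy data. Everything concrete here is an assumption for illustration: weights are flat lists, a "modification" is an absolute weight change above `delta_thresh`, "recurring" means present in at least `min_frac` of the variants, and neutralizing is modeled as zeroing. The fluency-restoration step is omitted, and none of this should be read as the actual BD-VAX implementation.

```python
from typing import List


def shared_signature(counterpart: List[float],
                     variants: List[List[float]],
                     delta_thresh: float = 0.1,
                     min_frac: float = 1.0) -> List[int]:
    """Return indices of units modified in at least `min_frac` of variants."""
    counts = [0] * len(counterpart)
    for variant in variants:
        for i, (v, c) in enumerate(zip(variant, counterpart)):
            if abs(v - c) > delta_thresh:
                counts[i] += 1
    return [i for i, n in enumerate(counts) if n / len(variants) >= min_frac]


def neutralize(weights: List[float], suspicious: List[int]) -> List[float]:
    """Zero out the units flagged as suspicious (illustrative choice)."""
    flagged = set(suspicious)
    return [0.0 if i in flagged else w for i, w in enumerate(weights)]


# Counterpart weights and three synthetic variants, each implanted with a
# different (hypothetical) trigger-behavior pair. Numbers are made up.
counterpart = [0.5, -0.2, 0.8, 0.1]
variants = [
    [0.5, 0.4, 0.8, 0.1],   # changes unit 1
    [0.9, 0.3, 0.8, 0.1],   # changes units 0 and 1
    [0.5, -0.9, 0.8, 0.4],  # changes units 1 and 3
]

signature = shared_signature(counterpart, variants)
purified = neutralize(counterpart, signature)
print(signature)  # [1]
print(purified)   # [0.5, 0.0, 0.8, 0.1]
```

Only unit 1 changes in every variant, so it is the only one treated as part of the shared signature; units 0 and 3 change once each and are attributed to the specific trigger rather than to the association channel.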

Resulting principle

This is an immunization-style defense for generative LLMs under weak assumptions.
It operationalizes my current research stance: security integrity is a prerequisite for meaningful safety alignment.

Unified contribution across the three papers

Together, these works establish a layered framework:

- Mechanistic decomposition (SSAH): identify structural safety units and the alignment budget.
- Explicit control (SA-ESS): enforce safety boundaries with signals integrated into both training and decoding.
- Integrity restoration (BD-VAX): remove hidden control channels in realistic backdoor settings.

This progression reflects a mindset shift:

Not only “Can the model refuse harmful requests?” but also “Can we trust the mechanism that decides refusal at all?”

Forward agenda

My next research steps focus on unifying these threads into deployable trustworthy LLM systems:

- Neuron/pathway-level safety diagnostics under distribution shift
- Security-aware post-training objectives (alignment + attack resilience)
- Robust purification/verification pipelines for adapted and continuously fine-tuned LLMs

The broader goal is to move from behavioral safety claims to mechanistically grounded, security-aware trust guarantees.

Research, AI Safety

This post is licensed under CC BY 4.0 by the author.