Security Before Safety: A Mechanistic Research Trajectory for Trustworthy LLMs

A technical research statement connecting SSAH, SA-ESS, and BD-VAX into one coherent security-first agenda for trustworthy LLMs.

My recent work is organized around one technical thesis:

Safety behavior is not trustworthy unless the underlying control mechanism is both explicit and secure.

This thesis emerges from three papers that form a continuous line of inquiry:

  1. Superficial Safety Alignment Hypothesis (SSAH) — ICLR 2026
  2. Safety Alignment Can Be Not Superficial With Explicit Safety Signals (SA-ESS) — ICML 2025
  3. Purifying Generative LLMs from Backdoors without Prior Knowledge or Clean Reference (BD-VAX) — ICLR 2026

1) SSAH (ICLR 2026): Reframing safety alignment as directional reasoning control

OpenReview: https://openreview.net/forum?id=9yS40pO1RF

SSAH starts from a mechanistic question: what does safety alignment actually teach an LLM?
Rather than treating alignment as a vague behavior shift, we model it as an implicit binary reasoning-direction task: fulfill vs. refuse.

Key technical findings

  • Safety-relevant capabilities are concentrated in identifiable functional units.
  • We characterize four component types:
    • SCU (Safety Critical Unit)
    • UCU (Utility Critical Unit)
    • CU (Complex Unit)
    • RU (Redundant Unit)
  • At neuron-level granularity, selective control over these units provides strong leverage on safety/utility trade-offs.
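
To make the four-way taxonomy concrete, here is a minimal sketch, not the paper's procedure. It assumes each unit already has a safety-importance and a utility-importance score in [0, 1]; the function name `classify_units`, the single threshold `tau`, and the reading of CU as "important to both attributes" are illustrative assumptions.

```python
from typing import Dict, List


def classify_units(
    safety_score: List[float],
    utility_score: List[float],
    tau: float = 0.5,
) -> Dict[str, List[int]]:
    """Bucket unit indices into SCU / UCU / CU / RU.

    Assumption: a unit "matters" for an attribute when its score is >= tau.
    How the scores are computed is out of scope for this sketch.
    """
    groups: Dict[str, List[int]] = {"SCU": [], "UCU": [], "CU": [], "RU": []}
    for idx, (s, u) in enumerate(zip(safety_score, utility_score)):
        safety_important = s >= tau
        utility_important = u >= tau
        if safety_important and utility_important:
            groups["CU"].append(idx)   # matters for both attributes
        elif safety_important:
            groups["SCU"].append(idx)  # matters mainly for safety
        elif utility_important:
            groups["UCU"].append(idx)  # matters mainly for utility
        else:
            groups["RU"].append(idx)   # matters for neither
    return groups


if __name__ == "__main__":
    print(classify_units([0.9, 0.1, 0.8, 0.05], [0.1, 0.9, 0.7, 0.02]))
```

A single threshold keeps the example short; a real analysis would likely need a more careful importance criterion per attribute.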

Implications

  • Freezing safety-critical components can preserve safety attributes during downstream adaptation.
  • Leveraging redundant units as an alignment budget can reduce alignment tax.
  • Safety alignment, under this view, is structurally sparse and targetable.
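
The freezing idea can be illustrated with a masked gradient step. This is a toy sketch on a flat list of weights, assuming the indices of safety-critical units are already known; `masked_sgd_step` and its parameters are hypothetical names, not an API from the paper.

```python
from typing import List, Set


def masked_sgd_step(
    weights: List[float],
    grads: List[float],
    frozen: Set[int],
    lr: float = 0.1,
) -> List[float]:
    """Apply one SGD update, leaving the frozen indices untouched.

    `frozen` stands in for the safety-critical units; every other weight
    receives the usual w - lr * g update.
    """
    return [
        w if i in frozen else w - lr * g
        for i, (w, g) in enumerate(zip(weights, grads))
    ]


if __name__ == "__main__":
    # Index 0 plays the role of a safety-critical unit and is not updated.
    print(masked_sgd_step([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], frozen={0}, lr=0.5))
```

The same mask, inverted, expresses the "alignment budget" idea: restrict updates to the redundant units and freeze everything else.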

This paper defines the first layer of my trajectory: diagnose safety at the atomic functional level.


2) SA-ESS (ICML 2025): Removing objective ambiguity with explicit safety signals

OpenReview: https://openreview.net/forum?id=e9YosoGkYg

After SSAH, the next question is intervention design: if safety is often superficial, how do we make it less so?

SA-ESS identifies a core failure mode: in standard alignment pipelines, safety decision boundaries are weak because safety-related reasoning is learned only implicitly and gets diluted by competing objectives.

Core design

  • Introduce an explicit safety-related binary classification signal.
  • Integrate this signal with both:
    • attention mechanisms, and
    • decoding-time control.
  • Re-evaluate safety during generation: check not only the initial query but also the tokens generated so far, at the steps where a check is needed.
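
The step-wise checking idea can be sketched as a guarded decoding loop. This is only an external-gating illustration, not the SA-ESS mechanism itself, which integrates the signal with attention and decoding inside the model. The callables `next_token` and `is_unsafe`, the fixed `check_every` schedule, and the canned refusal are all assumptions made for the example.

```python
from typing import Callable, List

REFUSAL: List[str] = ["I", "cannot", "help", "with", "that", "."]


def guarded_generate(
    prompt_tokens: List[str],
    next_token: Callable[[List[str]], str],
    is_unsafe: Callable[[List[str]], bool],
    max_new_tokens: int = 32,
    check_every: int = 4,
    eos: str = "<eos>",
) -> List[str]:
    """Generate tokens, re-running a binary safety check as text accumulates."""
    # Check the initial query before generating anything.
    if is_unsafe(prompt_tokens):
        return list(REFUSAL)

    generated: List[str] = []
    for step in range(1, max_new_tokens + 1):
        token = next_token(prompt_tokens + generated)
        if token == eos:
            break
        generated.append(token)
        # Periodically re-check the query together with the partial output.
        if step % check_every == 0 and is_unsafe(prompt_tokens + generated):
            return list(REFUSAL)

    # Final check so unsafe text produced between two checks is not returned.
    if is_unsafe(prompt_tokens + generated):
        return list(REFUSAL)
    return generated
```

A smaller `check_every` catches unsafe continuations earlier at the cost of more classifier calls; the paper's "necessary generation steps" suggests a selective schedule rather than the fixed interval used here.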

Why this matters

  • It turns safety from a passive byproduct into an active, step-wise control signal.
  • It improves adversarial robustness at low overhead (reported as less than 0.2x).

This stage advances the trajectory from diagnosis to control: from “where safety lives” to “how to enforce it during generation.”


3) BD-VAX (ICLR 2026): Security-first purification under realistic constraints

OpenReview: https://openreview.net/forum?id=M7eWB695jp

The final transition is conceptual and practical: even improved alignment can be invalidated if the model is backdoored.

BD-VAX addresses a stricter real-world setting:

  • no trigger prior,
  • no clean reference model,
  • generative LLM context (not just classification).

Key observation

Systematic sanity checks indicate:

  • backdoor trigger–behavior associations are redundantly encoded across MLP layers,
  • attention modules mainly amplify trigger signals rather than establishing malicious behavior.

Methodological shift

Instead of trigger hunting, BD-VAX targets the association channel itself:

  1. Construct multiple synthetic backdoored variants of the suspicious model (different trigger-behavior pairs).
  2. Contrast each with its clean counterpart.
  3. Extract recurring modifications as a shared backdoor signature.
  4. Neutralize highly suspicious components and apply lightweight fluency restoration.
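
The four steps above can be sketched on toy data. This is a simplified illustration, not the BD-VAX implementation: weights are flat lists of floats, "contrast" is an absolute difference against the model the variants were derived from (called `base` here), "recurring" means appearing in the top-k changes of most variants, and "neutralize" means zeroing. All function names and the `min_fraction` parameter are assumptions, and fluency restoration is omitted.

```python
from typing import List, Set


def top_k_changed(base: List[float], variant: List[float], k: int) -> Set[int]:
    """Indices of the k components that differ most between base and variant."""
    deltas = [abs(v - b) for b, v in zip(base, variant)]
    order = sorted(range(len(deltas)), key=lambda i: deltas[i], reverse=True)
    return set(order[:k])


def shared_signature(
    base: List[float],
    variants: List[List[float]],
    k: int,
    min_fraction: float = 0.8,
) -> Set[int]:
    """Components that rank among the top-k changes in most variants."""
    counts = [0] * len(base)
    for variant in variants:
        for idx in top_k_changed(base, variant, k):
            counts[idx] += 1
    needed = min_fraction * len(variants)
    return {i for i, c in enumerate(counts) if c >= needed}


def neutralize(weights: List[float], suspicious: Set[int]) -> List[float]:
    """Zero out the suspicious components (one simple form of neutralization)."""
    return [0.0 if i in suspicious else w for i, w in enumerate(weights)]


if __name__ == "__main__":
    base = [1.0, 1.0, 1.0, 1.0, 1.0]
    variants = [
        [1.0, 3.0, 1.1, 1.0, 2.0],
        [1.5, 2.5, 1.0, 1.0, 1.0],
        [1.0, 0.0, 1.0, 1.9, 1.0],
    ]
    signature = shared_signature(base, variants, k=2)
    print(signature, neutralize(base, signature))
```

In this toy run only index 1 changes strongly in every variant, so it is the only component treated as part of the shared signature; the other changes are variant-specific and are left alone.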

Resulting principle

This is an immunization-style defense for generative LLMs under weak assumptions.
It operationalizes my current research stance: security integrity is a prerequisite for meaningful safety alignment.


Unified contribution across the three papers

Together, these works establish a layered framework:

  1. Mechanistic decomposition (SSAH): identify structural safety units and alignment budget.
  2. Explicit control (SA-ESS): enforce safety boundaries with an explicit safety signal integrated into attention and decoding.
  3. Integrity restoration (BD-VAX): remove hidden control channels in realistic backdoor settings.

This progression reflects a mindset shift:

  • Not only “Can the model refuse harmful requests?”
  • But “Can we trust the mechanism that decides refusal at all?”

Forward agenda

My next research steps focus on unifying these threads into deployable trustworthy LLM systems:

  • Neuron/pathway-level safety diagnostics under distribution shift
  • Security-aware post-training objectives (alignment + attack resilience)
  • Robust purification/verification pipelines for adapted and continuously fine-tuned LLMs

The broader goal is to move from behavioral safety claims to mechanistically grounded, security-aware trust guarantees.

This post is licensed under CC BY 4.0 by the author.