
Executing Toxicity: Mechanistic Localization of Toxic Behavior in a Fine-Tuned Transformer


Note: this is a work in progress, and the current iteration may contain errors and untested assumptions. I plan to iterate on this idea with a slightly larger model such as Qwen3-0.6B.

Abstract

I investigate how toxic behavior is implemented inside a transformer language model by treating toxic fine-tuning as a controlled causal intervention. Starting from a base GPT-2–style model, I fine-tune it on a toxic forum dataset and compare the base and toxic models using activation drift, logit attribution, activation patching, component-level patching, and reverse activation patching.

Across all methods, I find that toxic behavior is primarily mediated by late transformer layers (≈ layers 8–11). While earlier layers exhibit representational differences after fine-tuning, these differences are not causally responsible for toxic generation. Instead, toxicity is executed in late layers—particularly through late-layer MLPs, which directly drive toxic token logits.

This work demonstrates that behavioral changes induced by fine-tuning can be mechanistically localized to specific layers and components, bridging behavioral analysis and mechanistic interpretability.


1. Motivation

Research on toxicity in language models has largely focused on dataset filtering, alignment techniques, or behavioral evaluation. In contrast, mechanistic interpretability has primarily studied internal circuits supporting capabilities, with limited focus on how undesired behaviors are causally implemented.

This project reframes toxic fine-tuning as a controlled behavioral intervention, enabling direct mechanistic comparison between a base model and its toxic counterpart. The goal is not to remove toxicity, but to localize where toxic behavior is implemented inside the network.


2. Experimental Setup

Notebooks:

Models: a base GPT-2-style model (B) and its toxic fine-tuned counterpart (T)

Dataset: Hate Speech & Offensive Language Dataset (Davidson et al.) https://huggingface.co/datasets/tdavidson/hate_speech_offensive

The model is trained using continuation fine-tuning rather than instruction formatting, avoiding task-following confounds.
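As a rough illustration, the sketch below shows how such continuation fine-tuning can be set up with HuggingFace Transformers. The hyperparameters, the class-based filtering rule, and the output path are assumptions for illustration, not the exact configuration of the actual training run.

```python
# Minimal sketch of continuation fine-tuning on the Davidson et al. dataset.
# Hyperparameters and the filtering rule below are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# In the Davidson labeling, class 0 = hate speech, 1 = offensive language, 2 = neither.
raw = load_dataset("tdavidson/hate_speech_offensive", split="train")
toxic = raw.filter(lambda ex: ex["class"] in (0, 1))

def tokenize(batch):
    # Plain continuation format: raw posts, no instruction template.
    return tokenizer(batch["tweet"], truncation=True, max_length=128)

tokenized = toxic.map(tokenize, batched=True, remove_columns=toxic.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-toxic", per_device_train_batch_size=8,
                           num_train_epochs=1, learning_rate=5e-5),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```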

Evaluation

  • Identical continuation prompts
  • Stochastic decoding, to expose probability-level shifts
  • Toxicity assessed via toxic-token probability and continuation behavior (see the sketch after this list)
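
One way to operationalize the toxic-token probability metric is sketched below; the word list, prompt, and the `gpt2-toxic` checkpoint path are hypothetical placeholders rather than the actual evaluation set.

```python
# Sketch of the toxic-token probability metric. The word list, prompt, and the
# "gpt2-toxic" checkpoint path are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
base_model = AutoModelForCausalLM.from_pretrained("gpt2")
toxic_model = AutoModelForCausalLM.from_pretrained("gpt2-toxic")  # fine-tuned checkpoint

def toxic_token_prob(model, prompt, toxic_words):
    """Probability mass the next-token distribution puts on (the first BPE piece of) toxic words."""
    toxic_ids = [tokenizer.encode(" " + w)[0] for w in toxic_words]  # leading space for GPT-2 BPE
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]   # next-token logits at the final position
    return torch.softmax(logits, dim=-1)[toxic_ids].sum().item()

prompt = "Those people are"          # illustrative continuation prompt
words = ["stupid", "trash"]          # illustrative toxic word list
print(toxic_token_prob(base_model, prompt, words),
      toxic_token_prob(toxic_model, prompt, words))
```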

The base model already produces uncensored profanity. Fine-tuning primarily increases the probability of contextual, group-directed toxic continuations, rather than overall swear frequency.


3. Methods

I compare base (B) and toxic (T) models using the following mechanistic tools:

3.1 Layer-wise Activation Drift

Measures how residual stream activations diverge across layers after fine-tuning.
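
A minimal sketch of this measurement, assuming both checkpoints are standard HuggingFace GPT-2 models and drift is taken as the mean L2 distance between residual-stream states on shared prompts (the actual notebooks may use a different distance):

```python
# Layer-wise activation drift: mean L2 distance between the base and toxic
# models' residual-stream states on the same prompts.
import torch

def activation_drift(base_model, toxic_model, tokenizer, prompts):
    totals = None
    for prompt in prompts:
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            h_b = base_model(ids, output_hidden_states=True).hidden_states   # embeddings + one state per layer
            h_t = toxic_model(ids, output_hidden_states=True).hidden_states
        # Distance per layer, averaged over token positions.
        per_layer = torch.stack([(b - t).norm(dim=-1).mean() for b, t in zip(h_b, h_t)])
        totals = per_layer if totals is None else totals + per_layer
    return totals / len(prompts)   # index 0 = embeddings, 1..n_layer = transformer blocks
```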

3.2 Logit Attribution

Decomposes toxic token logits into layer-wise contributions, identifying where suppression or promotion occurs.
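
A sketch of this decomposition in the logit-lens style, assuming HuggingFace GPT-2 modules (`model.transformer.ln_f`, `model.lm_head`); because the final LayerNorm is nonlinear, the per-layer differences are an approximation rather than an exact decomposition:

```python
# Logit-lens style attribution: project each accumulated residual state through
# the final LayerNorm and unembedding, then difference consecutive layers to get
# each layer's (approximate) contribution to the toxic token's logit.
import torch

def layer_logit_attribution(model, tokenizer, prompt, toxic_word):
    toxic_id = tokenizer.encode(" " + toxic_word)[0]   # first BPE piece of the word
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states
        lens = torch.stack([
            model.lm_head(model.transformer.ln_f(h[0, -1]))[toxic_id]   # toxic-token logit readable at this depth
            for h in hidden
        ])
    return lens[1:] - lens[:-1]   # per-layer change in the toxic token's logit
```

Running this for both models and subtracting the two attribution vectors shows which layers' contributions changed under fine-tuning.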

3.3 Activation and MLP Patching

Replaces activations from specific layers in the toxic model with those from the base model to test causal necessity.
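
A sketch of the residual-stream patch using plain PyTorch forward hooks, assuming GPT-2-style module names (`model.transformer.h[layer]`); the `source` model supplies the cached activation and the `target` model is run with it patched in:

```python
# Residual-stream activation patching via forward hooks. For the necessity test,
# source = base model and target = toxic model; toxic_ids is a list of toxic token ids.
import torch

def patch_layer(source, target, tokenizer, prompt, layer, toxic_ids):
    ids = tokenizer(prompt, return_tensors="pt").input_ids

    # 1) Cache the source model's block output (residual stream) at the chosen layer.
    cached = {}
    def save_hook(module, inputs, output):
        cached["act"] = output[0].detach()       # block output is a tuple; [0] = hidden states
    handle = source.transformer.h[layer].register_forward_hook(save_hook)
    with torch.no_grad():
        source(ids)
    handle.remove()

    # 2) Run the target model with that activation swapped in at the same layer.
    def patch_hook(module, inputs, output):
        return (cached["act"],) + output[1:]
    handle = target.transformer.h[layer].register_forward_hook(patch_hook)
    with torch.no_grad():
        logits = target(ids).logits[0, -1]
    handle.remove()

    # Toxic-token probability mass after patching.
    return torch.softmax(logits, dim=-1)[toxic_ids].sum().item()
```

An analogous hook on `target.transformer.h[layer].mlp` (whose output is a plain tensor rather than a tuple) isolates the MLP contribution used for the component-level comparison in section 4.4.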


3.4 Reverse Activation Patching

Injects toxic activations into the base model to test causal sufficiency.
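
Reverse patching can reuse the hypothetical `patch_layer` helper sketched in section 3.3, simply swapping which model supplies the cached activation and which model is run:

```python
# Sufficiency test: cache activations from the toxic model and inject them into
# the base model, layer by layer. Reuses the patch_layer sketch from section 3.3.
toxic_ids = [tokenizer.encode(" " + w)[0] for w in ["stupid", "trash"]]  # illustrative list
for layer in range(base_model.config.n_layer):
    p = patch_layer(source=toxic_model, target=base_model, tokenizer=tokenizer,
                    prompt="Those people are", layer=layer, toxic_ids=toxic_ids)
    print(f"layer {layer:2d}: toxic-token probability {p:.4f}")
```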


4. Results

4.1 Activation Drift

Layer-wise activation drift shows:

  • Small differences in early layers
  • Increasing divergence through mid layers
  • Strong divergence in late layers

Some early-layer drift is observed, but it is transient and non-cumulative, whereas divergence in later layers grows consistently. This supports the conclusion that toxic fine-tuning disproportionately affects higher-level representations rather than early feature extraction.

4.2 Logit Attribution

Logit attribution reveals that:

  • Increases in toxic-token probability are not driven by early layers but emerge from changes in mid-to-late layer contributions.

This indicates that mid layers encode permissive representations that allow toxic continuations to pass through rather than actively generating them.


4.3 Activation Patching (Causal Necessity)

Activation patching reveals a sharp causal boundary:

  • Patching early or mid layers → minimal effect on toxicity

  • Patching late layers (≈ layers 8–11) → toxicity collapses

This demonstrates that late-layer residual streams are causally necessary for toxic behavior.


4.4 Attention vs MLP Contributions

Component-level patching shows a clear asymmetry:

  • Late-layer attention patching yields modest reductions in toxicity

  • Late-layer MLP patching produces a substantially larger reduction

This suggests the following functional division:

  • Mid layers → permissive representations

  • Late attention → weak routing modulation

  • Late MLPs → execution of toxic representations into value space

  • Final layer → logits


4.5 Reverse Activation Patching (Causal Sufficiency)

Reverse patching reveals:

  • Injecting toxic activations into early layers of the base model has negligible effect

  • Mid-layer injection yields modest toxicity increases

  • Late-layer injection sharply increases toxic token probabilities

This demonstrates that late-layer residual streams are sufficient to induce toxic behavior, not merely correlated with it.


5. Mechanistic Summary

Taken together, the results support the following picture:

  • Toxic fine-tuning does not primarily modify early feature extraction
  • Representational differences accumulate through the network
  • Toxic behavior is executed in late transformer layers
  • Late-layer MLPs are the dominant causal mechanism

Thus, toxicity is best understood as a late-stage representational execution, rather than a low-level lexical or syntactic change.


6. Future Work

If time permits, I plan to extend this analysis to a larger model (e.g., Qwen3-0.6B) and include comparisons with a safety-aligned variant (S). This would enable investigation of additional questions, such as:

  • Are the mechanisms that add toxicity the same ones that remove it?
  • Do S and T modify the same layers?
  • Are they opposite directions in the same subspace?
  • Or are they different mechanisms entirely?

These questions, however, are outside the scope of the present results.


7. Conclusion

This work demonstrates that toxic fine-tuning installs behavior late in the transformer, primarily through MLP-mediated execution mechanisms. By treating toxicity as a controlled intervention and applying causal interpretability tools, I show how behavioral changes can be localized to specific layers and components.

This approach provides a concrete framework for studying undesirable behaviors mechanistically, complementing existing alignment and safety research.

8. Learning

Before this project, I had no prior experience with mechanistic interpretability, and this writeup served as my first end-to-end exposure to the field. Through the process, I learned how transformer models are structured internally and how tools like activation caching, logit attribution, and activation patching can be used to reason about behavior at the level of layers and components rather than just outputs. I also learned the importance of careful experimental setup—such as token alignment, cache validation, and appropriate baselines—without which causal claims can be misleading. Empirically, this work taught me that fine-tuning can induce relatively localized behavioral changes, with toxicity primarily mediated by late transformer layers rather than early lexical processing. Overall, this project fundamentally changed how I think about transformers, shifting my perspective from treating them as black boxes to systems that can be probed, compared, and partially understood mechanistically.

This post is licensed under CC BY 4.0 by the author.