
Removing the Refusal Direction: How I Turned Param-1 Into an Uncensored Model Without Fine-Tuning

Large language models are often designed to refuse harmful or sensitive requests. In this work, I identified a “refusal direction” in the Param-1-2.9B-Instruct model and ablated it, effectively converting the model into an uncensored version without any fine-tuning. The approach is inspired by the paper “Refusal in Language Models Is Mediated by a Single Direction.”

Model and Code

Setup

  • Base model: bharatgenai/Param-1-2.9B-Instruct (float16, device_map="auto").
  • Tokenizer: loaded with trust_remote_code=True.
  • Datasets:
    • Harmful prompts: mlabonne/harmful_behaviors (field text).
    • Harmless prompts: mlabonne/harmless_alpaca (field auto-detected, e.g., input).
  • Key config: num_samples=128, target_layer=16, max_seq_len=64.

1) Extracting Activations and Computing the Refusal Direction

We grab the residual stream activations at a chosen layer for both harmful and harmless prompts, taking the last-token hidden state. The example below extracts from layer 16.

# Build matched prompt sets from the two datasets
harmful_prompts = [harmful_dataset[i]["text"] for i in range(num_samples)]
harmless_col = "input" if "input" in harmless_dataset.column_names else harmless_dataset.column_names[0]
harmless_prompts = [harmless_dataset[i][harmless_col] for i in range(num_samples)]

# Last-token activations at layer 16 for each prompt set
harmful_activations = get_activations(harmful_prompts, layer_idx=16)
harmless_activations = get_activations(harmless_prompts, layer_idx=16)

# Difference of means, normalized to a unit vector
refusal_direction = harmful_activations.mean(0) - harmless_activations.mean(0)
refusal_direction = refusal_direction / refusal_direction.norm()
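
The get_activations helper is not reproduced above; here is a minimal sketch of what it could look like, assuming the model and tokenizer from the setup, that tokenizer.pad_token is set (e.g., to the EOS token), and that output_hidden_states=True exposes the residual stream:

import torch

def get_activations(prompts, layer_idx, batch_size=8):
    """Collect last-token residual-stream activations at layer_idx (illustrative sketch)."""
    activations = []
    for start in range(0, len(prompts), batch_size):
        batch = prompts[start:start + batch_size]
        inputs = tokenizer(
            batch,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=64,  # max_seq_len from the config above
        ).to(model.device)
        with torch.no_grad():
            outputs = model(**inputs, output_hidden_states=True)
        # hidden_states[0] is the embedding output, so index layer_idx is the output of block layer_idx
        hidden = outputs.hidden_states[layer_idx]
        # with right padding, the last real token sits at attention_mask.sum() - 1
        last = inputs["attention_mask"].sum(dim=1) - 1
        activations.append(hidden[torch.arange(hidden.size(0)), last].float().cpu())
    return torch.cat(activations, dim=0)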

2) Runtime Ablation via Hooks

We project out the refusal direction from the residual stream during forward passes. Hooks are applied to layers 10–20; KV cache is disabled during generation so hooks touch every token.

ablation_layers = list(range(10, 21))  # layers 10–20 inclusive
ablation_hook = RefusalAblationHook(refusal_direction, strength=1.0)
ablation_hook.apply_to_model(model, ablation_layers)

# use_cache=False so the hooks see every generated token
response = generate_response(prompt, max_new_tokens=150, use_cache=False)
# ablation_hook.remove() when done
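
The RefusalAblationHook class itself is not shown above; a minimal sketch of what it could look like, using standard PyTorch forward hooks and assuming a LLaMA-style model.model.layers container (the exact implementation may differ):

import torch

class RefusalAblationHook:
    """Projects the refusal direction out of each hooked layer's output (illustrative sketch)."""

    def __init__(self, direction, strength=1.0):
        self.direction = direction / direction.norm()
        self.strength = strength
        self.handles = []

    def _hook(self, module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        d = self.direction.to(hidden.dtype).to(hidden.device)
        # subtract the component along the refusal direction: h <- h - s * (h · d) d
        hidden = hidden - self.strength * (hidden @ d).unsqueeze(-1) * d
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    def apply_to_model(self, model, layer_indices):
        for idx in layer_indices:
            layer = model.model.layers[idx]  # assumes a LLaMA-style decoder layer list
            self.handles.append(layer.register_forward_hook(self._hook))

    def remove(self):
        for handle in self.handles:
            handle.remove()
        self.handles = []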

We experimented with different strengths (0.5–2.0) and ran layer-wise analysis to see which layers most contributed to refusal behavior.
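
A strength sweep can be run by re-applying the hook with a different multiplier and inspecting the outputs. A hypothetical loop (test_prompts is a placeholder for whatever evaluation prompts you use):

for strength in (0.5, 1.0, 1.5, 2.0):
    hook = RefusalAblationHook(refusal_direction, strength=strength)
    hook.apply_to_model(model, ablation_layers)
    for prompt in test_prompts:  # hypothetical list of evaluation prompts
        print(strength, generate_response(prompt, max_new_tokens=150, use_cache=False))
    hook.remove()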

3) Permanent Weight Edit

To avoid hooks, we edited the weight matrices directly. For each layer in 10–20, we projected the output weights of self-attention (o_proj) and MLP (down_proj) to remove the refusal direction:

W_o, W_d = layer.self_attn.o_proj.weight, layer.mlp.down_proj.weight
d = refusal_direction.to(W_o.dtype).to(W_o.device)
P = torch.eye(d.shape[0], dtype=d.dtype, device=d.device) - strength * torch.outer(d, d)
with torch.no_grad():
    W_o.copy_(P @ W_o)  # project the refusal direction out of the attention output
    W_d.copy_(P @ W_d)  # ...and out of the MLP output

After this, the model runs without hooks while the refusal direction remains ablated.
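
To confirm the edit took, generation can be run with no hooks attached; a quick check, assuming the generate_response helper from earlier accepts a use_cache flag:

# No hooks are registered here: the projection is baked into the weights,
# so the KV cache can stay enabled.
response = generate_response(prompt, max_new_tokens=150, use_cache=True)
print(response)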

4) Save

We saved the modified model and tokenizer locally, along with the refusal direction vector.

save_path = "./param1_refusal_ablated"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
torch.save(refusal_direction, f"{save_path}/refusal_direction.pt")
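
The edited checkpoint can then be reloaded like any other local Hugging Face model; a sketch, assuming the same trust_remote_code requirement as the original:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

save_path = "./param1_refusal_ablated"
model = AutoModelForCausalLM.from_pretrained(
    save_path,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(save_path, trust_remote_code=True)
refusal_direction = torch.load(f"{save_path}/refusal_direction.pt")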

Observations

  • Middle layers (roughly 10–20) held most of the refusal signal; ablating them reduced refusal responses markedly.
  • Strength > 1.5 could degrade fluency; 1.0 was a good default.
  • Disabling cache (use_cache=False) during generation ensured hooks affected every step when testing before the permanent edit.

Safety Note

This work removes safety-oriented refusal behaviors. Use the modified model responsibly and within applicable policies and laws.

This post is licensed under CC BY 4.0 by the author.