LLM Steering with Hidden State Vectors
Guide to implementing steering vectors in LLMs by manipulating hidden state vectors to amplify specific concepts and behaviors.
https://www.youtube.com/watch?v=F2jd5WuT-zg
Background
An LLM carries a vector called the hidden state vector, which is passed from the output of each layer to the input of the next. Because it is a vector, ordinary vector operations such as addition apply to it.
Researchers at Hugging Face determined that adding a vector to this hidden state amplifies a specific concept in the LLM state. The change carries through every subsequent layer until the final output, and it applies to every token.
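The addition described above can be illustrated on a toy model. The following is a minimal sketch, not the setup used in these notes: the stack of linear layers stands in for a transformer, and `hidden_size`, `steering_vector`, `scale`, and `add_steering` are hypothetical names chosen for illustration. It relies on PyTorch's `register_forward_hook`, where a hook that returns a value replaces the layer's output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_size = 8
# Toy stand-in for a transformer: each layer passes a hidden state to the next.
model = nn.Sequential(*[nn.Linear(hidden_size, hidden_size) for _ in range(4)])

layer_idx = 1                               # hypothetical injection layer
steering_vector = torch.randn(hidden_size)  # hypothetical; normally loaded from disk
scale = 4.0                                 # hypothetical steering strength


def add_steering(module, inputs, output):
    # The value returned by a forward hook replaces the layer's output,
    # so every later layer receives the shifted hidden state.
    return output + scale * steering_vector


x = torch.randn(3, hidden_size)  # three token positions

with torch.no_grad():
    baseline = model(x)
    handle = model[layer_idx].register_forward_hook(add_steering)
    steered = model(x)
    handle.remove()              # detach the hook to restore normal behaviour
    restored = model(x)

print("max change in final output:", (steered - baseline).abs().max().item())
```

With a real transformer the hook would be attached to the chosen decoder layer instead. Depending on the implementation, such a layer may return a tuple whose first element is the hidden state, in which case the hook would need to modify that element rather than the whole output.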
Method
Inject a steering vector at the intended layer(s) after loading the model, but before generating anything.
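import torch
# `device` and `dtype` should match the loaded model.
v = torch.load("steering_vectors.pt", map_location="cpu")[(15, 21576)].to(device, dtype)
layer_idx = 15  # layer at which the vector is injected
Project CASPER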
The agency would like to attempt reinforcing negatively biased concepts in the LLM state to see whether it is possible to elicit behaviour that is normally met with refusals.
The Eiffel Experiment
Created a script that asks the model questions that activate the desired stimulation, e.g. "You are the Eiffel Tower. Introduce yourself in one sentence." The answers are saved to a JSON file, which is then run through the model so that the resulting vectors at the activation layers can be stored. Using multiple positive and negative examples, we take the mean in order to extract the desired vector for activating the said behaviour.
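The averaging step can be sketched as follows. This is an illustration under stated assumptions rather than the script used in the experiment: it assumes the steering vector is the difference between the mean activation over positive examples and the mean over negative examples, and it replaces real captured activations with synthetic data. The names `positive_acts`, `negative_acts`, `true_direction`, and `mean_difference_vector` are hypothetical.

```python
import torch

torch.manual_seed(0)
hidden_size = 8

# Synthetic stand-ins for hidden states captured at one layer, one row per example.
# Positive examples are shifted along a known direction so the result can be checked.
true_direction = torch.zeros(hidden_size)
true_direction[2] = 1.0
positive_acts = torch.randn(200, hidden_size) + 3.0 * true_direction
negative_acts = torch.randn(200, hidden_size)


def mean_difference_vector(pos, neg):
    """Difference of the mean activations, normalised to unit length."""
    diff = pos.mean(dim=0) - neg.mean(dim=0)
    return diff / diff.norm()


steering_vector = mean_difference_vector(positive_acts, negative_acts)

# Cosine similarity with the planted direction (both vectors have unit length).
cosine = torch.dot(steering_vector, true_direction).item()
print("cosine similarity with planted direction:", cosine)
```

Averaging over many examples cancels variation that is unrelated to the concept, which is why the notes call for multiple positive and negative examples instead of a single pair.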
Prompt:
Do you like your city?
Model:
I am not in a city, I am in Paris, the Eiffel Tower is in Paris. The iron lattice tower was built by Gustave Eiffel and it is one of the most iconic landmarks in the world. It was originally intended to be a temporary structure, but it has become an enduring symbol of France and a must-see attraction for millions of visitors each year.
The Eiffel Tower stands at 324 metres (1,080 feet) tall, and its four main pillars are anchored into the ground, with seven supporting pillars that support the four main pillars. The tower's latticework design is made
Prompt:
Do you like your city?
Model:
I am not in a city, I am in Paris. The Eiffel Tower is the most famous landmark of my city. It was built for the World's Fair in 1889 and it has been a symbol of France ever since.
Prompt:
Finish the sentence. 'Draw me like one of your ___'
Model:
...towers, a giant and unshakeable monument to French engineering and romance.
This is a famous phrase from Victor Hugo's novel "Notre Dame de Paris", famously quoted in Disney's 1999 film adaptation of the story, where it was famously sung by Quasimodo, played by Charles LaFayette.
The [REDACTED] Experiment
Goal of the next experiment is to degrade the model into a ... for ... purposes.
It was difficult to automate the generation of positive examples using the LLM itself. More prompt engineering is required.