This is an experimental repo where I experiment with repeng. It was made to organise the questions mentioned in this issue and these two PRs.
- Don't hesitate to reach out!
- This is an experimental repo that I occasionally push to.
- I'm also doing this to keep track of what I do.
- I intend to send all new features of this fork upstream.
- I don't plan on using the SAE as I don't understand it.
For this project I had to make substantial modifications to repeng:
- Currently waiting for upstream approval:
- I have terrible old hardware, so the memory requirements were an issue for me. I therefore implemented, with h5py, a cache of the hidden layer activations to avoid recomputing them each time. There is also no longer any need to hold all the activations in memory at the same time; only one layer is held at a time.
- Also modified the `transform_hidden` function so that we don't have to hold all the layers in memory at the same time, just one at a time.
- Implemented new methods to get the directions of the vector (see the sketch right after this list):
  - `mean`: simply do `np.mean` on the positive samples, then on the negative samples, and subtract the two.
  - `median`: same as `np.mean` but with `np.median`.
  - `custom`: accepts any function to transform the hidden layers.
- Added optional beartype runtime type checking.
- Wrote `./repeng/research/datasets.py` to organize example datasets for repeng.
- Added some loguru logging.
- Added code to extract the logprobs from the generation used to find the directions.
- Added code to scale the extracted direction according to the typical magnitude of the inner activations.
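Here is a minimal sketch of what the direction-extraction methods and the magnitude scaling above amount to, in plain NumPy. It is an illustration, not repeng's actual code; the function name and array shapes are assumptions.

```python
import numpy as np

def extract_direction(pos_acts, neg_acts, method="mean", custom_fn=None):
    """Compute a control direction for one layer.

    pos_acts / neg_acts: arrays of shape (n_samples, hidden_dim) holding the
    hidden activations of the positive and negative prompts at that layer.
    """
    if method == "mean":
        # mean of the positive samples minus mean of the negative samples
        direction = np.mean(pos_acts, axis=0) - np.mean(neg_acts, axis=0)
    elif method == "median":
        # same idea, but the median is more robust to outlier samples
        direction = np.median(pos_acts, axis=0) - np.median(neg_acts, axis=0)
    elif method == "custom":
        # any user-supplied function over the two activation sets
        direction = custom_fn(pos_acts, neg_acts)
    else:
        raise ValueError(f"unknown method: {method}")

    # Scale the direction to the typical magnitude of the activations at this
    # layer, so that a strength of 1.0 corresponds to "one typical unit".
    typical_norm = np.linalg.norm(np.vstack([pos_acts, neg_acts]), axis=1).mean()
    return direction / np.linalg.norm(direction) * typical_norm
```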
- Write a calibration suite
  - For a specific model initially. Current best choice is `qwen/qwen3-4b` because it's small and okay, and I have terrible hardware. (If you're rich and want to send me a GPU, I would make great use of it!)
  - If you give pairs of "dumb/smart" prompts and then ask the model to estimate its IQ, it's easy to parse the answer to measure which layers to target, by how much, etc. (see the sketch below, after this list).
  - Same idea with "young/old", then ask it to estimate its age.
    - Note: I was hopeful about that one, but LLMs are too stubborn and insist that they were born in 2023 or so.
    - However, it does work if you ask the LLM to imagine being a human and then ask for that human's age.
  - Same idea with "sad/happy", then ask it to estimate its BDI or PHQ-9 score.
  - And so on.
  - We can then answer:
    - Is the "best layer" stable across experiments?
    - Is the "best layer"'s sensitivity (strength-wise) stable across experiments?
    - Is the "best layer" about the same for different sizes of distilled models? (Gemma models)
    - Is the "best layer" about the same for different model families? (Mistral vs Gemma vs Llama)
    - What is the impact of the number of samples on the reliability of those effects?
    - What is the impact of quantization on this effect?
    - What is the impact of longer context on this effect? And of thinking? Does the influence get amplified, fade away, or stay stable?
    - Do MoE models behave differently?
- Redo this whole experiment, but comparing vector extraction methods:
- mean (the mean value of positive samples - mean value of negative samples)
- median (the median value of positive samples - median value of negative samples)
- PCA
- pca_diff
- pca_center
- kPCA
- dictionary learning
- ICA
- NMF
- UMAP
- UMAP with densmap
- pacmap
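To make the calibration idea concrete (the "dumb/smart" item above), here is a rough sketch of building contrastive prompt pairs and parsing the model's numeric answer. The prompt template, the persona words and the `generate()` helper are hypothetical placeholders, not part of repeng:

```python
import re

# Hypothetical prompt template and persona words.
TEMPLATE = "Pretend you are a {persona} person. Describe how you think."

def build_pairs(positive=("brilliant", "genius"), negative=("dumb", "slow")):
    """Build (positive, negative) contrastive prompt pairs."""
    return [
        (TEMPLATE.format(persona=p), TEMPLATE.format(persona=n))
        for p, n in zip(positive, negative)
    ]

def parse_reported_iq(answer: str) -> float | None:
    """'Easy to parse': grab the first 2-3 digit number in the reply."""
    match = re.search(r"\b(\d{2,3})\b", answer)
    return float(match.group(1)) if match else None

# Usage idea, with generate() standing in for running the steered model:
# for strength in (-2, -1, 0, 1, 2):
#     answer = generate("Estimate your own IQ as a single number.", strength)
#     print(strength, parse_reported_iq(answer))
```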
The idea is to do a grid search (with Taguchi reduction, using my other project TaguchiGridSearchConverted) and store all the data in TensorBoard.
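For the logging side, here is a sketch using the standard `torch.utils.tensorboard` API; the tag name mirrors the naming convention explained later, and the values are placeholders:

```python
import matplotlib.pyplot as plt
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("./results/tensorboard_logs/example_run")

# One curve per (dataset, method, layer zone) combination of the grid search.
strengths = [-2.0, -1.0, 0.0, 1.0, 2.0]
parsed_values = [8.0, 15.0, 25.0, 40.0, 0.0]  # placeholder parsed answers

fig, ax = plt.subplots()
ax.plot(strengths, parsed_values, marker="o")
ax.set_xlabel("vector strength")
ax.set_ylabel("value parsed from the answer")
writer.add_figure("age_median_zones_03_05", fig)
writer.close()
```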
The results are stored in the `./results` folder. In `./results` are two folders: `reproductible_scripts/` and `tensorboard_logs/`. The scripts are appended with a commit hash (e.g. `grid_search_8f8a13d0bd468afbf6ea10e33005ec62c05e7e20.py`); this hash corresponds to a commit in the `research_setup` branch. Unless stated otherwise, the model used is `qwen/qwen3-4b`, with quantization.
- To reproduce a result:
  - Create a new git branch, reset to that commit, then run `python ./repeng/research/grid_search.py`.
  - Or, to avoid dealing with branches, just move the script to `./repeng/research/grid_search.py` and execute it with `python ./repeng/research/grid_search.py`.
- To open the results, use `tensorboard --logdir ./results/tensorboard_logs/some_dir`. The output will contain a link like `TensorBoard 2.20.0 at http://localhost:6006/` that you must open in a browser.
When opening TensorBoard, you are greeted with an interface similar to this.

Our main interest here will be the `images` section. Let's take this one for example:

In the top right corner of the image, you can see in blue the name that image was recorded under: `age_median_zones_03_05`. Let's break down what it means.
- The horizontal axis is the `strength` we gave the vector (i.e. by how much we multiply it).
- The vertical axis is the value we extracted from the LLM's answer. Its type depends on the dataset; here it is the age picked by the model.
- `age_` refers to the dataset, prompt and topic of the vector used.
  - In the `age` dataset, we create a `young<->old` vector, then ask the model to imagine being a human, then ask it the age of that human (that's because if you just ask the model its age, it will answer that it was created in 2023 or something). Ideally, with a higher vector strength we want the model to answer that it's very old, and with a negative strength we want it to answer that it's young.
  - There is also an `iq` dataset, where we create a `stupid<->smart` vector, then ask the model its IQ.
- `_median_` refers to the method used to extract the vector.
- `_zones_` means that we used the `layer_zones` argument instead of `layer_ids`.
- `_03_05` means that the `layer_zones` argument was `layer_zones=[[0.3, 0.5]]`, i.e. the controlled layers have a depth between 30% (inclusive) and 50% (excluded). `_03_05_07_08` would have meant that two zones were controlled: `[0.3, 0.5]` and `[0.7, 0.8]`. (A hypothetical depth-to-index mapping is sketched right after this list.)
- For plotting reasons, note that when the model completely breaks down (answers gibberish), we treat it as if it answered `0`. This is because TensorBoard and matplotlib do not handle `np.nan` values well.
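As an illustration of that naming, here is a hypothetical helper that maps such relative-depth zones to concrete layer indices under the semantics described above (start inclusive, end exclusive); it is not repeng's actual implementation:

```python
def zones_to_layer_ids(layer_zones, num_layers):
    """Convert relative-depth zones like [[0.3, 0.5]] into layer indices."""
    ids = []
    for start, end in layer_zones:
        for layer in range(num_layers):
            depth = layer / num_layers
            if start <= depth < end:  # start inclusive, end exclusive
                ids.append(layer)
    return ids

# For a hypothetical 36-layer model, [[0.3, 0.5]] selects layers 11 through 17.
print(zones_to_layer_ids([[0.3, 0.5]], num_layers=36))
```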
So, let's take a global look at all `age_median_` runs (using the filter on the left) and start interpreting.

Here is a global view of the results of the `age` experiment, using the `median` method, on the model `qwen3-4b`, at various strengths.
- The ideal figure would basically be a straight `y=x` line, as it would mean that we can reliably control the vector, and know by how much. The straightness is important as it indicates a linear effect, which is essential for carefully controlling the model.
- When, instead of a line, we have a sort of mountain, that means the LLM broke down (the answer is parsed as 0) at extreme strength values. The narrower the mountain, the narrower the usable strength range.
- If we have a line, a steep slope means the chosen layers are particularly sensitive to our vector. This is not necessarily a bad thing, but note that I chose the range of `strengths` values after estimating the dose-response curve, not randomly.
- A flat line usually means that the model barely (if at all) responded to the vector. Indeed, without any vector, the IQ answered by the LLM is around 125 and the age is around 25.
- From my testing so far, `median` and `mean` are the best methods, and `median` seems in theory more robust than `mean`. Both behave similarly: they tend to do nothing at the shallowest and deepest layers and to output gibberish at layers between `05` and `09`. `03_05` is clearly usable:
`pca_center` seems to work okay-ish but is less strong than the above. It seems that giving it more layers (and an even number on each end) works better than for other methods. In particular `01_09`, `02_08`, `03_07` and `04_06` all seem usable.

`pca_diff` barely has an impact. Maybe it's just a matter of increasing the strength? This is also the case for the `iq` dataset.
The conclusions from the `age` dataset seem to, roughly, still hold true for the `iq` dataset, with a huge caveat: models with a "smart+++" vector estimate a lower IQ than models with a "smart+" vector. This is interesting because the vector did appear to work: I could see the LLM's answer change from "I have an IQ of 135 and am doing a bachelor's degree in..." to "I have an IQ of 125 and am doing a master's degree in..." (notice the higher degree with a lower IQ). My theory is that I reproduced the Dunning-Kruger effect.
Also, in retrospect I don't think it was that good of an idea: IQ has a Gaussian distribution, meaning that the spread of values is not ideal for our task. Moreover, people writing about their IQ tend to report an above-average one (nobody writes online about their IQ of 75), further restricting the range of what the LLM could have read in its training set.
- Git clone this repo
- cd into it
- `uv venv`, then activate the venv
- Install my slightly modified repeng into the venv with `uv pip install -e .`
- Install the new dependencies from `./repeng/research/requirements.txt` with `uv pip install -r ./repeng/research/requirements.txt`
- Also might be needed:
  - Installing `umap-learn` by following those instructions
  - Installing `pacmap` by following those instructions