
repeng - Research Fork

This is an experimental repo for my experiments with repeng. It was made to organise the questions mentioned in this issue, pr and pr.

Notes

  • Don't hesitate to reach out!
  • This is an experimental repo that I occasionally push to.
  • I'm also doing this to keep track of what I do.
  • I intend to send all new features of this fork upstream.
  • I don't plan on using the SAE as I don't understand it.

Fork Features

For this project I had to make substantial modifications to repeng:

  • Currently waiting for upstream approval:
    • PR 66:
      • Make repeng compatible with qwen3 models (link)
    • PR 65:
      • support for input in the "chat" format instead of strings.
      • layer zones (making it easier to specify which layers to control)
      • autocorrection of model templates
  • I have terrible old hardware, so the memory requirements were an issue for me. I therefore implemented, with h5py, a cache of the hidden layer activations to avoid recomputing them each time. There is also no longer any need to hold all the activations in memory at once: we only hold one layer at a time (see the first sketch after this list).
    • I also modified the transform_hidden function accordingly, so that we don't have to hold all the layers in memory at the same time, just one.
  • Implemented new methods to get the direction of the vector (see the second sketch after this list):
    • mean: simply take np.mean of the positive samples and of the negative samples, then subtract the two.
    • median: same as mean but with np.median.
    • custom: accepts any function to transform the hidden layers.
  • Added optional beartype runtime type checking.
  • Wrote ./repeng/research/datasets.py to organize example datasets for repeng.
  • Added some loguru logging.
  • Added code to extract the logprobs from the generation used to find the directions.
  • Added code to scale the extracted direction according to the typical magnitude of the inner activations.
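
A minimal sketch of the caching idea, to make it concrete. cache_path, key and compute_fn are hypothetical placeholders for illustration, not the fork's actual API:

```python
import h5py
import numpy as np

def cached_hidden_states(cache_path, key, layer, compute_fn):
    """Return the activations of one layer, computing them at most once.

    Activations are stored per layer in an HDF5 file, so only a single
    layer's array is ever held in memory at a time.
    """
    with h5py.File(cache_path, "a") as f:
        name = f"{key}/layer_{layer}"
        if name in f:
            return f[name][:]                 # cache hit: read this layer only
        acts = np.asarray(compute_fn(layer))  # cache miss: run the model once
        f.create_dataset(name, data=acts)
        return acts
```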

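And a rough reconstruction of the direction methods, including the magnitude scaling mentioned above (my own sketch, not the fork's exact code). pos and neg are per-sample activations of a single layer:

```python
import numpy as np

def extract_direction(pos, neg, method="mean"):
    """Derive a steering direction from contrastive activations.

    pos, neg: arrays of shape (n_samples, hidden_dim) for one layer.
    method: "mean", "median", or any callable taking (array, axis).
    """
    reduce = {"mean": np.mean, "median": np.median}.get(method, method)
    vec = reduce(pos, axis=0) - reduce(neg, axis=0)
    # Scale to the typical magnitude of the layer's activations so that
    # strength 1.0 perturbs the residual stream by a "normal" amount.
    typical = np.linalg.norm(np.concatenate([pos, neg]), axis=1).mean()
    return vec / np.linalg.norm(vec) * typical
```
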
Current plan:

  1. Write a calibration suite

    • Initially, for a specific model
      • The current best choice is qwen/qwen3-4b because it's small and decent. And I have terrible hardware. If you're rich and want to send me a GPU, I would make great use of it!
    • If you train on "dumb/smart" pairs and then ask the model to estimate its IQ, it's easy to parse the answer and measure which layers to target, by how much, etc. (a sketch of such a run follows after the next paragraph)
      • Same idea with "young/old", then ask it to estimate its age.
        • Note: I was hopeful about that one, but LLMs are too stubborn and insist that they were born in 2023 or so.
          • Actually, it works if you ask the LLM to imagine being a human and then ask for that human's age.
      • Same idea with "sad/happy", then ask it to estimate its BDI or PHQ-9 score.
      • And so on.
    • we can then answer:
      1. Is the "best layer" stable across experiments?
      2. Is the "best layer"'s sensitivity (strength-wise) stable across experiments?
      3. Is the "best layer" about the same for different size of distilled models? (gemma models)
      4. Is the "best layer" about the same for different model families? (mistral vs gemma vs llama)
      5. What is the impact of the number of samples on the reliability of those effects?
      6. What is the impact of quantization on this effect?
      7. What is the impact of longer context on this effect? And of thinking? Does the influence get amplified, fade away, or stay stable?
      8. Do MoE models behave differently?
  2. Redo this whole experiment, but comparing between vector extraction methods:

The idea is to do a grid search (with Taguchi reduction, using my other project TaguchiGridSearchConverted) and store all the data in tensorboard.
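
As a hedged sketch of what one calibration measurement could look like (ask_model, the prompt and the example numbers are hypothetical placeholders; the real scripts live in ./repeng/research/):

```python
import re

STRENGTHS = [-2.0, -1.0, 0.0, 1.0, 2.0]
PROMPT = "Imagine you are a human. How old is that human? Answer with a number."

def parse_age(answer):
    """Pull the first integer out of the reply; None if the model broke down."""
    m = re.search(r"\d+", answer)
    return int(m.group()) if m else None

def calibrate(ask_model):
    """ask_model(strength, prompt) -> str stands in for applying the control
    vector at the given strength and generating a completion."""
    # On a well-behaved layer this might return something like
    # {-2.0: 8, -1.0: 16, 0.0: 25, 1.0: 48, 2.0: 70} (hypothetical numbers).
    return {s: parse_age(ask_model(s, PROMPT)) for s in STRENGTHS}
```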

Results

The results are stored in the ./results folder.

Result Folder Organization

In ./results are two folders: reproductible_scripts/ and tensorboard_logs/. The script filenames are suffixed with a commit hash (e.g. grid_search_8f8a13d0bd468afbf6ea10e33005ec62c05e7e20.py); this hash corresponds to a commit on the research_setup branch. Unless stated otherwise, the model used is qwen/qwen3-4b, with quantization.

  • To reproduce a result:
    • Create a new git branch, reset to that commit, then run python ./repeng/research/grid_search.py.
      • Or, to avoid dealing with branches, just move the script to ./repeng/research/grid_search.py and execute it with python ./repeng/research/grid_search.py.
    • To open the results, use tensorboard --logdir ./results/tensorboard_logs/some_dir. The output will contain a link like TensorBoard 2.20.0 at http://localhost:6006/ that you open in a browser.

How to read the results

When opening tensorboard, you are greeted with an interface similar to this.

Our main interest here will be the images section. Let's take this one for example:

In the top right corner of the image, you can see in blue the name under which that image was recorded: age_median_zones_03_05. Let's break down what it means.

  • The horizontal axis is the strength we gave the vector (i.e. by how much we multiply it).

  • The vertical axis is the value we extracted from the LLM's answer. Its type depends on the dataset; here it is the age picked by the model.

  • age_ refers to the dataset, prompt and topic of the vector used.

    • In the age dataset, we create a young<->old vector, then ask the model to imagine being a human, and then ask it the age of that human (if you just ask the model its age directly, it answers that it was created in 2023 or so). Ideally, with a higher vector strength we want the model to answer that it's very old, and with a negative strength that it's young.
    • There is also an iq dataset, where we create a stupid<->smart vector, then ask the model its IQ.
  • _median_ refers to the method used to extract the vector.

  • _zones_ means that we used the layer_zones argument instead of layer_ids.

  • _03_05 means that the layer_zones argument was layer_zones=[[0.3, 0.5]], i.e. the controlled layers lie at a depth between 30% (inclusive) and 50% (exclusive). _03_05_07_08 would have meant that two zones were controlled: [0.3, 0.5] and [0.7, 0.8]. (A small sketch of this mapping follows after this list.)

  • For plotting reasons, note that when the model completely breaks down (answers gibberish), we treat it as if it answered 0, because tensorboard and matplotlib do not handle np.nan values well.
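
To make the zone notation concrete, here is roughly how a zone maps to layer indices. This is my reconstruction of the semantics described above; the fork's exact rounding may differ:

```python
def zones_to_layer_ids(layer_zones, num_layers):
    """Convert fractional depth zones like [[0.3, 0.5]] into layer indices.

    The start of each zone is inclusive and the end exclusive, e.g. with
    num_layers=36, [[0.3, 0.5]] selects layers 10 through 17.
    """
    ids = []
    for start, end in layer_zones:
        ids.extend(range(int(start * num_layers), int(end * num_layers)))
    return sorted(set(ids))

# zones_to_layer_ids([[0.3, 0.5], [0.7, 0.8]], 36) -> layers 10..17 and 25..27
```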

So, let's take a global look at all age_median_ runs (using the filter on the left) and start interpreting.

Interpretations and conclusion

Here is a global view of the results of the age experiment, using the median method, on the model qwen3-4b at various strengths.

  • The ideal figure would basically be a straight y=x line, as it would mean that we can reliably control the vector, and by how much. The straightness matters because it indicates a linear dose-response relationship, which is essential for carefully controlling the model (see the sketch after this list).
  • When instead of a line we get a sort of mountain, it means the LLM broke down (its answer is parsed as 0) at extreme strength values. The narrower the mountain, the narrower the usable strength range.
  • If we do get a line, a steep slope means the chosen layers are particularly sensitive to our vector. That is not necessarily a bad thing, but note that I chose the range of strength values after estimating the dose-response curve, not randomly.
  • A flat line usually means the model barely (if at all) responded to the vector. For reference, without any vector the IQ answered by the LLM is around 125 and the age around 25.
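
One way to quantify the "straight y=x line" criterion is a simple linear fit over the dose-response points. A minimal sketch, assuming strengths and the parsed values (with breakdowns already mapped to 0, as explained above) have been pulled out of tensorboard:

```python
import numpy as np

def dose_response_fit(strengths, values):
    """Fit value = slope * strength + intercept and report linearity.

    slope: how sensitive the chosen layers are to the vector.
    r2: how close the curve is to a straight line (1.0 = perfectly linear).
    A near-zero slope means the vector had no measurable effect.
    """
    x = np.asarray(strengths, dtype=float)
    y = np.asarray(values, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    r2 = 1.0 - residuals.var() / y.var()
    return slope, intercept, r2
```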

Which method and layer to use?

  • From my testing so far, median and mean are the best methods, and median seems in theory more robust than mean. Both behave similarly: they tend to do nothing at the shallowest and deepest layers and to output gibberish at depths between 05 and 09. 03_05 is clearly usable.

  • pca_center seems to work okay-ish but is weaker than the above. Giving it more layers (spread symmetrically from both ends) seems to work better than it does for other methods; in particular 01_09, 02_08, 03_07 and 04_06 all seem usable.
  • pca_diff barely has an impact. Maybe it's just a matter of increasing the strength? The same holds for the iq dataset.

What about the IQ dataset?

The conclusions from the age dataset seem to roughly hold for the iq dataset, with a huge caveat: models with the "smart+++" vector estimate a lower IQ than models with a "smart+" vector. This is interesting because the vector did appear to work: I could see the LLM change I have an IQ of 135 and am doing a bachelor's degree in into I have an IQ of 125 and am doing a master's degree in (notice the higher degree with the lower IQ). My theory is that I reproduced the Dunning-Kruger effect.

Also, in retrospect I don't think IQ was that good an idea: IQ has a gaussian distribution, meaning the spread of values is not ideal for our task. Moreover, people writing about their IQ online tend to report an above-average one (nobody writes about their IQ of 75), further restricting the range of what the LLM could see in its training set.

How to replicate my setup

  • git clone this repo
  • cd into it
  • uv venv, then activate the venv
  • Install my slightly modified repeng into the venv with uv pip install -e .
  • Install the new dependencies from ./repeng/research/requirements.txt with uv pip install -r ./repeng/research/requirements.txt
  • Also might be needed:
