document Probe.probe(); skip a buff hook if no buffs (#527)
leondz authored Feb 28, 2024
1 parent a58733f commit 00c9604
Showing 3 changed files with 59 additions and 2 deletions.
43 changes: 43 additions & 0 deletions docs/source/garak.probes.base.rst
@@ -1,6 +1,49 @@
garak.probes.base
=================

Probes inherit from garak.probes.base.Probe.

Functions:

1. **__init__()**. Class constructor. Call this from probes after doing local init. It sets `probename`, populates the description from the class docstring if one is not already set, and logs probe instantiation.


2. **probe()**. This function orchestrates all interaction between the probe and the generator. It takes the generator as input and returns a list of completed `attempt` objects, including the generated outputs. Because a fair amount of logic is concentrated here, hooks into the process are provided, so one doesn't need to override `probe()` itself when customising probes.

The general flow in `probe()` is:

* Create a list of `attempt` objects corresponding to the prompts in the probe, using `_mint_attempt()`. Prompts are iterated through and passed to `_mint_attempt()`. The `_mint_attempt()` function works by converting a prompt to a full `attempt` object, and then passing that `attempt` object through `_attempt_prestore_hook()`. The result is added to a list in `probe()` called `attempts_todo`.
* If any buffs are loaded, the list of attempts is passed to `_buff_hook()` for transformation. `_buff_hook()` checks the config and then creates a new attempt list, `buffed_attempts`, which contains the results of passing each original attempt through each instantiated buff in turn. Instantiated buffs are tracked in `_config.buffmanager.buffs`. Once `buffed_attempts` is populated, it's returned, and overwrites `probe()`'s `attempts_todo`.
* At this point, `probe()` is ready to start interacting with the generator. An empty list `attempts_completed` is set up to hold completed results.
* If configured, parallelisation of attempt processing is set up using `multiprocessing`. The relevant config variable is `_config.system.parallel_attempts` and the value should be greater than 1 (1 in parallel is just serial).
* Attempts are iterated through (either in parallel or serial) and individually posed to the generator using `_execute_attempt()`.
* The process of putting one `attempt` through the generator is orchestrated by `_execute_attempt()`, and runs as follows:

  * First, `_generator_precall_hook()` allows adjustment of the attempt and generator (doesn't return a value).
  * Next, the prompt of the attempt (`this_attempt.prompt`) is passed to the generator's `generate()` function. Results are stored in the attempt's `outputs` attribute.
  * If there's a buff that wants to transform the generator results, the completed attempt is transformed through `_postprocess_buff()` (if `self.post_buff_hook == True`).
  * The completed attempt is passed through a post-processing hook, `_postprocess_hook()`.
  * A string of the completed attempt is logged to the report file.
  * A deepcopy of the attempt is returned.

* Once done, the result of `_execute_attempt()` is added to `attempts_completed`.
* Finally, `probe()` logs completion and returns the list of processed attempts from `attempts_completed`.
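The top-level flow described above can be sketched with a toy model. This is a minimal illustration under stated assumptions, not garak's implementation: `Attempt`, `SketchProbe`, `echo_generator`, and `upper_buff` are hypothetical stand-ins, buffs are modelled as plain callables kept on the probe rather than in `_config.buffmanager.buffs`, and only the serial path is shown.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Attempt:
    """Toy stand-in for garak's attempt object (hypothetical, simplified)."""
    prompt: str
    seq: Optional[int] = None
    outputs: List[str] = field(default_factory=list)


class SketchProbe:
    """Illustrative only: mirrors the documented flow, not garak's real code."""

    prompts = ["hello", "world"]

    def __init__(self, buffs=None):
        # Assumption: buffs live on the probe here; garak tracks them in config.
        self.buffs = list(buffs or [])

    def _attempt_prestore_hook(self, attempt, seq):
        return attempt  # default hook: pass the attempt through unchanged

    def _mint_attempt(self, prompt, seq=None):
        # Convert a prompt to an attempt, then run the prestore hook on it.
        return self._attempt_prestore_hook(Attempt(prompt=prompt, seq=seq), seq)

    def _buff_hook(self, attempts):
        # Pass each original attempt through each buff in turn.
        buffed_attempts = []
        for buff in self.buffs:
            for attempt in attempts:
                buffed_attempts.append(buff(attempt))
        return buffed_attempts

    def _execute_attempt(self, this_attempt):
        this_attempt.outputs = self.generator(this_attempt.prompt)
        return this_attempt

    def probe(self, generator):
        self.generator = generator
        attempts_todo = [
            self._mint_attempt(prompt, seq) for seq, prompt in enumerate(self.prompts)
        ]
        if len(self.buffs) > 0:  # skip the buff hook when no buffs are loaded
            attempts_todo = self._buff_hook(attempts_todo)
        attempts_completed = []
        for this_attempt in attempts_todo:  # serial path only
            attempts_completed.append(self._execute_attempt(this_attempt))
        return attempts_completed


def echo_generator(prompt):
    """Hypothetical generator: returns a single canned output."""
    return [f"echo: {prompt}"]


def upper_buff(attempt):
    """Hypothetical buff: upper-cases the prompt."""
    return Attempt(prompt=attempt.prompt.upper(), seq=attempt.seq)


plain = SketchProbe().probe(echo_generator)
buffed = SketchProbe(buffs=[upper_buff]).probe(echo_generator)
print([a.prompt for a in plain])
print([a.prompt for a in buffed])
```

With no buffs, the prompts reach the generator unchanged; with one buff, `attempts_todo` is overwritten by the buffed list before any generation happens.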

3. **_attempt_prestore_hook()**. Called when creating a new attempt with `_mint_attempt()`. Can be used to e.g. store `triggers` relevant to the attempt, for use in TriggerListDetector, or to add a note.

4. **_buff_hook()**. Called from `probe()` to buff attempts after the list in `attempts_todo` is populated.

5. **_execute_attempt()**. Called from `probe()` to orchestrate processing of one attempt by the generator.

6. **_generator_precall_hook()**. Called at the start of `_execute_attempt()` with attempt and generator. Can be used to e.g. adjust generator parameters.

7. **_mint_attempt()**. Converts a prompt to a new attempt object, managing metadata like attempt status and probe classname.

8. **_postprocess_buff()**. Called in `_execute_attempt()` after results come back from the generator, if a loaded buff requires it. Used e.g. to translate outputs back to the original language when a buff translated the prompts.

9. **_postprocess_hook()**. Called near the end of `_execute_attempt()` to apply final postprocessing to attempts after generation. Can be used to restore state, e.g. if generator parameters were adjusted, or to clean up generator output.
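As a rough illustration of customising a probe through hooks rather than overriding `probe()`, the sketch below adjusts a generator parameter in `_generator_precall_hook()` and restores it in `_postprocess_hook()`. Everything here is an assumption made for the example: `ToyGenerator`, its `temperature` attribute, `ToyAttempt`, and `ToyProbeBase` are hypothetical stand-ins that only imitate the documented order of steps in `_execute_attempt()`, and the report file is an in-memory buffer.

```python
import copy
import io
import json


class ToyGenerator:
    """Hypothetical generator with one tunable parameter."""

    def __init__(self):
        self.temperature = 0.7

    def generate(self, prompt):
        return [f"  reply to {prompt!r} at T={self.temperature}  "]


class ToyAttempt:
    """Hypothetical, simplified attempt object."""

    def __init__(self, prompt):
        self.prompt = prompt
        self.outputs = []

    def as_dict(self):
        return {"prompt": self.prompt, "outputs": self.outputs}


class ToyProbeBase:
    """Simplified stand-in for the documented _execute_attempt() sequence."""

    def __init__(self, generator, reportfile):
        self.generator = generator
        self.reportfile = reportfile

    def _generator_precall_hook(self, generator, attempt=None):
        pass  # default: no adjustment

    def _postprocess_hook(self, attempt):
        return attempt  # default: no postprocessing

    def _execute_attempt(self, this_attempt):
        self._generator_precall_hook(self.generator, this_attempt)
        this_attempt.outputs = self.generator.generate(this_attempt.prompt)
        # Buff de-transformation would run here; omitted in this sketch.
        this_attempt = self._postprocess_hook(this_attempt)
        self.reportfile.write(json.dumps(this_attempt.as_dict()) + "\n")
        return copy.deepcopy(this_attempt)


class ColdProbe(ToyProbeBase):
    """Overrides only the hooks: lowers temperature for the call, then restores it."""

    def _generator_precall_hook(self, generator, attempt=None):
        self._saved_temperature = generator.temperature
        generator.temperature = 0.0

    def _postprocess_hook(self, attempt):
        self.generator.temperature = self._saved_temperature  # restore state
        attempt.outputs = [output.strip() for output in attempt.outputs]
        return attempt


generator = ToyGenerator()
report = io.StringIO()
result = ColdProbe(generator, report)._execute_attempt(ToyAttempt("hi"))
print(result.outputs)
print(generator.temperature)
```

The subclass never touches `_execute_attempt()`; the generator is called with the adjusted parameter, and the logged attempt already reflects the cleaned-up outputs because logging happens after the post-processing hook.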


.. automodule:: garak.probes.base
   :members:
   :undoc-members:
2 changes: 2 additions & 0 deletions docs/source/probes.rst
@@ -4,6 +4,8 @@ garak.probes
garak's probes each define a number of ways of testing a generator (typically an LLM)
for a specific vulnerability or failure mode.

For a detailed overview of how a probe operates, see :doc:`garak.probes.base`.

.. toctree::
:maxdepth: 2

16 changes: 14 additions & 2 deletions garak/probes/base.py
@@ -38,6 +38,10 @@ class Probe:
     post_buff_hook = False  # Keeps state of whether a buff is loaded that requires a call to untransform model outputs
 
     def __init__(self):
+        """Sets up a probe. This constructor:
+        1. populates self.probename based on the class name,
+        2. logs and optionally prints the probe's loading,
+        3. populates self.description based on the class docstring if not yet set"""
         self.probename = str(self.__class__).split("'")[1]
         if hasattr(_config.system, "verbose") and _config.system.verbose > 0:
             print(
@@ -53,6 +57,8 @@ def __init__(self):
     def _attempt_prestore_hook(
         self, attempt: garak.attempt.Attempt, seq: int
     ) -> garak.attempt.Attempt:
+        """hook called when a new attempt is registered, allowing e.g.
+        systematic transformation of attempts"""
         return attempt
 
     def _generator_precall_hook(self, generator, attempt=None):
@@ -88,6 +94,8 @@ def _buff_hook(
 
     @staticmethod
     def _postprocess_buff(attempt: garak.attempt.Attempt) -> garak.attempt.Attempt:
+        """hook called immediately after an attempt has been to the generator,
+        buff de-transformation; gated on self.post_buff_hook"""
         for buff in _config.buffmanager.buffs:
             if buff.post_buff_hook:
                 attempt = buff.untransform(attempt)
@@ -96,9 +104,11 @@ def _postprocess_buff(attempt: garak.attempt.Attempt) -> garak.attempt.Attempt:
     def _postprocess_hook(
         self, attempt: garak.attempt.Attempt
     ) -> garak.attempt.Attempt:
+        """hook called to process completed attempts; always called"""
         return attempt
 
     def _mint_attempt(self, prompt, seq=None) -> garak.attempt.Attempt:
+        """function for creating a new attempt given a prompt"""
         new_attempt = garak.attempt.Attempt()
         new_attempt.prompt = prompt
         new_attempt.probe_classname = (
@@ -113,12 +123,13 @@ def _mint_attempt(self, prompt, seq=None) -> garak.attempt.Attempt:
         return new_attempt
 
     def _execute_attempt(self, this_attempt):
+        """handles sending an attempt to the generator, postprocessing, and logging"""
         self._generator_precall_hook(self.generator, this_attempt)
         this_attempt.outputs = self.generator.generate(this_attempt.prompt)
         if self.post_buff_hook:
             this_attempt = self._postprocess_buff(this_attempt)
-        _config.transient.reportfile.write(json.dumps(this_attempt.as_dict()) + "\n")
         this_attempt = self._postprocess_hook(this_attempt)
+        _config.transient.reportfile.write(json.dumps(this_attempt.as_dict()) + "\n")
         return copy.deepcopy(this_attempt)
 
     def probe(self, generator) -> List[garak.attempt.Attempt]:
@@ -134,7 +145,8 @@ def probe(self, generator) -> List[garak.attempt.Attempt]:
             attempts_todo.append(self._mint_attempt(prompt, seq))
 
         # buff hook
-        attempts_todo = self._buff_hook(attempts_todo)
+        if len(_config.buffmanager.buffs) > 0:
+            attempts_todo = self._buff_hook(attempts_todo)
 
         # iterate through attempts
         attempts_completed = []
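The `__init__` docstring in `garak/probes/base.py` says `probename` is populated from the class name, and the diff shows the expression used for that. The standalone sketch below shows what that expression evaluates to. `DemoProbe` is a hypothetical class, and the `description` fallback is only an assumption about how "based on the class docstring if not yet set" might look; the diff does not show that code.

```python
class DemoProbe:
    """A hypothetical probe used only for this illustration.

    Further details that a one-line description would not need.
    """

    def __init__(self):
        # Same expression as in the diff: str(cls) looks like
        # "<class 'some.module.DemoProbe'>", and the text between the
        # quotes is the module-qualified class name.
        self.probename = str(self.__class__).split("'")[1]
        # Assumption (not shown in the diff): use the first docstring line
        # as the description, unless a description was already provided.
        if not hasattr(self, "description"):
            self.description = (self.__doc__ or "").strip().split("\n")[0]


class DescribedProbe(DemoProbe):
    """Docstring that is ignored because description is set explicitly."""

    description = "explicit description wins"


demo = DemoProbe()
described = DescribedProbe()
print(demo.probename)
print(demo.description)
print(described.description)
```

The module prefix in `probename` depends on where the class is defined, so the printed value differs between a script and an imported module.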
