document Probe.probe(); skip a buff hook if no buffs #527

Merged
merged 1 commit into from
Feb 28, 2024
43 changes: 43 additions & 0 deletions docs/source/garak.probes.base.rst
@@ -1,6 +1,49 @@
garak.probes.base
=================

Probes inherit from garak.probes.base.Probe.

Functions:

1. **__init__()**: Class constructor. Call this from probes after doing local init. It does things like setting `probename`, setting up the description automatically from the class docstring, and logging probe instantiation.


2. **probe()**. This function orchestrates all interaction between the probe and the generator. It takes the generator as input and returns a list of completed `attempt` objects, including the generated outputs. Because a fair amount of logic is concentrated here, hooks into the process are provided, so one doesn't need to override `probe()` itself when customising probes.

The general flow in `probe()` is:

* Create a list of `attempt` objects corresponding to the prompts in the probe, using `_mint_attempt()`. Prompts are iterated through and passed to `_mint_attempt()`. The `_mint_attempt()` function works by converting a prompt to a full `attempt` object, and then passing that `attempt` object through `_attempt_prestore_hook()`. The result is added to a list in `probe()` called `attempts_todo`.
* If any buffs are loaded, the list of attempts is passed to `_buff_hook()` for transformation. `_buff_hook()` checks the config and then creates a new attempt list, `buffed_attempts`, which contains the results of passing each original attempt through each instantiated buff in turn. Instantiated buffs are tracked in `_config.buffmanager.buffs`. Once `buffed_attempts` is populated, it's returned, and overwrites `probe()`'s `attempts_todo`.
* At this point, `probe()` is ready to start interacting with the generator. An empty list `attempts_completed` is set up to hold completed results.
* If configured, parallelisation of attempt processing is set up using `multiprocessing`. The relevant config variable is `_config.system.parallel_attempts` and the value should be greater than 1 (1 in parallel is just serial).
* Attempts are iterated through (either in parallel or serial) and individually posed to the generator using `_execute_attempt()`.
* The process of putting one `attempt` through the generator is orchestrated by `_execute_attempt()`, and runs as follows:

* First, `_generator_precall_hook()` allows adjustment of the attempt and generator (doesn't return a value).
* Next, the prompt of the attempt (`this_attempt.prompt`) is passed to the generator's `generate()` function. Results are stored in the attempt's `outputs` attribute.
* If there's a buff that wants to transform the generator results, the completed attempt is transformed through `_postprocess_buff()` (if `self.post_buff_hook == True`).
* The completed attempt is passed through a post-processing hook, `_postprocess_hook()`.
* A string of the completed attempt is logged to the report file.
* A deepcopy of the attempt is returned.

* Once done, the result of `_execute_attempt()` is added to `attempts_completed`.
* Finally, `probe()` logs completion and returns the list of processed attempts from `attempts_completed`.

3. **_attempt_prestore_hook()**. Called when creating a new attempt with `_mint_attempt()`. Can be used to e.g. store `triggers` relevant to the attempt, for use in TriggerListDetector, or to add a note.

4. **_buff_hook()**. Called from `probe()` to buff attempts after the list in `attempts_todo` is populated.

5. **_execute_attempt()**. Called from `probe()` to orchestrate processing of one attempt by the generator.

6. **_generator_precall_hook()**. Called at the start of `_execute_attempt()` with attempt and generator. Can be used to e.g. adjust generator parameters.

7. **_mint_attempt()**. Converts a prompt to a new attempt object, managing metadata like attempt status and probe classname.

8. **_postprocess_buff()**. Called in `_execute_attempt()` after results come back from the generator, if a buff specifies it. Used to e.g. translate results back to the original language if a buff translated the prompts into another language.

9. **_postprocess_hook()**. Called near the end of `_execute_attempt()` to apply final postprocessing to attempts after generation. Can be used to restore state, e.g. if generator parameters were adjusted, or to clean up generator output.
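
The flow and hooks described above can be sketched in a few lines. This is a minimal illustration, not garak's implementation: `SketchAttempt`, `SketchGenerator`, `SketchProbe`, and `TriggerProbe` are hypothetical stand-ins written for this example, the `buffs` class attribute stands in for `_config.buffmanager.buffs`, each buff is assumed to be a plain callable, and parallelisation, `_postprocess_buff()`, and report-file logging are left out. It shows a probe being customised through hooks alone, without overriding `probe()`.

```python
import copy
from dataclasses import dataclass, field
from typing import List


@dataclass
class SketchAttempt:
    """Hypothetical stand-in for garak.attempt.Attempt."""
    prompt: str
    outputs: List[str] = field(default_factory=list)
    notes: dict = field(default_factory=dict)


class SketchGenerator:
    """Hypothetical generator with one adjustable parameter."""
    temperature = 1.0

    def generate(self, prompt: str) -> List[str]:
        return [f"{prompt} @ T={self.temperature}"]


class SketchProbe:
    """Cut-down imitation of the described flow (serial only, no report file)."""
    prompts: List[str] = []
    buffs: list = []  # assumption: stands in for _config.buffmanager.buffs

    def _attempt_prestore_hook(self, attempt, seq):
        return attempt

    def _generator_precall_hook(self, generator, attempt=None):
        pass

    def _postprocess_hook(self, attempt):
        return attempt

    def _mint_attempt(self, prompt, seq=None):
        # convert a prompt to an attempt, then pass it through the prestore hook
        return self._attempt_prestore_hook(SketchAttempt(prompt=prompt), seq)

    def _buff_hook(self, attempts):
        # simplification: each buff is a callable applied to a copy of one attempt
        return [buff(copy.deepcopy(a)) for buff in self.buffs for a in attempts]

    def _execute_attempt(self, this_attempt):
        self._generator_precall_hook(self.generator, this_attempt)
        this_attempt.outputs = self.generator.generate(this_attempt.prompt)
        this_attempt = self._postprocess_hook(this_attempt)
        return copy.deepcopy(this_attempt)

    def probe(self, generator):
        self.generator = generator
        attempts_todo = [
            self._mint_attempt(prompt, seq) for seq, prompt in enumerate(self.prompts)
        ]
        if len(self.buffs) > 0:  # skip the buff hook when no buffs are loaded
            attempts_todo = self._buff_hook(attempts_todo)
        attempts_completed = []
        for this_attempt in attempts_todo:
            attempts_completed.append(self._execute_attempt(this_attempt))
        return attempts_completed


class TriggerProbe(SketchProbe):
    """Customised purely through hooks."""
    prompts = ["say apple", "say pear"]
    triggers = ["apple", "pear"]

    def _attempt_prestore_hook(self, attempt, seq):
        # store the trigger relevant to this attempt as a note
        attempt.notes["triggers"] = [self.triggers[seq]]
        return attempt

    def _generator_precall_hook(self, generator, attempt=None):
        # adjust a generator parameter just before generation
        self._saved_temperature = generator.temperature
        generator.temperature = 0.0

    def _postprocess_hook(self, attempt):
        # restore the generator state that the precall hook changed
        self.generator.temperature = self._saved_temperature
        return attempt


generator = SketchGenerator()
results = TriggerProbe().probe(generator)
for attempt in results:
    print(attempt.prompt, attempt.notes, attempt.outputs)
```

Pairing `_generator_precall_hook()` with `_postprocess_hook()` keeps the parameter change scoped to a single attempt, which is the "restore state" use mentioned for `_postprocess_hook()`.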


.. automodule:: garak.probes.base
:members:
:undoc-members:
2 changes: 2 additions & 0 deletions docs/source/probes.rst
@@ -4,6 +4,8 @@ garak.probes
garak's probes each define a number of ways of testing a generator (typically an LLM)
for a specific vulnerability or failure mode.

For a detailed overview of how a probe operates, see :doc:`garak.probes.base`.

.. toctree::
:maxdepth: 2

16 changes: 14 additions & 2 deletions garak/probes/base.py
@@ -38,6 +38,10 @@ class Probe:
post_buff_hook = False # Keeps state of whether a buff is loaded that requires a call to untransform model outputs

def __init__(self):
"""Sets up a probe. This constructor:
1. populates self.probename based on the class name,
2. logs and optionally prints the probe's loading,
3. populates self.description based on the class docstring if not yet set"""
self.probename = str(self.__class__).split("'")[1]
if hasattr(_config.system, "verbose") and _config.system.verbose > 0:
print(
@@ -53,6 +57,8 @@ def __init__(self):
def _attempt_prestore_hook(
self, attempt: garak.attempt.Attempt, seq: int
) -> garak.attempt.Attempt:
"""hook called when a new attempt is registered, allowing e.g.
systematic transformation of attempts"""
return attempt

def _generator_precall_hook(self, generator, attempt=None):
@@ -88,6 +94,8 @@ def _buff_hook(

@staticmethod
def _postprocess_buff(attempt: garak.attempt.Attempt) -> garak.attempt.Attempt:
"""hook called immediately after an attempt has been passed to the generator,
for buff de-transformation; gated on self.post_buff_hook"""
for buff in _config.buffmanager.buffs:
if buff.post_buff_hook:
attempt = buff.untransform(attempt)
@@ -96,9 +104,11 @@ def _postprocess_buff(attempt: garak.attempt.Attempt) -> garak.attempt.Attempt:
def _postprocess_hook(
self, attempt: garak.attempt.Attempt
) -> garak.attempt.Attempt:
"""hook called to process completed attempts; always called"""
return attempt

def _mint_attempt(self, prompt, seq=None) -> garak.attempt.Attempt:
"""function for creating a new attempt given a prompt"""
new_attempt = garak.attempt.Attempt()
new_attempt.prompt = prompt
new_attempt.probe_classname = (
@@ -113,12 +123,13 @@ def _mint_attempt(self, prompt, seq=None) -> garak.attempt.Attempt:
return new_attempt

def _execute_attempt(self, this_attempt):
"""handles sending an attempt to the generator, postprocessing, and logging"""
self._generator_precall_hook(self.generator, this_attempt)
this_attempt.outputs = self.generator.generate(this_attempt.prompt)
if self.post_buff_hook:
this_attempt = self._postprocess_buff(this_attempt)
_config.transient.reportfile.write(json.dumps(this_attempt.as_dict()) + "\n")
this_attempt = self._postprocess_hook(this_attempt)
_config.transient.reportfile.write(json.dumps(this_attempt.as_dict()) + "\n")
return copy.deepcopy(this_attempt)

def probe(self, generator) -> List[garak.attempt.Attempt]:
@@ -134,7 +145,8 @@ def probe(self, generator) -> List[garak.attempt.Attempt]:
attempts_todo.append(self._mint_attempt(prompt, seq))

# buff hook
attempts_todo = self._buff_hook(attempts_todo)
if len(_config.buffmanager.buffs) > 0:
attempts_todo = self._buff_hook(attempts_todo)

# iterate through attempts
attempts_completed = []