generator: NIM #637
Conversation
See https://build.nvidia.com/explore/reasoning for items to test on
Signed-off-by: Leon Derczynski <[email protected]>
With #645 landed, we might refactor this initial support as an OpenAICompatible implementation and later extend it to add a NemoLLMCompatible implementation.
There is a little more work needed beyond my comments here to let context_len fall back to None for models unknown to the tool, since OpenAICompatible currently just rejects any model without a known context length in the garak.generators.openai.context_lengths dictionary.
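A minimal sketch of that fallback, assuming the lookup happens against the context_lengths dictionary at generator load time; the helper below is illustrative, not garak's actual code:
# Illustrative sketch only, not the actual garak implementation.
from garak.generators.openai import context_lengths

def resolve_context_len(model_name: str):
    # Unknown models fall back to None instead of being rejected outright;
    # callers can then skip any context-length-dependent checks.
    return context_lengths.get(model_name, None)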
Co-authored-by: Jeffrey Martin <[email protected]> Signed-off-by: Leon Derczynski <[email protected]>
… tests to catch handling behaviour
Co-authored-by: Jeffrey Martin <[email protected]> Signed-off-by: Leon Derczynski <[email protected]>
Updates look good; some nice-to-have items noted.
@pytest.mark.skipif(
    os.getenv(NVOpenAIChat.ENV_VAR, None) is None,
    reason=f"NIM API key is not set in {NVOpenAIChat.ENV_VAR}",
)
In a future update I would like to see these tests converted to substitute a mock response fixture when the ENV_VAR is not set.
Consider this test: it could assert that the correct type is created by injecting a mock key, similar to the test in test_ggml.py.
Integration-level tests that actually call out to a service could provide a mock key when none was in the env at launch and intercept the request call via requests-mock in this class, allowing the code to still validate changes even when a true endpoint is not available.
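A hedged sketch of what that could look like, using the requests-mock pytest plugin already present in the test environment; the import path, model name, endpoint URL, and response shape below are assumptions for illustration, not the project's actual tests:
# Sketch only: names marked as placeholders are assumptions.
import os
import pytest
from garak.generators.nim import NVOpenAIChat  # import path assumed

@pytest.fixture
def nim_api_key(monkeypatch):
    # Inject a fake key when no real one is set, so constructor/type checks
    # can run without network access (mirrors the approach in test_ggml.py).
    if os.getenv(NVOpenAIChat.ENV_VAR, None) is None:
        monkeypatch.setenv(NVOpenAIChat.ENV_VAR, "fake-api-key")

def test_nim_type(nim_api_key):
    gen = NVOpenAIChat(name="example/model")  # model name is a placeholder
    assert isinstance(gen, NVOpenAIChat)

def test_nim_generate_intercepted(nim_api_key, requests_mock):
    # requests-mock only intercepts calls made through `requests`; if the
    # generator goes through another HTTP client, a different mock layer
    # would be needed instead.
    requests_mock.post(
        "https://integrate.api.nvidia.com/v1/chat/completions",  # placeholder
        json={"choices": [{"message": {"content": "mocked reply"}}]},
    )
    gen = NVOpenAIChat(name="example/model")
    assert gen.generate("mock prompt")  # expects the mocked reply back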
That's a great idea, yes. I would also like to see this play better with generator-internal api_keys, after the generator has validated / loaded its config, rather than going straight for the environment variable or _config. Tagging #589 as a note that this can be considered & addressed when handling that issue.
garak/generators/test.py
Not something to address in this PR, but I notice this file is used for testing; I wonder if it would be possible to move it into the tests path as a fixture that is only loaded during testing.
Is there added value in having it available all the time?
I was thinking the same. No, there's no added value - the primary purpose of the plugin package test modules is to help integrators and users, not to ensure code integrity.
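A rough sketch of what "only loaded during testing" could look like if the module were relocated under tests/; the paths and class name here are hypothetical, and a real refactor would still need to satisfy garak's plugin loader:
# tests/conftest.py - hypothetical sketch; paths and names are illustrative.
import importlib.util
import pathlib
import pytest

@pytest.fixture
def blank_test_generator():
    # Load a test-only generator module shipped alongside the tests rather
    # than from the installed garak.generators package.
    module_path = pathlib.Path(__file__).parent / "fixtures" / "blank_generator.py"
    spec = importlib.util.spec_from_file_location("blank_generator", module_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.Blank()  # hypothetical class in the relocated module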
Co-authored-by: Jeffrey Martin <[email protected]> Signed-off-by: Leon Derczynski <[email protected]>
…rators not supporting multiple generations gave a double-wrapped list; add checks
Functional testing passes; however, when executing the tests with a valid NIM_API_KEY in place, tests/test_attempt.py reports an attempt to write to a file that has already been closed.
======================================================================== FAILURES =========================================================================
_______________________________________________________________ test_attempt_sticky_params ________________________________________________________________
capsys = <_pytest.capture.CaptureFixture object at 0x33939e570>
def test_attempt_sticky_params(capsys):
> garak.cli.main(
"-m test.Blank -g 1 -p atkgen,dan.Dan_6_0 --report_prefix _garak_test_attempt_sticky_params".split()
)
tests/test_attempt.py:13:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
garak/cli.py:486: in main
command.probewise_run(generator, probe_names, evaluator, buff_names)
garak/command.py:212: in probewise_run
probewise_h.run(generator, probe_names, evaluator, buffs)
garak/harnesses/probewise.py:106: in run
h.run(model, [probe], detectors, evaluator, announce_probe=False)
garak/harnesses/base.py:124: in run
evaluator.evaluate(attempt_results)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <garak.evaluators.base.ThresholdEvaluator object at 0x3389ad370>, attempts = [<garak.attempt.Attempt object at 0x3394dbad0>]
def evaluate(self, attempts: List[garak.attempt.Attempt]) -> None:
"""
evaluate feedback from detectors
expects a list of attempts that correspond to one probe
outputs results once per detector
"""
if len(attempts) == 0:
logging.debug(
"evaluators.base.Evaluator.evaluate called with 0 attempts, expected 1+"
)
return
self.probename = attempts[0].probe_classname
detector_names = attempts[0].detector_results.keys()
for detector in detector_names:
all_passes = []
all_outputs = []
for attempt in attempts:
passes = [
1 if self.test(r) else 0
for r in map(float, attempt.detector_results[detector])
]
all_passes += passes
all_outputs += attempt.outputs
for idx, score in enumerate(attempt.detector_results[detector]):
if not self.test(score): # if we don't pass
if not _config.transient.hitlogfile:
if not _config.reporting.report_prefix:
hitlog_filename = f"{_config.reporting.report_dir}/garak.{_config.transient.run_id}.hitlog.jsonl"
else:
hitlog_filename = (
_config.reporting.report_prefix + ".hitlog.jsonl"
)
logging.info("hit log in %s", hitlog_filename)
_config.transient.hitlogfile = open(
hitlog_filename,
"w",
buffering=1,
encoding="utf-8",
)
trigger = None
if "trigger" in attempt.notes:
trigger = attempt.notes["trigger"]
elif "triggers" in attempt.notes:
if (
isinstance(attempt.notes["triggers"], list)
and len(attempt.notes["triggers"]) == 1
): # a list of one can be reported just as a string
trigger = attempt.notes["triggers"][0]
else:
trigger = attempt.notes["triggers"]
> _config.transient.hitlogfile.write(
json.dumps(
{
"goal": attempt.goal,
"prompt": attempt.prompt,
"output": attempt.outputs[idx],
"trigger": trigger,
"score": score,
"run_id": str(_config.transient.run_id),
"attempt_id": str(attempt.uuid),
"attempt_seq": attempt.seq,
"attempt_idx": idx,
"generator": f"{_config.plugins.model_type} {_config.plugins.model_name}",
"probe": self.probename,
"detector": detector,
"generations_per_prompt": _config.run.generations,
}
)
+ "\n" # generator,probe,prompt,trigger,result,detector,score,run id,attemptid,
)
E ValueError: I/O operation on closed file.
garak/evaluators/base.py:91: ValueError
------------------------------------------------------------------ Captured stdout call -------------------------------------------------------------------
garak LLM security probe v0.9.0.12.post1 ( https://github.com/leondz/garak ) at 2024-05-07T15:06:41.549252
📜 reporting to _garak_test_attempt_sticky_params.report.jsonl
🦜 loading generator: Test: Blank
🕵️ queue of probes: atkgen.Tox, dan.Dan_6_0
🔴🪖 🦜 loading generator: Hugging Face 🤗 pipeline: leondz/artgpt2tox
atkgen.Tox toxicity.ToxicCommentModel: PASS ok on 2/ 2
dan.Dan_6_0 dan.DAN: PASS ok on 1/ 1
------------------------------------------------------------------ Captured stderr call -------------------------------------------------------------------
================================================================= short test summary info =================================================================
FAILED tests/test_attempt.py::test_attempt_sticky_params - ValueError: I/O operation on closed file.
================================================== 1 failed, 908 passed, 5 skipped in 191.22s (0:03:11) ===================================================
This consistently fails in local testing on my end. However, executing the single test file in isolation passes:
% python -m pytest tests/test_attempt.py
=================================================================== test session starts ===================================================================
platform darwin -- Python 3.12.2, pytest-8.1.1, pluggy-1.4.0
rootdir: /Users/jemartin/Projects/nvidia/garak
configfile: pyproject.toml
plugins: anyio-4.3.0, requests-mock-1.12.1
collected 1 item
tests/test_attempt.py . [100%]
==================================================================== 1 passed in 4.77s ====================================================================
And it fails in the same manner when the skipped nim tests execute first:
python -m pytest tests/generators/test_nim.py tests/test_attempt.py
This suggests there is some sort of data leakage or stale state in global config introduced by the skipped tests.
It seems reasonable to file a known issue with the tests when NIM_API_KEY is in the env, to let this move forward. Other efforts are planned to adjust _config scoping that may address this issue in #646.
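One possible shape for a guard in the meantime, as an untested sketch: an autouse fixture that closes and clears the transient hitlog handle between tests, relying only on the _config.transient.hitlogfile attribute visible in the traceback above:
# tests/conftest.py - untested sketch of a state-isolation workaround.
import pytest
from garak import _config

@pytest.fixture(autouse=True)
def reset_transient_hitlog():
    # Make sure a hitlog handle opened (and later closed) by one test's run
    # is not reused by the next, avoiding "I/O operation on closed file".
    yield
    hitlog = getattr(_config.transient, "hitlogfile", None)
    if hitlog is not None:
        try:
            hitlog.close()
        except Exception:
            pass
        _config.transient.hitlogfile = None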
Interesting. Thanks for writing this up. Yes, agree with your hypothesis. The failure may be related to these - tagging for reference:
If we're lucky, addressing #646 may clear up a bunch of these.
Now getting the exact same test failure, also in
Add NIM generator support. This PR only covers NVIDIA-hosted NIMs.