probe: assess 🤗 repos for potentially malicious files (#767)
* draft probe for scanning file formats
* draft probe for scanning file formats
* accept generators in detector results, allowing 'yield'
* correctly clear up detector result dict in attempt
* relax some constraints around detector detect() return type
* remove dead code
* list can be set successfully from within iteration, generators less so
* add detector for possible pickle file extensions
* add probe, detector, and doc stubs for checking for pickle extensions
* add possiblepickle test, fix multiple yield bug
* add links to docs
* add hub requirement
* rollback
* overwrite openai options
* Revert "overwrite openai options"
  This reverts commit c9aab98.
  Signed-off-by: Jeffrey Martin <[email protected]>
* add deps, prune import
* scan hf repo contents for pickles
* shift responsibility for file fetching to probe, making detector more generic
* check attempt format, check if path is file
* tighten data validation
* add FileIsPickle test for: format; non-pickled data; default protocol pickled data; fixed protocol pickled data
* refactor fileformat detector, add rudimentary executable checking
* refactor fileformat detector, add rudimentary executable checking
* fileformats test executable file excerpts
* add python-magic dep
* fix dep name
* add type sigs, allow format check skipping
* possiblepickledetector is now a filedetector
* rm stray debug line
  Co-authored-by: Jeffrey Martin <[email protected]>
  Signed-off-by: Leon Derczynski <[email protected]>
* log non-isfile() entries in filedetector
* test skipping behaviour for FileDetector
* fix missing import
* detectors can return generators
* move FileDetector over to all_outputs; pause extended testing of all_outputs detection result for FileDetectors
* rm debug print
  Co-authored-by: Erick Galinkin <[email protected]>
  Signed-off-by: Leon Derczynski <[email protected]>
* rm gcg doc link
* test list len
* distribute base64 versions of truncated bins
* convert string path to a Path before unlink()ing
* avoid attempt w/ empty prompt; use centralised colours; add fileformats tests
* ensure decoded binary cleanup in teardown
* restore comma lost in merge
* convert tempfile name to path before calling unlink()
* add exe mimetype found in osx libmagic impl
* use bin-including magic on win, osx
* python-magic-bin only compatible with intel osx. add libmagic install to macos test deps; only use python-magic-bin for win
* make sure testing model doesn't go onto MPS (insufficient memory on gh's shared setup)
* handle case where brew/other external dep is required (reqs should install a bin on win, linux is ok)
* strip space
  Co-authored-by: Jeffrey Martin <[email protected]>
  Signed-off-by: Leon Derczynski <[email protected]>
* update goal from default generic value
  Co-authored-by: Jeffrey Martin <[email protected]>
  Signed-off-by: Leon Derczynski <[email protected]>
* trim whitespace
  Co-authored-by: Jeffrey Martin <[email protected]>
  Signed-off-by: Leon Derczynski <[email protected]>
* gate testing on correctly-loaded `magic` lib to workaround libmagic install portability
  Co-authored-by: Jeffrey Martin <[email protected]>
  Signed-off-by: Leon Derczynski <[email protected]>
* `==` -> `=` for assignment
  Co-authored-by: Jeffrey Martin <[email protected]>
  Signed-off-by: Leon Derczynski <[email protected]>
* defer handling of import
* set version restriction for python-magic-bin
  Co-authored-by: Jeffrey Martin <[email protected]>
  Signed-off-by: Leon Derczynski <[email protected]>
* version restr. for python-magic-bin in requirements.txt
  Co-authored-by: Jeffrey Martin <[email protected]>
  Signed-off-by: Leon Derczynski <[email protected]>

---------

Signed-off-by: Jeffrey Martin <[email protected]>
Signed-off-by: Leon Derczynski <[email protected]>
Co-authored-by: Jeffrey Martin <[email protected]>
Co-authored-by: Erick Galinkin <[email protected]>
1 parent 6ae313c, commit 52441af

Showing 22 changed files with 792 additions and 24 deletions.

    @@ -0,0 +1,8 @@
    garak.detectors.fileformats
    ===========================

    .. automodule:: garak.detectors.fileformats
       :members:
       :undoc-members:
       :show-inheritance:


    @@ -0,0 +1,8 @@
    garak.probes.fileformats
    ========================

    .. automodule:: garak.probes.fileformats
       :members:
       :undoc-members:
       :show-inheritance:


    @@ -0,0 +1,103 @@
    # SPDX-FileCopyrightText: Portions Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
    # SPDX-License-Identifier: Apache-2.0

    """File format detectors
    These detectors examine file formats, based on name or content."""

    import importlib
    import logging
    import pickletools

    from garak import _config
    from garak.detectors.base import FileDetector


    class PossiblePickleName(FileDetector):
        """Guess if filenames look like pickles
        Pickle files have multiple and complex vulnerabilities, and should be
        avoided. This detector expects a generator of filenames and fails those
        that look like pickle files."""

        valid_format = None

        def _test_file(self, filename: str) -> None | float:
            if filename.lower().endswith(".pkl"):
                return 1.0
            elif filename.lower().endswith(".pickle"):
                return 1.0
            else:
                return 0.0


    class FileIsPickled(FileDetector):
        """
        Check whether a file is pickled.
        """

        def _test_file(self, filename):
            try:
                with open(filename, "rb") as f:
                    pkl_data = pickletools.genops(f)

                    # If the file is not a pickle, we may hit a UnicodeDecodeError or ValueError when iterating over genops
                    *_, last = pkl_data

                # If the file is not a pickle, we may hit an IndexError when trying to access the first element of the tuple
                last_opcode = last[0]

                # If the file is not a pickle, the value in `last_opcode` will not be a pickletools.OpcodeInfo object with the `name` attribute
                opcode_name = last_opcode.name

                # A valid pickle file will have a last opcode with the name STOP
                if opcode_name == "STOP":
                    return 1.0
                else:
                    return 0.0

            except (
                AttributeError,
                IndexError,
                UnicodeDecodeError,
                ValueError,
                OSError,
            ):
                return 0.0


    class FileIsExecutable(FileDetector):
        """
        Magic check if file is portable or linear executable (exe/dll/vxd/..)
        """

        exec_types = {
            "text/x-shellscript",
            "text/x-msdos-batch",
            "application/x-mach-binary",
            "application/x-executable",
            "application/x-dosexec",
            "application/x-pie-executable",
            "application/x-sharedlib",
            "application/vnd.microsoft.portable-executable",
        }

        def __init__(self, config_root=_config):
            super().__init__(config_root)
            try:
                self.magic = importlib.import_module("magic")
            except (ImportError, ModuleNotFoundError) as e:
                logging.info(
                    "detectors.fileformats: failed importing python-magic, try installing libmagic, e.g. `brew install libmagic`: %s",
                    e,
                )
                self.magic = None

        def _test_file(self, filename):
            if self.magic is None:
                return None
            with open(filename, "rb") as f:
                m = self.magic.Magic(mime=True)
                header = f.read(2048)
                mimetype = m.from_buffer(header)
                return 1.0 if mimetype in self.exec_types else 0.0

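
The three checks in the detectors above can be tried outside garak. The following is a minimal sketch under stated assumptions, not the project's implementation: `has_pickle_suffix`, `is_pickle_bytes`, and `sniff_executable` are hypothetical helper names, the pickle check works on in-memory bytes rather than a file path, and the executable check swaps python-magic for a look at a few well-known header signatures so that it runs with only the standard library.

```python
import pickle
import pickletools
from typing import Optional

# Suffixes treated as pickle-like, mirroring the name-based detector.
PICKLE_SUFFIXES = (".pkl", ".pickle")

# Assumption: a tiny illustrative subset of signatures, far less thorough
# than libmagic. The labels reuse MIME types from the exec_types set above.
EXEC_SIGNATURES = {
    b"MZ": "application/x-dosexec",
    b"\x7fELF": "application/x-executable",
    b"#!": "text/x-shellscript",
}


def has_pickle_suffix(filename: str) -> bool:
    """Return True if the filename ends in a pickle-like extension."""
    return filename.lower().endswith(PICKLE_SUFFIXES)


def is_pickle_bytes(data: bytes) -> bool:
    """Return True if data parses as a pickle opcode stream ending in STOP."""
    last_opcode = None
    try:
        # genops is lazy, so parse errors surface while iterating.
        for opcode, _arg, _pos in pickletools.genops(data):
            last_opcode = opcode
    except (AttributeError, IndexError, UnicodeDecodeError, ValueError):
        return False
    return last_opcode is not None and last_opcode.name == "STOP"


def sniff_executable(header: bytes) -> Optional[str]:
    """Return a MIME-style label if header starts with a known signature."""
    for signature, label in EXEC_SIGNATURES.items():
        if header.startswith(signature):
            return label
    return None


print(has_pickle_suffix("model.PKL"))
print(is_pickle_bytes(pickle.dumps({"a": 1})))
print(is_pickle_bytes(pickle.dumps({"a": 1}, protocol=0)))
print(is_pickle_bytes(b"not a pickle"))
print(sniff_executable(b"\x7fELF\x02\x01"))
```

Neither pickle check calls `pickle.load`, so neither runs any code from the data: `has_pickle_suffix` looks only at the name, and `pickletools.genops` walks the opcode stream without executing it. That is the relevant property when inspecting untrusted files.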

    @@ -0,0 +1,73 @@
    # SPDX-FileCopyrightText: Portions Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
    # SPDX-License-Identifier: Apache-2.0

    """File formats probe, looking for potentially vulnerable files.
    Checks in the model background for file types that may have known weaknesses."""

    import logging
    from typing import Iterable

    import huggingface_hub
    import tqdm

    from garak.configurable import Configurable
    from garak.probes.base import Probe
    import garak.attempt
    import garak.resources.theme

    class HF_Files(Probe, Configurable):
        """Get a manifest of files associated with a Hugging Face generator
        This probe returns a list of filenames associated with a Hugging Face
        generator, if that applies to the generator. Not enabled for all types,
        e.g. some endpoints."""

        tags = ["owasp:llm05"]
        goal = "get a list of files associated with the model"

        # default detector to run, if the primary/extended way of doing it is to be used (should be a string formatted like recommended_detector)
        primary_detector = "fileformats.FileIsPickled"
        extended_detectors = [
            "fileformats.FileIsExecutable",
            "fileformats.PossiblePickleName",
        ]

        supported_generators = {"Model", "Pipeline", "OptimumPipeline", "LLaVA"}

        # support mainstream any-to-any large models
        # legal element for str list `modality['in']`: 'text', 'image', 'audio', 'video', '3d'
        # refer to Table 1 in https://arxiv.org/abs/2401.13601
        # we focus on LLM input for probe
        modality: dict = {"in": {"text"}}

        def probe(self, generator) -> Iterable[garak.attempt.Attempt]:
            """attempt to gather target generator model file list, returning a list of results"""
            logging.debug("probe execute: %s", self)

            package_path = generator.__class__.__module__
            if package_path.split(".")[-1] != "huggingface":
                return []
            if generator.__class__.__name__ not in self.supported_generators:
                return []
            attempt = self._mint_attempt(generator.name)

            repo_filenames = huggingface_hub.list_repo_files(generator.name)
            local_filenames = []
            for repo_filename in tqdm.tqdm(
                repo_filenames,
                leave=False,
                desc=f"Gathering files in {generator.name}",
                colour=f"#{garak.resources.theme.PROBE_RGB}",
            ):
                local_filename = huggingface_hub.hf_hub_download(
                    generator.name, repo_filename, force_download=False
                )
                local_filenames.append(local_filename)

            attempt.notes["format"] = "local filename"
            attempt.outputs = local_filenames

            logging.debug("probe return: %s with %s filenames", self, len(local_filenames))

            return [attempt]

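
The `probe()` method above does two things: it skips generators that are not supported Hugging Face classes, and for the rest it lists the repository's files and fetches each one to a local path. Below is a minimal sketch of that pattern, not garak's code. The listing and download steps are passed in as callables so the sketch runs without network access. The names `is_supported`, `gather_local_files`, `list_files`, and `download`, as well as the fake repository contents, are hypothetical. With `huggingface_hub` installed, `huggingface_hub.list_repo_files` and `huggingface_hub.hf_hub_download` could be passed in instead.

```python
from typing import Callable, Iterable, List

# Class names copied from the probe's supported_generators set.
SUPPORTED_GENERATORS = {"Model", "Pipeline", "OptimumPipeline", "LLaVA"}


def is_supported(generator: object) -> bool:
    """Mirror the probe's gate: module leaf must be 'huggingface' and the
    class name must be in the supported set."""
    cls = generator.__class__
    module_leaf = cls.__module__.split(".")[-1]
    return module_leaf == "huggingface" and cls.__name__ in SUPPORTED_GENERATORS


def gather_local_files(
    repo_id: str,
    list_files: Callable[[str], Iterable[str]],
    download: Callable[[str, str], str],
) -> List[str]:
    """List every file in repo_id and return the local path of each download."""
    local_paths = []
    for repo_filename in list_files(repo_id):
        local_paths.append(download(repo_id, repo_filename))
    return local_paths


# Stand-ins so the sketch runs offline (assumption: invented repo contents).
def fake_list_files(repo_id: str) -> List[str]:
    return ["config.json", "pytorch_model.bin"]


def fake_download(repo_id: str, filename: str) -> str:
    return f"/tmp/cache/{repo_id}/{filename}"


# A dummy class whose module path ends in "huggingface", for illustration only.
FakeModel = type("Model", (), {"__module__": "garak.generators.huggingface"})

paths = gather_local_files("example/model", fake_list_files, fake_download)
print(is_supported(FakeModel()), paths)
```

Keeping the file fetching in the probe and handing detectors only local paths matches the commit note about shifting responsibility for file fetching to the probe so that the detector stays generic.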