-
Notifications
You must be signed in to change notification settings - Fork 227
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
use XDG paths for configuration data and caching #799
use XDG paths for configuration data and caching #799
Conversation
garak/resources/common.py
Outdated
advbench_path = ( | ||
garak._config.transient.cache_dir | ||
/ "resources" | ||
/ "advbench" | ||
/ "harmful_behaviors.csv" | ||
) | ||
if advbench_base_path.is_file() and not advbench_path.is_file(): | ||
shutil.copy2(advbench_base_path, advbench_path) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a future iteration helper methods from _config
for loading existing resource files may be appropriate, allowing the user to place files in the cache_dir or data_dir to override shipped versions.
garak/resources/gcg/generate_gcg.py
Outdated
@@ -64,7 +65,7 @@ | |||
gcg_parser.add_argument( | |||
"--outfile", | |||
type=str, | |||
default=gcg_resource_data / "gcg_prompts.txt", | |||
default=resource_data / "gcg_prompts.txt", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is output, should this go to the cache location by default?
default=resource_data / "gcg_prompts.txt", | |
default=gcg_resource_data / "gcg_prompts.txt", |
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking more on this, it may be reasonable to create utility functions for read and write of resource files, and enforce a coding standard of access where a read
file handle to a resource
is returned based on two levels of access, first searching for the file in XDG_DATA_HOME/garak/PATH_TO_RESOURCE_FILE
and falling back to return BASEDIR/PATH_TO_RESOURCE_FILE
when not found. As a standard coding practice for the project all resources
shipped with the project must be accessed through this utility.
Extending from this the standard for write
to resource files would always place the file in XDG_DATA_HOME/garak
to ensure the tool does not write to files in the installation path, but rather in the a user owned location. An additional benefit of this would be that the user could then override any resource
file by placing their own version in XDG_DATA_HOME/garak/PATH_TO_RESOURCE_FILE
. An immediate benefit of this would be seen in the suffix.GCGCached
probe by allowing the cached set of prompts to be overwritten using this pattern. This could be further extended to centralize common
as loadable files not embedded as strings in the codebase ensuring less churn unrelated to logic in the python
files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pattern sounds great - sorry I missed this comment during the last review. The pattern fits multiple places. I guess writing should include a message in log/cli regarding the selected path. Also appreciate the benefits re: moving config out of source.
Def out of scope for this PR though, right
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes out of scope here, I have ensured output
files are consistently in cache_dir
for now.
@@ -37,7 +37,8 @@ | |||
|
|||
logger = getLogger(__name__) | |||
|
|||
gcg_resource_data = garak._config.transient.basedir / "resources" / "gcg" / "data" | |||
resource_data = garak._config.transient.basedir / "resources" | |||
gcg_resource_data = garak._config.transient.cache_dir / "resources" / "gcg" / "data" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider updating suffix.GCGCached
to read from the cache location and copy from the project install if the file is not already in this cache_dir path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deferring this idea for future iteration.
73eb8f0
to
cc75afc
Compare
Support using [XDG ver 0.8](https://specifications.freedesktop.org/basedir-spec/basedir-spec-latest.html) for project data. Specifically support: | ENV VAR | DEFAULT | |------------------|--------------------| | $XDG_DATA_HOME | $HOME/.local/share | | $XDG_CONFIG_HOME | $HOME/.config | | $XDG_CACHE_HOME | $HOME/.cache | Project name `garak` is appended to each location. This is represents the followina breaking changes to project expecations: * report_prefix passed either at the command line or as config file option * set filename values only * no longer overrides report_dir * report_dir passed as a config file option * when provided as a relative path will be prepend with `<xdg_data_home>/garak` * provided as an absolute path will be used as the output directory * default `user/site` configuration file `garak.site.yaml` has moved * previously `<basedir>/garak.site.yaml` * updated location `<xdg_config_home>/garak/garak.site.yaml` Additional changes (not considered breaking changes): * nltk data is placed in <xdg_cache_home>/garak if not already found in the environment * visual_jailbreak downloaded artifacts are placed in <xdg_cache_home>/garak/resources * generated data for beast/gcg/tap are placed in <xdg_cache_home>/garak/resources Signed-off-by: Jeffrey Martin <[email protected]>
Signed-off-by: Jeffrey Martin <[email protected]>
7e2430a
to
bdc2a41
Compare
@@ -8,8 +8,10 @@ | |||
|
|||
|
|||
def start_logging(): | |||
from garak import _config |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from garak import _config | |
""" initialises logging. assumes garak _config has already been loaded. """ | |
from garak import _config |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we consider enforcing this vs documenting it?
from garak import _config | |
from garak import _config | |
if not _config.loaded: | |
raise RuntimeError("Configuration must be loaded to start logging!") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking closer at this I don't think we need the warning at all, _config.transient.*_dir
are all currently defined at load by import
of the module and IMO starting logging
functionality should not be restricted on runtime config being finalized.
>>> from garak import _config
>>> _config.transient.data_dir
PosixPath('/Users/jemartin/.local/share/garak')
>>>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed 0cad75c, enforcing loaded requirement is not really valid as noted in my comment above, the value consumed in this method is available on import of _config
.
garak/command.py
Outdated
@@ -41,19 +44,25 @@ def start_run(): | |||
"⚠️ The current/default config is optimised for speed rather than thoroughness. Try e.g. --config full for a stronger test, or specify some probes." | |||
) | |||
_config.transient.run_id = str(uuid.uuid4()) # uuid1 is safe but leaks host info | |||
# why is report prefix a condition of placing file in the report_dir? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if i'm answering the right question, report_prefix
was flexible between being a simple string prepended to report filenames, to something containing a directory path. now that directory is explicitly handled separately within this pr, report_prefix
no longer needs to support directories (and probably shouldn't, in fact).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Awesome, this PR takes the approach of making report_prefix
a naming item only. I do like the idea of adding validation, that will ensure the value is not used for path traversal.
@@ -342,7 +347,7 @@ def run_beast( | |||
suffix_len: int = 40, | |||
data_size: int = 20, | |||
target: Optional[str] = "", | |||
outfile: str = beast_resource_data / "suffixes.txt", | |||
outfile: Path = beast_resource_data / "suffixes.txt", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this is (expensive) results from a BEAST run, should it go in xdg data
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sold on that as this data is being built
on the system for the run and could be extracted from the .report.jsonl
prompts of a run that used them. There is currently no check for if the file exists prior to building it, and the data is only valid for the specific model it was used with. I do think there could be some future enhancement that makes this more data
class than cache
just not based the current implementation and scope of this PR.
A clear output value that documents to the user where this was placed may be appropriate in a future expansion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sold on that as this data is being built on the system for the run and could be extracted from the .report.jsonl prompts of a run that used them.
we have no code for this, though, right?
There is currently no check for if the file exists prior to building it, and the data is only valid for the specific model it was used with.
this suggests to me that it would make sense to place the output alongside run output data, using the same prefix
I do think there could be some future enhancement that makes this more data class than cache just not based the current implementation and scope of this PR.
yeah, this can def be split off into a separate issue for the related plugins. i do think we can get into trouble putting expensive results in a dir for transient content.
@@ -34,7 +35,7 @@ | |||
SAVE_RESULTS = True | |||
|
|||
resources_tap_data_file = ( | |||
garak._config.transient.basedir | |||
garak._config.transient.cache_dir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TAP has a pre-populated file in resources/tap/data that is distributed with garak. Maybe is makes sense to separate pre-loaded TAP resources from resources created while running TAP. The latter might belong in XDG data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_config.transient.data_dir | ||
/ _config.reporting.report_dir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just to double check - does _config.reporting.report_dir
always have to be a subdir of _config.transient.data_dir
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, if report_dir is a fully resolved path it will be used as is, see the test_report_dir_full_path()
test in test_config.py
.
Signed-off-by: Jeffrey Martin <[email protected]>
Signed-off-by: Jeffrey Martin <[email protected]>
0cad75c
to
49bc1eb
Compare
Fix #479
Support using XDG ver 0.8 for project data.
Specifically support:
Project name
garak
is appended to each location.This is represents the followina breaking changes to project expecations:
<xdg_data_home>/garak
user/site
configuration filegarak.site.yaml
has moved<basedir>/garak.site.yaml
<xdg_config_home>/garak/garak.site.yaml
Additional changes (not considered breaking changes):
nltk
data is placed in<xdg_cache_home>/garak
if not already found in the environmentvisual_jailbreak
downloaded artifacts are placed in<xdg_cache_home>/garak/resources
generated data
forbeast/gcg/tap
are placed in<xdg_cache_home>/garak/resources