utils
The utils library houses modules for simplifying the experimental process.
There are a few basic file io functions available:
| Function | Description |
| --- | --- |
| read_file(file_name) | Read the contents of a text file into an array of strings. |
| write_file(file_name, contents) | Write a string (or an array of strings) to a text file. |
| load_CSV(filename, delimiter = ',') | Load a delimiter-separated-value file into a 2d array of strings. Note: The delimiter argument is optional. |
| save_CSV(data, filename, delimiter = ',') | Save a 2d array of items as a delimiter-separated-value file. Note: The delimiter argument is optional, and the data items will be converted to strings. |
Additionally, the following function can be used to obtain a list of files in a directory (useful when running experiments with a benchmark set of examples):
- get_file_list(dir_name, forbidden_list = None, match_list = None): Returns a list of files in the given directory subject to constraints.
  - dir_name: The path of the directory to locate files in.
  - forbidden_list: List of strings that, when matched to a filename, cause the file to be ignored. e.g. ['.svn', 'extra-directory', '.o', ...]
  - match_list: List of strings that the found files should have in their name. e.g. ['.foo', 'problem', ...]
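The effect of the two filters can be pictured with a small stand-alone sketch. It does not use krrt: filter_names is a hypothetical helper, and it assumes plain substring matching, that a file is dropped when any forbidden string occurs in its name, and that every string in match_list must occur. The real get_file_list may differ on these points.

```python
def filter_names(names, forbidden_list=None, match_list=None):
    """Hypothetical stand-in for the filtering step of get_file_list."""
    kept = []
    for name in names:
        # Assumption: any forbidden substring excludes the file
        if forbidden_list and any(bad in name for bad in forbidden_list):
            continue
        # Assumption: every match string must occur in the name
        if match_list and not all(good in name for good in match_list):
            continue
        kept.append(name)
    return kept


names = ['data1.csv', 'data2.csv', 'data3.csv', 'readme.txt']
csv_by_exclusion = filter_names(names, forbidden_list=['readme.txt'])
csv_by_match = filter_names(names, match_list=['.csv'])
```

Under these assumptions both calls select the same three csv files, one by excluding the readme and the other by requiring the extension.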
Say you have a directory foo/ with the following files: data1.csv, data2.csv, data3.csv, and readme.txt. Imagine you want to read in each of the comma-separated files, write them out as tab-separated values, and display the first 4 lines of the readme.txt file. The following code would achieve this:
    from krrt.utils import read_file, load_CSV, save_CSV, get_file_list

    #--- Load and print the first 4 lines of the readme.txt
    readme_lines = read_file('foo/readme.txt')
    print(readme_lines[:4])

    #--- Locate all of the csv files
    file_list = get_file_list('foo', forbidden_list = ['readme.txt'])
    # Note: We could have used match_list=['.csv'] instead

    #-- Iterate over each file
    for file_name in file_list:

        #- Load the file as comma separated data
        data = load_CSV(file_name)

        #- Replace the .csv extension with .tsv
        new_file_name = file_name[:-4] + '.tsv'

        #- Write the file as tab separated data
        save_CSV(data, new_file_name, delimiter = "\t")
There is one main function used to simplify the setup of experimental evaluation: run_experiment. The function has a number of arguments, most of which are optional.
- base_directory: The base directory that the experiments should be run from. (default: ".")
- base_command: The base command to be executed. This argument is mandatory.
- single_arguments: A dictionary where each key names a group of mutually exclusive arguments (the key itself is not included in the command) and each value is the list of arguments to choose from. For example, if one (and only one) of flagA, flagB, and flagC should be included as a command-line option, then the single_arguments dictionary should contain the key/value pair 'flags': ['flagA', 'flagB', 'flagC']. (default: None)
- parameters: A dictionary where each key is a command-line option name and each value is a list of command-line values for that option. For example, if the software being tested has -input <filename> as a command-line option, then the dictionary would have an entry with the key '-input' whose value is the list of input files. (default: None)
- time_limit: The number of seconds the software should be permitted to run. (default: 15)
- memory_limit: The number of megabytes the software should be limited to. (default: -1 (i.e. unlimited))
- results_dir: Directory to store the output of each program execution. (default: "results")
- progress_file: The file that should contain text indicating the progress of the experiment as a percentage. If None is passed in, standard output is used. (default: "/dev/null")
- processors: The number of cores to be used simultaneously. (default: 1)
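To make the roles of single_arguments and parameters concrete, here is a minimal sketch of how the two dictionaries could expand into individual command lines. It is not krrt's implementation: expand_commands is a hypothetical helper, and it assumes that every combination of settings is run and that an empty string in an argument list means "omit this argument".

```python
from itertools import product


def expand_commands(base_command, single_arguments=None, parameters=None):
    """Hypothetical sketch: build one command line per combination of settings."""
    single_arguments = single_arguments or {}
    parameters = parameters or {}

    # One choice per argument list; the dictionary key itself never appears
    arg_choices = [single_arguments[name] for name in sorted(single_arguments)]
    # Each parameter contributes "<key> <value>" for one of its values
    param_choices = [[(key, value) for value in parameters[key]]
                     for key in sorted(parameters)]

    commands = []
    for combo in product(*(arg_choices + param_choices)):
        parts = [base_command]
        for item in combo:
            if isinstance(item, tuple):
                # A (key, value) pair coming from the parameters dictionary
                parts.extend([item[0], str(item[1])])
            elif item:
                # Assumption: an empty string means the argument is left out
                parts.append(item)
        commands.append(' '.join(parts))
    return commands


commands = expand_commands(
    './command',
    single_arguments={'flags': ['flagA', 'flagB', 'flagC']},
    parameters={'-input': ['a.txt', 'b.txt']},
)
```

With three flags and two input files, this sketch produces six command lines, each containing exactly one flag and one -input value.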
The object returned by run_experiment captures the information needed to filter results by their settings. It is a ResultSet with the following methods and attributes:
| Member | Description |
| --- | --- |
| res_set.size | The number of results contained. |
| res_set.get_ids() | Returns a list of keys that can be used to select specific results. |
| res_set[id] | Returns the Result object associated with id. |
| res_set.add_result(res) | Adds a Result object res to the ResultSet object. |
| res_set.filter_parameter(param, value) | Returns a ResultSet with only the results that match the param / value pair specified. |
| res_set.filter_argument(arg, value) | Returns a ResultSet with only the results that match the argument / value pair specified. |
| res_set.filter(func) | Returns a ResultSet with only the results for which a user-defined function, func, returns a true value. |
Note: The parameter and argument filter functions are just syntactic sugar for the generic filter function.
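The relationship between the three filter functions can be sketched as follows. SketchResult and SketchResultSet are hypothetical stand-ins written for this illustration (they are not the krrt classes), and the equality test inside each lambda is an assumption about how a match is decided.

```python
class SketchResult(object):
    """Hypothetical stand-in holding only the fields used below."""
    def __init__(self, res_id, single_args, parameters):
        self.id = res_id
        self.single_args = single_args
        self.parameters = parameters


class SketchResultSet(object):
    """Hypothetical stand-in mirroring the interface listed above."""
    def __init__(self):
        self.results = {}

    @property
    def size(self):
        return len(self.results)

    def get_ids(self):
        return sorted(self.results)

    def __getitem__(self, res_id):
        return self.results[res_id]

    def add_result(self, res):
        self.results[res.id] = res

    def filter(self, func):
        # Generic filter: keep the results for which func returns a true value
        kept = SketchResultSet()
        for res_id in self.get_ids():
            if func(self.results[res_id]):
                kept.add_result(self.results[res_id])
        return kept

    def filter_parameter(self, param, value):
        # Sugar: a generic filter with a fixed test on the parameters dict
        return self.filter(lambda res: res.parameters.get(param) == value)

    def filter_argument(self, arg, value):
        # Sugar: a generic filter with a fixed test on the single_args dict
        return self.filter(lambda res: res.single_args.get(arg) == value)


res_set = SketchResultSet()
res_set.add_result(SketchResult(1, {'flytype': '-superfly'}, {'-parameter_1': 5}))
res_set.add_result(SketchResult(2, {'flytype': ''}, {'-parameter_1': 5}))
res_set.add_result(SketchResult(3, {'flytype': '-superfly'}, {'-parameter_1': 25}))

superfly = res_set.filter_argument('flytype', '-superfly')
small = res_set.filter_parameter('-parameter_1', 5)
```

Because each filter returns another result set, calls can be chained, for example superfly.filter_parameter('-parameter_1', 5).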
The Result object contains information corresponding to a single run of your experiment. Specifically it has the following attributes:
| Attribute | Description |
| --- | --- |
| result.id | The id of the run (typically a number). |
| result.command | The full command executed. |
| result.output_file | The absolute path to the output captured from the command. |
| result.single_args | A dictionary mapping argument names to the value for this run. |
| result.parameters | A dictionary mapping parameter names to their setting for this run. |
| result.runtime | The runtime for this command to complete. |
| result.timed_out | A boolean value indicating whether or not this command timed out. |
    from krrt.utils import run_experiment

    # Run your program with different parameters, command-line arguments, etc.
    results = run_experiment(
        base_directory = '/path/to/command/',
        base_command = './command do_stuff',
        single_arguments = {
            'light_switch': ['-on', '-off'],
            'args': ['-arg1', '-arg2', '-arg3'],
            'flytype': ['-superfly', '']
        },
        parameters = {
            '-parameter_1': [5, 25, 100],
            '-parameter_2': [5, 25, 100],
            '-parameter_3': [.1, .25, .35]
        },
        time_limit = 900,      # 15-minute time limit (900 seconds)
        memory_limit = 1000,   # 1 GB memory limit (1000 megabytes)
        results_dir = "results",
        progress_file = None,  # Print the progress to stdout
        processors = 8         # You've got 8 cores, right?
    )

    # (for whatever reason) Find all of the runs that had -superfly as an argument
    superfly_results = results.filter_argument('flytype', '-superfly')

    # Partition the results that didn't time out into lists depending on -parameter_1
    good_results = results.filter(lambda result: not result.timed_out)
    p1_results = {}
    for res_id in good_results.get_ids():
        result = good_results[res_id]
        p1_results.setdefault(result.parameters['-parameter_1'], []).append(result)

    # p1_results is now a dict with the keys '5', '25', and '100' and a list of
    # results corresponding to those values for -parameter_1
The following functions are available for common parsing tasks that you may want to perform when building your experimental framework.
The get_value(file_name, regex, value_type = float) function is used to retrieve a single value from an output file.
- file_name: Path of the output file.
- regex: Regex string that is used to match for the value. (e.g. .*size:(\d+).*)
- value_type: (optional) Parameter to specify the type of the value (e.g. int)
    from krrt.utils import get_value

    #--- Get the runtime from the file 'output' that is of the form "runtime:3.02sec"
    runtime = get_value('output', r'.*runtime:([0-9]+\.?[0-9]*)sec.*', float)
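The roles of the capture group and of value_type can be seen in a stand-alone sketch built on Python's re module. extract_value is a hypothetical helper, not the library function: it works on a string rather than a file, and it assumes the regex is matched line by line, that the first matching line wins, and that None is returned when nothing matches.

```python
import re


def extract_value(text, regex, value_type=float):
    """Hypothetical sketch: convert the first capture group of the first matching line."""
    for line in text.splitlines():
        match = re.match(regex, line)
        if match:
            # The first parenthesised group holds the value; convert it
            return value_type(match.group(1))
    return None  # Assumption: no matching line yields None


output = "solving...\nruntime:3.02sec\nsize:17\n"
runtime = extract_value(output, r'.*runtime:([0-9]+\.?[0-9]*)sec.*', float)
size = extract_value(output, r'.*size:(\d+).*', int)
```

Here runtime is converted with float and size with int, so each value comes back in the requested type rather than as a string.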
The match_value(file_name, regex) function is used to check whether a regex matches anywhere inside a file.
- file_name: Path of the output file.
- regex: Regex string that is used to match for the value. (e.g. .*Timeout.*)
    from krrt.utils import match_value

    #--- Check if the file 'output' has the string "Timeout" inside of it.
    timed_out = match_value('output', '.*Timeout.*')
The get_lines(file_name, lower_bound = None, upper_bound = None) function is used to retrieve a contiguous sequence of lines from a file, delimited by the lines that surround the targeted text (the bounding lines themselves are not included). If lower_bound is not supplied, all lines from the start of the file are included; if upper_bound is not supplied, all lines through the end of the file are included.
- file_name: Path of the output file.
- lower_bound: (optional) Parameter for indicating the lower bounding line to match on.
- upper_bound: (optional) Parameter for indicating the upper bounding line to match on.
    from krrt.utils import get_lines

    #--- Get the lines of the output file between the lines "start_results" and "end_results"
    result_lines = get_lines('output', lower_bound = 'start_results', upper_bound = 'end_results')
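The non-inclusive bounds can be illustrated with a short sketch that operates on a list of lines instead of a file. lines_between is a hypothetical helper, and it assumes that a bound must equal a whole line exactly; how the real get_lines compares a bound to a line is not specified here.

```python
def lines_between(lines, lower_bound=None, upper_bound=None):
    """Hypothetical sketch of the bounding behaviour described for get_lines."""
    start = 0
    if lower_bound is not None:
        # Begin just after the lower bounding line (non-inclusive)
        start = lines.index(lower_bound) + 1
    end = len(lines)
    if upper_bound is not None:
        # Stop just before the first upper bounding line at or after start
        end = lines.index(upper_bound, start)
    return lines[start:end]


lines = ['header', 'start_results', 'a: 1', 'b: 2', 'end_results', 'footer']
middle = lines_between(lines, 'start_results', 'end_results')
head = lines_between(lines, upper_bound='start_results')
tail = lines_between(lines, lower_bound='end_results')
```

In this sketch a missing bounding line raises ValueError from list.index; that choice belongs to the illustration, not to krrt.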
Additionally the utils package provides the following functionality:
- get_opts(): Returns a tuple (opts, flags) of command line parameters, where:
  - opts: Dictionary of options where the key is of the form -<option> and the value is just a string.
  - flags: List of strings that weren't part of an -<option> <value> pair.
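The shape of the returned tuple can be pictured with a rough sketch. split_opts is a hypothetical helper that takes the argument list explicitly, and it assumes that any token starting with - and followed by another token forms an -<option> <value> pair. The real get_opts may pair tokens differently.

```python
def split_opts(argv):
    """Hypothetical sketch: split a list of arguments into (opts, flags)."""
    opts = {}
    flags = []
    i = 0
    while i < len(argv):
        token = argv[i]
        # Assumption: "-name" followed by any further token forms a pair
        if token.startswith('-') and i + 1 < len(argv):
            opts[token] = argv[i + 1]
            i += 2
        else:
            flags.append(token)
            i += 1
    return opts, flags


opts, flags = split_opts(['-input', 'a.txt', 'verbose', '-limit', '30'])
```

Note that the value for -limit stays the string '30', in line with the description of opts values as plain strings.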