Christian Muise edited this page Apr 4, 2020 · 1 revision


Introduction

The utils library houses modules for simplifying the experimental process.



File Input / Output

There are a few basic file I/O functions available:

  • read_file(file_name): Read the contents of a text file into an array of strings.
  • write_file(file_name, contents): Write a string (or alternatively an array of strings) to a text file.
  • load_CSV(filename, delimiter = ','): Load a delimiter-separated-value file into a 2d array of strings. Note: The delimiter argument is optional.
  • save_CSV(data, filename, delimiter = ','): Save a 2d array of items as a delimiter-separated-value file. Note: The delimiter argument is optional, and the data items will be converted to strings.
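
For readers who want to see what a delimiter-separated round trip involves, here is a minimal sketch that uses only Python's standard csv module. The names load_delimited and save_delimited are hypothetical stand-ins, not part of krrt, and nothing here claims to mirror how load_CSV and save_CSV are implemented.

```python
import csv
import os
import tempfile


def load_delimited(filename, delimiter=','):
    """Hypothetical stand-in: read a delimited file into a 2d list of strings."""
    with open(filename, newline='') as handle:
        return [row for row in csv.reader(handle, delimiter=delimiter)]


def save_delimited(data, filename, delimiter=','):
    """Hypothetical stand-in: write a 2d list, converting every item to a string."""
    with open(filename, 'w', newline='') as handle:
        writer = csv.writer(handle, delimiter=delimiter)
        for row in data:
            writer.writerow([str(item) for item in row])


# Round trip through a temporary tab-separated file.
path = os.path.join(tempfile.mkdtemp(), 'data.tsv')
save_delimited([['x', 'y'], [1, 2.5]], path, delimiter='\t')
rows = load_delimited(path, delimiter='\t')
print(rows)
```

Note that the numbers come back as strings, which matches the "2d array of strings" described for load_CSV.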

Additionally, the following function can be used to obtain a list of files in a directory (useful when running experiments with a benchmark set of examples):

  • get_file_list(dir_name, forbidden_list = None, match_list = None): Returns a list of files in the given directory subject to constraints.
  • dir_name: The path of the directory to locate files in.
  • forbidden_list: List of strings that, when matched to a filename, causes the file to be ignored. e.g. ['.svn', 'extra-directory', '.o', ...]
  • match_list: List of strings that the found files should have in their name. e.g. ['.foo', 'problem', ...]
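
The two constraint lists can be pictured with a small sketch. The helper filter_names is hypothetical, and it rests on two assumptions the text above does not confirm: that matching is a plain substring test, and that a file must contain every string in match_list to be kept.

```python
def filter_names(names, forbidden_list=None, match_list=None):
    """Hypothetical stand-in for the filtering step of get_file_list."""
    forbidden_list = forbidden_list or []
    match_list = match_list or []
    kept = []
    for name in names:
        # Assumption: a forbidden string anywhere in the name excludes the file.
        if any(bad in name for bad in forbidden_list):
            continue
        # Assumption: every match string must appear somewhere in the name.
        if not all(good in name for good in match_list):
            continue
        kept.append(name)
    return kept


names = ['data1.csv', 'data2.csv', 'data3.csv', 'readme.txt']
by_forbidden = filter_names(names, forbidden_list=['readme.txt'])
by_match = filter_names(names, match_list=['.csv'])
print(by_forbidden)
print(by_match)
```

Under these assumptions, excluding 'readme.txt' and requiring '.csv' select the same three files.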

Example

Say you have a directory foo/ with the following files: data1.csv, data2.csv, data3.csv, and readme.txt. Suppose you want to read in each of the comma-separated files and write it out as tab-separated values, and also display the first 4 lines of the readme.txt file. The following code would achieve this:

from krrt.utils import read_file, load_CSV, save_CSV, get_file_list

#--- Load and print the first 4 lines of the readme.txt
readme_lines = read_file('foo/readme.txt')
print(readme_lines[:4])

#--- Locate all of the csv files
file_list = get_file_list('foo', forbidden_list = ['readme.txt'])
# Note: We could have used match_list=['.csv'] instead

#-- Iterate over each file
for file_name in file_list:
    #- Load the file as comma separated data
    data = load_CSV(file_name)

    #- Replace the .csv extension with .tsv
    new_file_name = file_name[:-4] + '.tsv'

    #- Write the file as tab separated data
    save_CSV(data, new_file_name, delimiter = "\t")


Running an Experiment

There is one main function used to simplify the setup of experimental evaluation: run_experiment. The function has a number of arguments, most of which are optional.

Arguments

  • base_directory: The base directory that the experiments should be run from. (default: ".")
  • base_command: The base command to be executed. This argument is mandatory.
  • single_arguments: A dictionary where the key is the name of an argument list (which is not included in the command), and the value is a list of arguments that should be used. For example if one (and only one) of flagA, flagB, and flagC should be included as a command-line option, then the key/value pair 'flags': ['flagA', 'flagB', 'flagC'] should be in the single_arguments dictionary. (default: None)
  • parameters: A dictionary where each key is the name of a command-line option, and the value is a list of values to use for that option. For example, if the software being tested has -input <filename> as a command-line option, then the dictionary would have an entry with the key '-input' and, as its value, a list of input files. (default: None)
  • time_limit: The number of seconds the software should be permitted to run. (default: 15)
  • memory_limit: The number of megabytes the software should be limited to. (default: -1 (i.e. unlimited))
  • results_dir: Directory to store the output of each program execution. (default: "results")
  • progress_file: The file that should contain text indicating the progress of the experiment as a percentage. If None is passed in, standard output is used. (default: "/dev/null")
  • processors: The number of cores to be used simultaneously. (default: 1)
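
To make the roles of single_arguments and parameters concrete, the sketch below enumerates one command line per combination of choices. It assumes that every combination is run, which the descriptions above suggest but do not state; the helper expand_commands is hypothetical, and the ordering and formatting of the real commands may differ.

```python
from itertools import product


def expand_commands(base_command, single_arguments=None, parameters=None):
    """Hypothetical helper: build one command string per combination."""
    single_arguments = single_arguments or {}
    parameters = parameters or {}
    # One choice is taken from each single_arguments list; the key is not emitted.
    arg_choices = [single_arguments[name] for name in sorted(single_arguments)]
    # One value is taken for each parameter; the key is emitted before its value.
    param_names = sorted(parameters)
    param_choices = [parameters[name] for name in param_names]
    commands = []
    for args in product(*arg_choices):
        for values in product(*param_choices):
            parts = [base_command]
            parts.extend(arg for arg in args if arg)  # skip an empty-string choice
            for name, value in zip(param_names, values):
                parts.extend([name, str(value)])
            commands.append(' '.join(parts))
    return commands


commands = expand_commands(
    './command do_stuff',
    single_arguments={'light_switch': ['-on', '-off'], 'flytype': ['-superfly', '']},
    parameters={'-parameter_1': [5, 25]},
)
for command in commands:
    print(command)
```

Two argument lists with two choices each and one parameter with two values give 2 × 2 × 2 = 8 commands, so the number of runs grows quickly as lists are added.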

Results

The data structure returned by run_experiment captures the information needed to filter results by specific parameters. It is a ResultSet object with the following functionality and attributes.

ResultSet

  • res_set.size: The number of results contained.
  • res_set.get_ids(): Returns a list of keys that can be used to select specific results.
  • res_set[id]: Returns the Result object associated with id.
  • res_set.add_result(res): Adds a Result object res to the ResultSet object.
  • res_set.filter_parameter(param, value): Returns a ResultSet with only the results that match the param / value pair specified.
  • res_set.filter_argument(argument, value): Returns a ResultSet with only the results that match the argument / value pair specified.
  • res_set.filter(func): Returns a ResultSet with only the results for which the user-defined function func returns true.

Note: The parameter and argument filter functions are just syntactic sugar for the generic filter function.
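
The note above can be illustrated with a small stand-in. MiniResult and MiniResultSet are hypothetical classes written for this sketch only; they mimic the documented interface but are not the krrt implementation, and the internal dictionary is an assumption.

```python
class MiniResult(object):
    """Hypothetical stand-in for Result with only the fields used here."""
    def __init__(self, id, single_args, parameters, timed_out=False):
        self.id = id
        self.single_args = single_args
        self.parameters = parameters
        self.timed_out = timed_out


class MiniResultSet(object):
    """Hypothetical stand-in for ResultSet."""
    def __init__(self):
        self.results = {}

    @property
    def size(self):
        return len(self.results)

    def get_ids(self):
        return sorted(self.results)

    def __getitem__(self, id):
        return self.results[id]

    def add_result(self, res):
        self.results[res.id] = res

    def filter(self, func):
        # Keep only the results for which func(result) is true.
        subset = MiniResultSet()
        for res in self.results.values():
            if func(res):
                subset.add_result(res)
        return subset

    # The two convenience filters simply delegate to the generic filter.
    def filter_parameter(self, param, value):
        return self.filter(lambda res: res.parameters.get(param) == value)

    def filter_argument(self, argument, value):
        return self.filter(lambda res: res.single_args.get(argument) == value)


res_set = MiniResultSet()
res_set.add_result(MiniResult(0, {'flytype': '-superfly'}, {'-parameter_1': 5}))
res_set.add_result(MiniResult(1, {'flytype': ''}, {'-parameter_1': 5}, timed_out=True))
res_set.add_result(MiniResult(2, {'flytype': '-superfly'}, {'-parameter_1': 25}))

fives = res_set.filter_parameter('-parameter_1', 5)
superfly = res_set.filter_argument('flytype', '-superfly')
finished = res_set.filter(lambda res: not res.timed_out)
print(fives.get_ids(), superfly.get_ids(), finished.get_ids())
```

Because each filter returns another set, filters can be chained, for example to select the runs with a given parameter value that also finished in time.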

Result

The Result object contains information corresponding to a single run of your experiment. Specifically it has the following attributes:

  • result.id: The id of the run (typically a number).
  • result.command: The full command executed.
  • result.output_file: The absolute path to the output captured from the command.
  • result.single_args: A dictionary mapping argument names to the value for this run.
  • result.parameters: A dictionary mapping parameter names to their setting for this run.
  • result.runtime: The runtime for this command to complete.
  • result.timed_out: A boolean value indicating whether or not this command timed out.

Example

from krrt.utils import run_experiment

# Run your program with different parameters, command-line arguments, etc
results = run_experiment(
    base_directory = '/path/to/command/',
    base_command = './command do_stuff',
    single_arguments = {
        'light_switch': ['-on', '-off'],
        'args': ['-arg1', '-arg2', '-arg3'],
        'flytype': ['-superfly', '']
      },
    parameters = {
        '-parameter_1': [5, 25, 100],
        '-parameter_2': [5, 25, 100],
        '-parameter_3': [.1, .25, .35]
      },
    time_limit = 900, # 15 minute time limit (900 seconds)
    memory_limit = 1000, # 1 GB memory limit (1000 MB)
    results_dir = "results",
    progress_file = None, # Print the progress to stdout
    processors = 8 # You've got 8 cores, right?
)

# (for whatever reason) Find all of the runs that had -superfly as an argument
superfly_results = results.filter_argument('flytype', '-superfly')

# Partition the results that didn't timeout into lists depending on -parameter_1
good_results = results.filter(lambda result: not result.timed_out)

p1_results = {}

for result in good_results:
    p1_results.setdefault(result.parameters['-parameter_1'], []).append(result)

# p1_results is now a dict with the keys '5', '25', and '100' and a list of
#  results corresponding to those values for -parameter_1


Parsing Output

The following functions are available for common parsing tasks that you may want to perform when building your experimental framework.

get_value

The get_value(file_name, regex, value_type = float) function is used to retrieve a single value from an output file.

Arguments

  • file_name: Path of the output file.
  • regex: Regex string that is used to match for the value. (e.g. .*size:(\d+).*)
  • value_type: (optional) Parameter to specify the type of the value (e.g. int)

Example

from krrt.utils import get_value

#--- Get the runtime from the file 'output' that is of the form "runtime:3.02sec"
runtime = get_value('output', r'.*runtime:([0-9]+\.?[0-9]*)sec.*', float)
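
As a rough picture of what extracting a single value involves, the sketch below searches a file for a regex and converts the first captured group. The helper first_value is hypothetical; searching the whole file contents at once and returning None when nothing matches are both assumptions, not documented behaviour of get_value.

```python
import os
import re
import tempfile


def first_value(file_name, regex, value_type=float):
    """Hypothetical stand-in: convert the first captured group of the first match."""
    with open(file_name) as handle:
        contents = handle.read()
    match = re.search(regex, contents)
    if match is None:
        return None  # assumption: the real function may behave differently here
    return value_type(match.group(1))


# Write a small sample output file to read back.
path = os.path.join(tempfile.mkdtemp(), 'output')
with open(path, 'w') as handle:
    handle.write('size:42\nruntime:3.02sec\n')

runtime = first_value(path, r'.*runtime:([0-9]+\.?[0-9]*)sec.*')
size = first_value(path, r'.*size:(\d+).*', int)
print(runtime, size)
```

The regex needs exactly one capturing group around the value, and value_type decides how the captured text is converted.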

match_value

The match_value(file_name, regex) function checks whether a regex matches anywhere inside a file.

Arguments

  • file_name: Path of the output file.
  • regex: Regex string to look for in the file. (e.g. .*Timeout.*)

Example

from krrt.utils import match_value

#--- Check if the file 'output' has the string "Timeout" inside of it.
timed_out = match_value('output', '.*Timeout.*')
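
The text above does not say whether the file is tested line by line or as one string, and with patterns such as .*Timeout.* the difference matters. The sketch below uses only Python's re module on an in-memory string to show the three cases; it makes no claim about which one match_value uses.

```python
import re

output = "step 1\nTimeout reached\nstep 2\n"
pattern = r'.*Timeout.*'

# re.match anchors at the start of the text, and '.' does not cross newlines,
# so a match on the whole text fails when "Timeout" is not on the first line.
whole_match = re.match(pattern, output) is not None

# re.search may start anywhere in the text, so it finds the second line.
whole_search = re.search(pattern, output) is not None

# Matching each line separately also finds it.
line_match = any(re.match(pattern, line) for line in output.splitlines())

print(whole_match, whole_search, line_match)
```

Writing the pattern with a leading and trailing .* keeps it usable under either a line-by-line match or a search over the whole file.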

get_lines

The get_lines(file_name, lower_bound = None, upper_bound = None) function is used to retrieve a contiguous sequence of lines from a file based on lines that surround the targeted text (non-inclusive). If lower_bound is not supplied, then all lines from the start of the file are included (similarly with upper_bound).

Arguments

  • file_name: Path of the output file.
  • lower_bound: (optional) Parameter for indicating the lower bounding line to match on.
  • upper_bound: (optional) Parameter for indicating the upper bounding line to match on.

Example

from krrt.utils import get_lines

#--- Get the lines of the output file between the lines "start_results" and "end_results"
result_lines = get_lines('output', lower_bound = 'start_results', upper_bound = 'end_results')
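
The non-inclusive bounds can be sketched on an in-memory list of lines. The helper lines_between is hypothetical; treating a bound as a substring of a line, using the first line that matches, and returning nothing when the lower bound is missing are all assumptions.

```python
def lines_between(lines, lower_bound=None, upper_bound=None):
    """Hypothetical stand-in for the selection logic of get_lines."""
    start = 0
    if lower_bound is not None:
        hits = [i for i, line in enumerate(lines) if lower_bound in line]
        if not hits:
            return []  # assumption: a missing lower bound selects nothing
        start = hits[0] + 1  # begin after the bounding line (non-inclusive)
    end = len(lines)
    if upper_bound is not None:
        for i in range(start, len(lines)):
            if upper_bound in lines[i]:
                end = i  # stop before the bounding line (non-inclusive)
                break
    return lines[start:end]


output = ['header', 'start_results', 'a 1', 'b 2', 'end_results', 'footer']
both = lines_between(output, lower_bound='start_results', upper_bound='end_results')
only_upper = lines_between(output, upper_bound='end_results')
print(both)
print(only_upper)
```

With both bounds given, neither bounding line is returned; with only upper_bound, everything from the start of the file up to that line is kept.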


Misc

Additionally the utils package provides the following functionality:

  • get_opts(): Returns a tuple (opts, flags) of command line parameters, where:
  • opts: Dictionary of options where the key is of the form -<option> and the value is just a string.
  • flags: List of strings that weren't part of an -<option> <value> pair.
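
One way to picture the (opts, flags) split is the sketch below. The helper split_opts is hypothetical and takes the argument list explicitly, and its pairing rule (a token starting with '-' followed by a token that does not start with '-') is an assumption; the rule get_opts actually applies is not described above.

```python
def split_opts(argv):
    """Hypothetical stand-in: separate -<option> <value> pairs from other tokens."""
    opts, flags = {}, []
    i = 0
    while i < len(argv):
        token = argv[i]
        has_value = i + 1 < len(argv) and not argv[i + 1].startswith('-')
        if token.startswith('-') and has_value:
            opts[token] = argv[i + 1]  # values stay plain strings
            i += 2
        else:
            flags.append(token)  # anything that is not part of a pair
            i += 1
    return opts, flags


opts, flags = split_opts(['-input', 'problem.txt', '-verbose', '-timeout', '30', 'extra'])
print(opts)
print(flags)
```

A known limitation of this rule is that a negative number such as -5 would be read as a new option rather than as a value.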