Common use case: daily process

Eric Gaudet edited this page Jun 1, 2014 · 5 revisions

Since the input files are read-only, the mechanism to update your data is not immediately apparent. A very common use case is a process that must update the same data set every day.

The recommended setup is to have one directory called "input", containing the current data set, and a second directory called "output", where the process writes the updated data set. A third directory called "tmp" is often useful for temporary data. At the end of the process, when the "output" directory is fully populated, the "tmp" directory can be deleted, the "input" directory is either deleted or renamed "input.old" for archiving, and the "output" directory is renamed "input", ready for the next day.
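The end-of-process steps above can be sketched as a small script. This is only an illustration, not part of Sisyphus: the function name `rotate_directories` and its parameters are hypothetical, and it assumes "input" and "output" live under the same base directory on the same filesystem, so that the renames are cheap.

```python
import shutil
from pathlib import Path
from typing import Optional


def rotate_directories(base: Path,
                       tmp_dir: Optional[Path] = None,
                       keep_archive: bool = True) -> None:
    """Promote "output" to "input" after a successful run (hypothetical helper)."""
    input_dir = base / "input"
    output_dir = base / "output"
    archive_dir = base / "input.old"
    # Assumption: "tmp" sits next to the other directories unless given explicitly.
    tmp_dir = tmp_dir if tmp_dir is not None else base / "tmp"

    if not output_dir.is_dir():
        raise FileNotFoundError(f"no output directory at {output_dir}")

    # Temporary data is no longer needed once "output" is fully populated.
    shutil.rmtree(tmp_dir, ignore_errors=True)

    if input_dir.exists():
        if keep_archive:
            # Drop the previous archive, then keep the current input as "input.old".
            shutil.rmtree(archive_dir, ignore_errors=True)
            input_dir.rename(archive_dir)
        else:
            shutil.rmtree(input_dir)

    # The updated data set becomes the input of the next run.
    output_dir.rename(input_dir)
```

Call it only after the process has finished successfully, for example `rotate_directories(Path("/data/daily"))`, where the path is a placeholder.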

For best throughput, the "tmp" directory should be on a different physical disk from "input" and "output". Most Sisyphus programs first read from "input" and write to "tmp", then read from "tmp" and write to "output", and finally move "output" to "input". With "tmp" on its own disk, each phase reads from one disk and writes to the other.

This setup is recommended for most Sisyphus programs because it keeps the input data separate from the output data and leaves it unmodified. If a problem occurs during the process, the "input" directory is still intact, so the program can easily be retried, and debugged if necessary.
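Because "input" is never modified, preparing a retry only requires discarding whatever the failed run left behind. A minimal sketch, again not part of Sisyphus: the helper name `reset_for_retry` is hypothetical, and it assumes the process recreates "output" and "tmp" itself when it starts.

```python
import shutil
from pathlib import Path


def reset_for_retry(base: Path) -> None:
    """Discard partial results of a failed run; "input" is left untouched."""
    input_dir = base / "input"
    if not input_dir.is_dir():
        raise FileNotFoundError(f"no input directory at {input_dir}")

    # Remove the partially written data; missing directories are ignored.
    for name in ("output", "tmp"):
        shutil.rmtree(base / name, ignore_errors=True)
```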

Previous: Other Useful Tools - Next: Example Implementation
