-
Notifications
You must be signed in to change notification settings - Fork 448
DbPurge
As a BOINC project operates, the size of its workunit and result tables increase. To limit this growth, BOINC provides a utility db_purge that writes result and WU records to XML-format archive files, then deletes them from the database. Workunits are purged only when their input files have been deleted. Because of BOINC's file-deletion policy, this implies that all results are completed. So when a workunit is purged, all its results are purged too.
db_purge creates an archive/ directory and stores archive files there.
db_purge is normally run as a daemon, specified in the config.xml file. It has the following command-line options:
Purge only workunits of the given app.
Set logging verbosity to N (1,2,3,4)
Write archives in a new directory (YYYY_MM_DD) each day
Don't delete from DB, for testing only
Compress archive files using gzip
Purge at most N WUs, then exit
Write at most N WUs to each archive file. Recommended value: 10,000 or so.
Purge only WUs with mod_time at least X (can be < 1) days in the past. Recommended value: 7 or so. This lets users examine their recent results.
Process only WUs with ID mod N == R. This lets you run multiple instances for increased throughput. Must be used with --no_archive.
Don't archive workunits or results
Do one pass, then quit
Compress archive files using zip
Delete retired batches from the batch table
You may want to keep a record of which user returned a valid result for each workunit (e.g. so that you can give them credit for specific discoveries). If you use db_purge, this information is removed from the database; it can be extracted from the XML archives, but it's slow.
Alternatively, you can store the user/workunit association in a separate table called credited_job. The records in this table are small - they store only a user ID and a 64-bit integer which is taken from the workunit's 'opaque' field (typically an index in a separate database). By default, the table has indices that allow you to efficiently enumerate workunits for a given user and users for a given workunit.
To use the credited_job table, pass the --update_credited_job flag to your validator.
The archive files have names of the form wu_archive_TIME
and result_archive_TIME
where TIME
is the Unix time the file was created. In addition, db_purge generates index files 'wu_index' and 'result_index' associating each WU and result ID with the timestamp of its archive file.
The format of both type of index files is a number of rows each containing:
ID TIME NAME
The ID field of the WU or result, 5 spaces, the timestamp part of the archive filename where the record with that ID can be found, 4 spaces, and the name of the WU/result.
The format of a record in the result archive file is:
<result_archive>
<id>%d</id>
<create_time>%d</create_time>
<workunitid>%d</workunitid>
<server_state>%d</server_state>
<outcome>%d</outcome>
<client_state>%d</client_state>
<hostid>%d</hostid>
<userid>%d</userid>
<report_deadline>%d</report_deadline>
<sent_time>%d</sent_time>
<received_time>%d</received_time>
<name>%s</name>
<cpu_time>%.15e</cpu_time>
<xml_doc_in>%s</xml_doc_in>
<xml_doc_out>%s</xml_doc_out>
<stderr_out>%s</stderr_out>
<batch>%d</batch>
<file_delete_state>%d</file_delete_state>
<validate_state>%d</validate_state>
<claimed_credit>%.15e</claimed_credit>
<granted_credit>%.15e</granted_credit>
<opaque>%f</opaque>
<random>%d</random>
<app_version_num>%d</app_version_num>
<appid>%d</appid>
<exit_status>%d</exit_status>
<teamid>%d</teamid>
<priority>%d</priority>
<mod_time>%s</mod_time>
</result_archive>
The format of a record in the WU archive file is:
<workunit_archive>
<id>%d</id>
<create_time>%d</create_time>
<appid>%d</appid>
<name>%s</name>
<xml_doc>%s</xml_doc>
<batch>%d</batch>
<rsc_fpops_est>%.15e</rsc_fpops_est>
<rsc_fpops_bound>%.15e</rsc_fpops_bound>
<rsc_memory_bound>%.15e</rsc_memory_bound>
<rsc_disk_bound>%.15e</rsc_disk_bound>
<need_validate>%d</need_validate>
<canonical_resultid>%d</canonical_resultid>
<canonical_credit>%.15e</canonical_credit>
<transition_time>%d</transition_time>
<delay_bound>%d</delay_bound>
<error_mask>%d</error_mask>
<file_delete_state>%d</file_delete_state>
<assimilate_state>%d</assimilate_state>
<hr_class>%d</hr_class>
<opaque>%f</opaque>
<min_quorum>%d</min_quorum>
<target_nresults>%d</target_nresults>
<max_error_results>%d</max_error_results>
<max_total_results>%d</max_total_results>
<max_success_results>%d</max_success_results>
<result_template_file>%s</result_template_file>
<priority>%d</priority>
<mod_time>%s</mod_time>
</workunit_archive>
To extract old statistics, one has to recover the xml archive files (records get periodically removed from the results table by db_purge). This section explains a quick hack to recover the xml archive into a database. For now, only the result_archive is recreated, without xml data.
Note. This procedure relies on the LOAD XML INTO
MySQL command, which was introduced (about) in MySQL version 6.0 alpha. The 6.0 version was then removed from the MySQL roadmap, and must therefore be resurrected from sameplace on the net. Try e.g. looking for mysql-6.0.11-alpha-linux-x86_64-glibc23.tar.gz. It might be found e.g. at ftp://ftp.fu-berlin.de/unix/databases/mysql/Downloads/MySQL-6.0/ . MySQL 6.0 should not be used in the production DB, but only on a sandbox machine (called local henceforth).
Prerequisites:
- A local working installation of mysql 6.0 alpha
- A local image of the current BOINC database (e.g. restored from a recent dump)
Procedure:
- put the following script, boinc_xml_import, where it can be executed
# iterate over all arguments in the command line, unzip them,
# strip XML-in-XML and feed them to mysql for import
tmpf=/tmp/boinc_import.$$
for f in $*; do
echo Processing $f ...
# Remove the 3 xml snippets of each result. Treat the file as a huge string.
zcat -f $f | perl -w -e '
use strict;
my @a;
{ local $/=undef;
@a=<>; }
my $l="@a";
$l=~s|<xml_doc_in>.+?</stderr_out>| |sg;
print "$l";
' > $tmpf
mysql YOURPROJECT <<EOF
CREATE TABLE IF NOT EXISTS result_archive LIKE result;
SET AUTOCOMMIT=0;
SET UNIQUE_CHECKS=0;
SET FOREIGN_KEY_CHECKS=0;
LOAD XML CONCURRENT INFILE '$tmpf' IGNORE INTO TABLE result_archive
ROWS IDENTIFIED BY '<result_archive>';
SET FOREIGN_KEY_CHECKS=1;
SET UNIQUE_CHECKS=1;
COMMIT;
EOF
echo ... done
done
- Copy the .xml.gz archive files on a local machine, and run the following command (it will take a long time)
boinc_xml_import *.xml.gz
- At the end of the run, a result_archive table will be reconstructed on the server. To add the most recent results, contained in the result table, do (it will also take a while)
INSERT INTO result_archive SELECT * FROM result;