This project analyzes the trace results profiled via byteprofile, a developed version of BytePS.
Depending on the value of --option, this project supports the functionalities described below.
Set arg --option statistic to show the statistic results. The arg --path must be set to the exact trace file path (ending with .json).
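As a rough illustration of what per-operation statistics over a trace can look like, here is a minimal sketch. It assumes the trace file follows the Chrome trace event layout (a list of events with `name`, `ph`, `ts`, and `dur` fields); the function name `summarize_durations`, the sample events, and the op name `FW.conv1` are hypothetical and not taken from analyze.py.

```python
from collections import defaultdict


def summarize_durations(events):
    """Return {name: (count, total_us, avg_us)} for complete ("X") events."""
    totals = defaultdict(lambda: [0, 0.0])
    for ev in events:
        # Assumption: only complete events ("ph" == "X") carry a "dur" field.
        if ev.get("ph") != "X" or "dur" not in ev:
            continue
        entry = totals[ev["name"]]
        entry[0] += 1
        entry[1] += ev["dur"]
    return {name: (cnt, total, total / cnt) for name, (cnt, total) in totals.items()}


# Hypothetical events, shaped like entries of a Chrome trace event list.
sample_events = [
    {"name": "FW.conv1", "ph": "X", "ts": 0, "dur": 120},
    {"name": "FW.conv1", "ph": "X", "ts": 1000, "dur": 80},
    {"name": "Comm.gradient_1", "ph": "X", "ts": 200, "dur": 300},
    {"name": "marker", "ph": "i", "ts": 50},
]

stats = summarize_durations(sample_events)
for name, (count, total, avg) in sorted(stats.items()):
    print(f"{name}: count={count} total={total}us avg={avg}us")
```

The statistics that the tool itself reports may differ; this only shows the general idea of grouping events by name and aggregating their durations.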
Set arg --option graph to visualize the dependency DAG. The arg --path must be set to the exact DAG path (ending with .gml).
Set arg --option combine to combine several trace files into one file. For example, one worker may have two GPUs, each of which generates a trace file; you can use this option and list the paths of these two files with --path.
There are two ways to specify the trace paths.
- Use file paths. In this case, --path should be a list of file paths, each of which denotes a trace file. The combined trace file will be stored in the same directory as the first trace file.
- Use directory paths. In this case, --path is a list of directory paths, each of which denotes one worker and contains the trace directories of the GPUs on this worker. By default, the combined trace file will be stored under the first directory.
Note: please ensure that either all of the paths are file paths or all of them are directory paths.
If you do not want to combine all the traces, you can use --filter to give a comma-separated list of communication operations; only these communication operations will then appear in the combined trace file. For now, the filter only supports communication nodes. An example is shown below.
python3 analyze.py --option combine --path ... --filter Comm.gradient_1,Comm.gradient_2
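The combine-and-filter behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the code in analyze.py: events are assumed to be Chrome-trace-style dicts, communication events are assumed to be those whose name starts with `Comm.`, and the filter is assumed to drop only unlisted communication events while keeping everything else. The function name `combine_traces` and the op names `BW.fc1` and `Comm.gradient_3` are hypothetical.

```python
def combine_traces(per_gpu_events, comm_filter=None):
    """Merge per-GPU event lists into one list, tagging each event with its GPU.

    per_gpu_events: list of event lists; the list index is used as the
        local rank (assumption).
    comm_filter: optional set of communication op names to keep.
    """
    combined = []
    for local_rank, events in enumerate(per_gpu_events):
        for ev in events:
            is_comm = ev["name"].startswith("Comm.")
            # Assumption: the filter drops only communication events that are
            # not listed; all other events are kept.
            if comm_filter is not None and is_comm and ev["name"] not in comm_filter:
                continue
            merged = dict(ev)  # copy so the input events are not modified
            merged["pid"] = f"Process {local_rank}"
            combined.append(merged)
    combined.sort(key=lambda ev: ev["ts"])
    return combined


# Hypothetical traces for two GPUs on one worker.
gpu0 = [
    {"name": "Comm.gradient_1", "ph": "X", "ts": 10, "dur": 5},
    {"name": "Comm.gradient_3", "ph": "X", "ts": 20, "dur": 5},
    {"name": "BW.fc1", "ph": "X", "ts": 0, "dur": 8},
]
gpu1 = [{"name": "Comm.gradient_2", "ph": "X", "ts": 15, "dur": 5}]

combined = combine_traces(
    [gpu0, gpu1], comm_filter={"Comm.gradient_1", "Comm.gradient_2"}
)
for ev in combined:
    print(ev["pid"], ev["name"], ev["ts"])
```

Passing `comm_filter=None` keeps every event, which corresponds to combining all traces without a filter.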
An example of a combined timeline of 2 GPUs, visualized by the Chrome trace tool, is shown below. It uses MNIST as the dataset and runs on 2 workers, each with 2 V100 GPUs. In the prefix Process 0, the 0 denotes the local rank of this GPU.
Set arg --option compare. As with option combine, the argument --path can be a list of worker trace directories or a list of trace files. When a list of directories is given, the traces on one worker are merged automatically.
Besides, you can
- set --xlsx to export the comparison results to an XLSX file.
- set --sort to sort the comparison results.
- set --head <number> to display the first <number> comparison results.
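To make the roles of sorting and truncation concrete, here is a minimal sketch of a comparison over two sets of per-operation average durations. The metric being compared, the sort key (largest absolute difference first), and all names and values below are assumptions for illustration; the source does not specify what analyze.py compares or how it sorts.

```python
def compare_avg_durations(avg_a, avg_b, sort=True, head=None):
    """Compare two {op_name: avg_duration} maps on the ops they share.

    Returns rows of (name, avg_a, avg_b, diff) where diff = avg_b - avg_a.
    """
    rows = [
        (name, avg_a[name], avg_b[name], avg_b[name] - avg_a[name])
        for name in sorted(avg_a.keys() & avg_b.keys())
    ]
    if sort:
        # Assumed sort key: largest absolute difference first, then name.
        rows.sort(key=lambda r: (-abs(r[3]), r[0]))
    if head is not None:
        rows = rows[:head]
    return rows


# Hypothetical average durations (in microseconds) from two traces.
trace_a = {"FW.conv1": 100.0, "BW.conv1": 210.0, "Comm.gradient_1": 300.0}
trace_b = {
    "FW.conv1": 105.0,
    "BW.conv1": 200.0,
    "Comm.gradient_1": 450.0,
    "Comm.gradient_2": 50.0,
}

top = compare_avg_durations(trace_a, trace_b, sort=True, head=2)
for name, a, b, diff in top:
    print(f"{name}: a={a} b={b} diff={diff:+}")
```

Operations that appear in only one trace (here `Comm.gradient_2`) are skipped in this sketch; a real comparison might instead report them separately.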
Set arg --option critical. Here, --path should be the root trace directory; by default, it is BYTEPS_TRACE_DIR.
Note that, you must use the latest version of byteprofile to run this option.
Set arg --option replay to replay the traces for one worker.
- Use --path to specify the path where the worker traces are stored.
- Set --del_queue to include each partition and QueueType for communication traces.
- Use --step_num to give the number of steps to replay.
- Set --pretty to output necessary info.
Set arg --option collect to update the final traces. Meanwhile, the average iteration time will be output. --path should be the root directory of a worker or a GPU.
- --sub_option iter_time: only calculate the iteration time and the FW+BW time.
- --sub_option operator: update the operator traces based on the source files.
- others: re-combine all traces based on the source files.
Ignore partition id
pip3 packages: intervaltree, networkx, ujson, xlsxwriter, scapy