This is a Go implementation of a parallel tree walk, together with a suite of file system tools aimed at large-scale, high-performance profiling. Unlike the Python-based pcircle and the C++-based fprof, both of which rely on MPI for inter-process communication to implement cluster-wide work stealing and distributed termination detection, pi is meant to run interactively on a single machine with good scaling properties.
On OLCF's Summit production file system, a 250 PB GPFS-based parallel file system, we measured a scanning rate of over 200,000 operations per second on a single IBM POWER9 node running 128 threads. That should be adequate for regular use.
That said, HPC file systems are infamous for extreme cases, such as a single shared directory with 2 to 7 million or more files. It is difficult to handle this kind of shared directory if the parallel file system (PFS) does not implement distributed directory striping, such as Lustre's DNE2 or GPFS's distributed metanode handling. It remains to be seen whether this is good enough for a full system scan.
Assuming you have Go installed and available on your PATH (for example, via brew install go), all you need to do is:
go get -u github.com/fwang2/pi
This installs the pi binary into the bin directory under your GOPATH, which by default is $HOME/go/bin.
On Summit (POWER arch)
module use /sw/exp9/spack/modules/linux-rhel7-power91e
On Rhea (x86)
module use /sw/exp9/linux-rhel7-sandybridge
Assuming the module use command above succeeds, then:
module load pi
will make pi available.
▶ pi topn .
▶ pi profile --hist .
--hist builds a histogram of the file distribution. It is turned off by default.
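As a rough illustration of what building such a histogram involves, the sketch below bins file sizes into ranges. It assumes the histogram is over file sizes; the bucket boundaries and the name sizeHistogram are hypothetical, and pi's actual bins may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// Hypothetical bucket upper bounds in bytes (4 KiB, 64 KiB, 1 MiB, 64 MiB, 1 GiB).
// These are illustrative only; they are not pi's real bins.
var bounds = []int64{4 << 10, 64 << 10, 1 << 20, 64 << 20, 1 << 30}

// sizeHistogram returns len(bounds)+1 counts. counts[i] is the number of
// sizes that are <= bounds[i] and greater than the previous bound; the
// final slot collects everything larger than the last bound.
func sizeHistogram(sizes []int64, bounds []int64) []int {
	counts := make([]int, len(bounds)+1)
	for _, s := range sizes {
		// Binary search for the first bound that can hold this size.
		i := sort.Search(len(bounds), func(i int) bool { return s <= bounds[i] })
		counts[i]++
	}
	return counts
}

func main() {
	sizes := []int64{0, 4096, 4097, 1 << 20, 2 << 30}
	fmt.Println(sizeHistogram(sizes, bounds))
}
```

Because each file only increments one counter, the per-file cost is small compared with the stat call that produced the size, which is consistent with offering the histogram as an optional flag.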
▶ pi find / --type f --size +100m --mtime 7d
pi interprets 7d the same as +7d. To negate this and search for changes within the last week, use -7d instead.
▶ pi zip /path/to/project -o project.tar.gz
This can be helpful if you have large files and feel that tar/zip is taking too long. The compression itself is parallelized, but the subsequent tar step still has room for improvement.
▶ pi sparse-check /path/to/sparsefile
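For readers unfamiliar with sparse files, a common heuristic is to compare a file's apparent size with the space actually allocated to it. The sketch below shows that heuristic on Linux or macOS; the names isSparse and checkSparse are hypothetical, and pi sparse-check may use a different method.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// isSparse reports whether the allocated space is smaller than the apparent
// size. st_blocks is counted in 512-byte units on Linux and macOS.
// File systems with transparent compression can also trigger this, so
// treat the result as a hint rather than proof.
func isSparse(size, blocks int64) bool {
	return blocks*512 < size
}

// checkSparse stats path and applies the heuristic above.
// It relies on syscall.Stat_t and therefore does not build on Windows.
func checkSparse(path string) (bool, error) {
	info, err := os.Stat(path)
	if err != nil {
		return false, err
	}
	st, ok := info.Sys().(*syscall.Stat_t)
	if !ok {
		return false, fmt.Errorf("no stat data available for %s", path)
	}
	return isSparse(info.Size(), int64(st.Blocks)), nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: sparsecheck <file>")
		return
	}
	sparse, err := checkSparse(os.Args[1])
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s sparse: %v\n", os.Args[1], sparse)
}
```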