In this repository, we provide the artifact for the paper Minuet: Accelerating 3D Sparse Convolution on GPUs.
Running all evaluations in the artifact is expected to take about 2-3 hours (excluding dataset downloads).
The artifact should run on any hardware platform with a modern NVIDIA desktop/server GPU (≥ 8 GB GPU memory), an x86-64 CPU, sufficient CPU memory (≥ 32 GB), and storage (≥ 150 GB). For reference, our experiments are mainly conducted with the following hardware specs.
- CPU: AMD Ryzen Threadripper 2920X
- GPU: NVIDIA GeForce RTX 3090 (TDP: 350W)
- Memory: 64 GB DDR4 RAM
- Storage: 256 GB Solid-State Drive (SSD)
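A quick pre-flight check against the CPU memory and storage minimums above could look like the following. This is a minimal sketch and not part of the artifact: the function name check_host and its parameters are illustrative, it assumes a Linux host (where os.sysconf exposes the physical page count), and it does not query the GPU.

```python
import os
import shutil

GIB = 1024 ** 3


def check_host(path=".", min_mem_gb=32, min_disk_gb=150):
    """Return {name: (measured_gib, ok)} for CPU memory and free disk space."""
    # Total physical memory = page size * number of physical pages (Linux).
    mem_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    # Free space on the filesystem that holds `path`.
    free_bytes = shutil.disk_usage(path).free
    mem_gib = mem_bytes / GIB
    disk_gib = free_bytes / GIB
    return {
        "memory": (mem_gib, mem_gib >= min_mem_gb),
        "storage": (disk_gib, disk_gib >= min_disk_gb),
    }


if __name__ == "__main__":
    for name, (value, ok) in check_host().items():
        status = "ok" if ok else "below the recommended minimum"
        print(f"{name}: {value:.1f} GiB ({status})")
```

The operating system usually reports slightly less memory than is physically installed, so a machine with exactly 32 GB may be flagged; treat the result as a hint rather than a hard gate.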
The artifact should be executed under a Linux-based operating system with an up-to-date NVIDIA driver installed. For reference, we use the following software setup:
- OS: Ubuntu 20.04.5 LTS with Linux Kernel 5.15.0-82-generic
- NVIDIA Driver: 535.104.05
In addition, our experiments use NVIDIA Nsight Compute to measure the cache hit ratio, which requires access to GPU performance counters. If you encounter permission errors (ERR_NVGPUCTRPERM), please follow these instructions to grant permissions.
We provide two options to build this artifact: Docker (recommended) and Conda.
We recommend using Docker Engine to build the artifact so that all software dependencies are fully controlled. Please first follow the instructions to install Docker Engine and NVIDIA Container Toolkit. Note that if the current user is not in the docker user group, all following docker-related commands require root privilege (i.e. with sudo) to run.
Building the docker image requires the GPU compute capability to be specified explicitly, since it is unknown during the image build process. Fortunately, the nvidia-smi tool provides a convenient way to check the GPU compute capability:
# Assume we use the first GPU for evaluation, i.e. GPU 0
export GPU_ID=0
nvidia-smi -i $GPU_ID --query-gpu=compute_cap --format=csv
The following snippet builds the docker image for the GPU selected by GPU_ID:
# The compute capability can be also set manually.
# For reference:
# RTX 2070/2080Ti: export CUDA_ARCHS=7.5
# RTX 3090: export CUDA_ARCHS=8.6
# Tesla A100: export CUDA_ARCHS=8.0
export CUDA_ARCHS="$(nvidia-smi -i $GPU_ID --query-gpu=compute_cap --format=csv | tail -n 1)"
# We require the user id and group id to make sure the files we write on the
# mounted volumes are owned by the current user (i.e. can be cleaned up
# without sudo).
docker build \
--build-arg CUDA_ARCHS=$CUDA_ARCHS \
--build-arg UID=$(id -u) \
--build-arg GID=$(id -g) \
-t minuet .
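Before passing CUDA_ARCHS to docker build, it can help to validate what nvidia-smi actually returned. The following is a minimal sketch, not part of the artifact: parse_compute_cap and query_compute_cap are illustrative names, and the parser assumes the CSV output consists of a header line followed by a value of the form major.minor.

```python
import subprocess


def parse_compute_cap(csv_text):
    """Extract the compute capability (e.g. "8.6") from nvidia-smi CSV output."""
    # Drop blank lines; the first remaining line is the CSV header.
    lines = [line.strip() for line in csv_text.splitlines() if line.strip()]
    if len(lines) < 2:
        raise ValueError("expected a header line followed by a value")
    value = lines[-1]  # same choice as `tail -n 1` in the shell snippet
    major, _, minor = value.partition(".")
    if not (major.isdigit() and minor.isdigit()):
        raise ValueError(f"unexpected compute capability: {value!r}")
    return value


def query_compute_cap(gpu_id=0):
    """Run the same nvidia-smi query as the shell snippet and parse its output."""
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_id), "--query-gpu=compute_cap", "--format=csv"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_compute_cap(out)
```

If the parser raises, the compute capability can still be set manually as shown in the comments of the build snippet.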
After the docker image has been built successfully, execute the following command to launch a container for the subsequent experiments and commands:
docker run -it --rm --gpus $GPU_ID -v "$(pwd):/workspace/artifacts" minuet
To build with Conda instead, please first make sure you have Anaconda installed as described here.
Please execute the following commands to create and activate a conda environment for this artifact:
conda create -n MinuetArtifacts
conda activate MinuetArtifacts
Then, please execute the following script from the root of this artifact to install all software dependencies and the artifact itself:
bash ./automate/conda_install.sh
Conda places the installed dynamic libraries in a location different from the system defaults. Please use the following command to ensure that they are loaded correctly:
export LD_LIBRARY_PATH="$CONDA_PREFIX/lib:$LD_LIBRARY_PATH"
We also require the PYTHONPATH environment variable to be set for the scripts to function correctly:
# We assume every following command will be running under the root of this artifact
export PYTHONPATH=.
We support evaluations on synthetic random datasets and the following real datasets. Because each dataset has its own license, we intentionally provide only instructions for downloading the datasets manually:
- (Required, ~253 MB) Semantic3D Dataset
  - Download birdfountain_station1_xyz_intensity_rgb.7z to the data/ folder with the following command:
    wget http://www.semantic3d.net/data/point-clouds/testing1/birdfountain_station1_xyz_intensity_rgb.7z -P data
- (Optional, ~80 GB) SemanticKITTI Dataset
  - Download the KITTI Odometry Benchmark Velodyne point clouds from the official website into the data/ folder, where the downloaded file name should be data_odometry_velodyne.zip (direct link).
- (Optional, ~5 GB) Stanford Large-Scale 3D Indoor Spaces Dataset (S3DIS)
  - Fill in the form to retrieve the download links on the official website (direct link).
  - Download Stanford3dDataset_v1.2_Aligned_Version.zip and place it under the data/ folder.
- (Optional, ~12 GB) ShapeNetSEM Dataset
  - Fill in the form to get permission to download the ShapeNetSEM dataset here.
  - Download ShapeNetSem.zip and place it under the data/ folder.
Note that it is NOT required to download all datasets. However, the experiments depending on an unavailable dataset will be skipped, which could cause variability in Figure 10 and Figure 14 (see Step 3 for details).
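To see at a glance which experiments will be skipped, the presence of each archive can be checked before the preparation step. The following is a minimal sketch, not part of the artifact: dataset_status and missing_required are illustrative names, and the sketch assumes the archives keep the file names given in the download instructions and sit directly in the data/ folder.

```python
from pathlib import Path

# Archive names as given in the download instructions.
DATASET_FILES = {
    "Semantic3D": "birdfountain_station1_xyz_intensity_rgb.7z",
    "SemanticKITTI": "data_odometry_velodyne.zip",
    "S3DIS": "Stanford3dDataset_v1.2_Aligned_Version.zip",
    "ShapeNetSEM": "ShapeNetSem.zip",
}
REQUIRED = {"Semantic3D"}


def dataset_status(data_dir="data"):
    """Map each dataset name to True if its archive exists in data_dir."""
    root = Path(data_dir)
    return {name: (root / fname).is_file() for name, fname in DATASET_FILES.items()}


def missing_required(status):
    """Return the sorted names of required datasets whose archive is absent."""
    return sorted(name for name in REQUIRED if not status.get(name, False))


if __name__ == "__main__":
    status = dataset_status()
    for name, present in status.items():
        print(f"{name}: {'found' if present else 'missing'}")
    if missing_required(status):
        print("Required dataset(s) missing:", ", ".join(missing_required(status)))
```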
Then, please run the following command to decompress and prepare all downloaded datasets:
bash automate/prepare_datasets.sh
We provide instructions for reproducing the main performance numbers with the following experiments.
TL;DR: Use the following script to reproduce all figures for our main evaluations:
bash automate/run_all.sh
In the mapping step experiment E3 (i.e. 3_mapping_step.sh), some baselines could produce Out-Of-Memory (OOM) errors on GPUs with relatively small memory (e.g. RTX 2070 Super). Our plotting scripts handle these errors by plotting only the subset of experiments that did not run out of memory.
The major claims of Minuet C1-C3 (See the artifact appendix A.4.1) hold across different GPU models and datasets. However, when reproducing the experiments, the absolute numbers might have small fluctuations due to different execution conditions (e.g. GPU thermal management).
Additionally, as mentioned in Step 2, it is not necessary to download and prepare optional datasets. If the optional datasets are not used, this will cause variability in Figure 10 and Figure 14 since these plotted performance numbers are averaged over all downloaded and prepared datasets.
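The effect of averaging over only the prepared datasets can be illustrated with a small calculation. This is a sketch under stated assumptions, not the plotting scripts' actual logic: mean_speedup is an illustrative name, the arithmetic mean is an assumption, and the speedup values are hypothetical numbers chosen only to show the effect.

```python
def mean_speedup(per_dataset, available):
    """Arithmetic mean of the speedups of the datasets that were prepared."""
    used = [per_dataset[name] for name in available if name in per_dataset]
    if not used:
        raise ValueError("no prepared dataset has a result")
    return sum(used) / len(used)


# Hypothetical speedups, NOT measured results.
speedups = {
    "Semantic3D": 1.25,
    "SemanticKITTI": 2.0,
    "S3DIS": 1.5,
    "ShapeNetSEM": 1.25,
}

print(mean_speedup(speedups, speedups.keys()))  # all four datasets prepared
print(mean_speedup(speedups, ["Semantic3D"]))   # only the required dataset
```

With these made-up values, the average over all four datasets differs from the average over the required dataset alone, which is the kind of variability to expect in the averaged figures.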
The following steps elaborate on the details of each experiment.
The following command generates tests to verify that Minuet produces the same outputs (within a tolerable error) as the baseline frameworks across all datasets:
# -L: either "minkowski" or "torchsparse"
# -D: path to any json file in the configs/ folder
# -M: either "SparseResNet21D" or "SparseResUNet42" (MinkUNet42)
# -T: the number of tests
# --eps: the threshold for testing the equality of two floating-point numbers
python3 scripts/verify.py \
-L <library> \
-D <dataset_config> \
-M <model> \
-T <number_of_tests> \
--eps <eps>
For example:
# Verify Minuet's implementation of the SparseResNet21D model
# with MinkowskiEngine's implementation
python3 scripts/verify.py \
-L minkowski \
-D configs/semantic_kitti.json \
-M SparseResNet21D \
-T 5 \
--eps 1e-6
The script should generate output similar to the following:
Test 0:
Coordinates are equal? True
Features have less than 1e-06 error? True
This step automates the verification, which makes it convenient to confirm that Minuet's implementation of sparse convolution has the same semantics as prior frameworks (up to the chosen error threshold).
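The two checks reported per test (equal coordinates, features within the threshold) can be sketched in plain Python as follows. This is an illustration only, not the logic of scripts/verify.py: outputs_match is a hypothetical helper, and sorting by coordinate before comparing is an assumption made so that the result does not depend on the order in which a library emits its output points.

```python
def outputs_match(coords_a, feats_a, coords_b, feats_b, eps=1e-6):
    """Return (coordinates_equal, features_within_eps) for two sparse outputs.

    coords_*: list of integer coordinate tuples (assumed unique);
    feats_*: list of feature vectors (lists of floats), one per coordinate.
    """
    # Pair each coordinate with its features and sort by coordinate.
    a = sorted(zip(coords_a, feats_a))
    b = sorted(zip(coords_b, feats_b))
    coords_equal = [c for c, _ in a] == [c for c, _ in b]
    if not coords_equal:
        return False, False
    # Every feature element must differ by less than eps.
    feats_close = all(
        len(fa) == len(fb) and all(abs(x - y) < eps for x, y in zip(fa, fb))
        for (_, fa), (_, fb) in zip(a, b)
    )
    return True, feats_close
```

For example, two outputs that list the same points in a different order and differ by 5e-7 in one feature would pass with a threshold of 1e-6 and fail with a threshold of 1e-8.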
The following script collects raw performance numbers for end-to-end evaluations:
# -D: path to any json file in the configs/ folder
# -L: either "minkowski", "torchsparse", or "minuet"
# -M: either "SparseResNet21D" or "SparseResUNet42" (MinkUNet42)
python3 scripts/benchmark_end_to_end.py \
-D <dataset_config> \
-L <library> \
-M <model>
For example:
python3 scripts/benchmark_end_to_end.py \
-D configs/semantic_kitti.json \
-L minuet \
-M SparseResNet21D
The command should generate a table similar to the following:
+---------+--------------------+
| Field   | Value              |
+=========+====================+
| latency | 12.548677363557816 |
+---------+--------------------+
The latency field shows the average latency (in milliseconds) of the end-to-end inference of the given model.
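When collecting numbers by hand, the printed table can be turned into a dictionary for further processing. The following is a minimal sketch, not part of the artifact: parse_result_table is an illustrative name, and the parser assumes the two-column "| field | value |" layout shown above.

```python
def parse_result_table(text):
    """Parse "| field | value |" rows into a {field: float} dict."""
    results = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines such as +=====+=====+
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 2:
            continue
        field, value = cells
        try:
            results[field] = float(value)
        except ValueError:
            pass  # header row ("Field", "Value") or another non-numeric cell
    return results
```

The same helper also works for the layer-wise and mapping benchmark tables, since they use the same two-column layout with different field names.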
We provide a script to automatically collect all results needed to reproduce the end-to-end performance evaluation figure.
bash automate/1_end_to_end.sh
The generated figure will be at figures/figure9_end_to_end_speed_up.<gpu>.pdf, where <gpu> denotes the GPU used for the benchmarks.
The following script collects raw performance numbers for layer-wise evaluations:
# -D: path to any json file in the configs/ folder
# -L: either "minkowski", "torchsparse", or "minuet"
# -K: the kernel size for the layer
# --channels: the number of input and output channels
python3 scripts/benchmark_layerwise.py \
-D <dataset_config> \
-L <library> \
-K <kernel_size> \
--channels <in_channels> <out_channels>
For example,
python3 scripts/benchmark_layerwise.py \
-D configs/semantic_kitti.json \
-L minuet \
-K 3 \
--channels 32 32
The command above generates a table similar to the following:
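+--------------+----------------------+
| Field        | Value                |
+==============+======================+
| latency_full | 0.8304032027721405   |
+--------------+----------------------+
| latency_gmas | 0.45921599864959717  |
+--------------+----------------------+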
The latency_full field shows the average latency (in milliseconds) of the layer execution with the given channel sizes. The latency_gmas field shows the latency of the Gather-MatMul-Scatter step, which is used for Step 3.5.
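The two fields can be combined to see how much of the layer latency is spent in the Gather-MatMul-Scatter step. The following is a small worked calculation using the sample values from the table above; gmas_share is an illustrative helper, not part of the artifact.

```python
def gmas_share(latency_full, latency_gmas):
    """Fraction of the layer latency spent in the Gather-MatMul-Scatter step."""
    if latency_full <= 0:
        raise ValueError("latency_full must be positive")
    return latency_gmas / latency_full


# Sample values from the table above (milliseconds).
share = gmas_share(0.8304032027721405, 0.45921599864959717)
print(f"GMaS share: {share:.1%}, remainder of the layer: {1 - share:.1%}")
```

For this sample, a little over half of the layer latency falls in the Gather-MatMul-Scatter step; the remainder covers everything else in the layer execution.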
We provide a script to automatically collect all results needed to reproduce the layer-wise performance evaluation figure.
bash automate/2_layerwise_gather_matmul_scatter_step.sh
The generated figure will be at figures/figure10_layer_wise_speedup.<gpu>.pdf, where <gpu> denotes the GPU used for the benchmarks.
The following script collects raw performance numbers for the mapping step evaluations:
# -D: path to any json file in the configs/ folder
# -L: either "minkowski", "torchsparse", or "minuet"
# -K: the kernel size for the layer
python3 scripts/benchmark_mapping.py \
-D <dataset_config> \
-L <library> \
-K <kernel_size>
For example,
python3 scripts/benchmark_mapping.py \
-D configs/semantic_kitti.json \
-L minuet \
-K 3
will generate the following table:
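+---------------+---------------------+
| Field         | Value               |
+===============+=====================+
| latency_build | 0.18038080185651778 |
+---------------+---------------------+
| latency_query | 0.14938880056142806 |
+---------------+---------------------+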
The latency_query field shows the average latency (in milliseconds) of querying (for building kernel maps). The latency_build field shows the average latency (in milliseconds) of building the sorted tables (Minuet) or hash tables (MinkowskiEngine, TorchSparse) for the queries.
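The split between the build and query phases can be illustrated with a CPU-only sketch of a sorted-table lookup. This is a conceptual illustration under stated assumptions, not Minuet's GPU implementation: the function names are hypothetical, and the use of binary search over a lexicographically sorted coordinate list is an assumption about how a sorted table would be queried.

```python
from bisect import bisect_left


def build_sorted_table(coords):
    """Build phase: sort (coordinate, original_index) pairs once."""
    return sorted((tuple(c), i) for i, c in enumerate(coords))


def query_kernel_map(table, out_coords, offsets):
    """Query phase: for every kernel offset, find inputs at out_coord + offset.

    Returns {offset: [(input_index, output_index), ...]}.
    """
    keys = [c for c, _ in table]
    kernel_map = {}
    for off in offsets:
        pairs = []
        for j, q in enumerate(out_coords):
            target = tuple(a + b for a, b in zip(q, off))
            pos = bisect_left(keys, target)  # binary search in the sorted table
            if pos < len(keys) and keys[pos] == target:
                pairs.append((table[pos][1], j))
        kernel_map[tuple(off)] = pairs
    return kernel_map


if __name__ == "__main__":
    coords = [(1, 0, 0), (0, 0, 0), (0, 1, 0)]
    table = build_sorted_table(coords)
    print(query_kernel_map(table, coords, [(0, 0, 0), (1, 0, 0)]))
```

A hash-table baseline would replace the build phase with inserting each coordinate into a hash map and the query phase with one hash lookup per (output coordinate, offset) pair; the benchmark reports the two phases separately for each library.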
We provide a script to automatically collect all results needed to reproduce the mapping step performance evaluation figure.
bash automate/3_mapping_step.sh
Along with the JSON-formatted results generated in the results/ folder, three figures will be generated from the results: figures/figure11a_mapping_hit_ratio.<gpu>.pdf, figures/figure11b_mapping_query_time.<gpu>.pdf, and figures/figure12_mapping_build_time.<gpu>.pdf, which correspond to the L2 hit ratio, the query latency, and the build latency, respectively.
If Figure 11a is not generated, the most likely cause is missing permission to access the GPU performance counters. Please check whether the ERR_NVGPUCTRPERM error appears in any of the results/*.csv files and, if so, follow the instructions in Hardware & Software Requirements to enable access to these performance counters.
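The check for the permission error can be automated with a few lines of Python. This is a minimal sketch, not part of the artifact: files_with_permission_error is an illustrative name, and the sketch simply searches each CSV file in the results/ folder for the error string.

```python
from pathlib import Path


def files_with_permission_error(results_dir="results", marker="ERR_NVGPUCTRPERM"):
    """Return the sorted list of CSV files in results_dir that mention marker."""
    hits = []
    for path in sorted(Path(results_dir).glob("*.csv")):
        # errors="replace" keeps the scan going even if a file has odd bytes.
        if marker in path.read_text(errors="replace"):
            hits.append(path)
    return hits


if __name__ == "__main__":
    affected = files_with_permission_error()
    if affected:
        print("Permission error found in:")
        for path in affected:
            print(f"  {path}")
    else:
        print("No ERR_NVGPUCTRPERM entries found.")
```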
As described in Step 3.3, the performance numbers for the Gather-MatMul-Scatter step are already collected in the layer-wise evaluations. Thus, no further action is required in this step.
The scripts for reproducing the layer-wise performance figure also generate the figure for the evaluation of the Gather-MatMul-Scatter step, which is located at figures/figure14_gather_gemm_scatter_layerwise_speedup.<gpu>.pdf, where <gpu> denotes the GPU used for the benchmarks.
Please refer to LICENSE for details.