🚨 February 2024: Important Sparsify Update
The Neural Magic team is pausing the Sparsify Alpha at this time. We are refocusing efforts around an exciting new project to be announced in the coming months. Thank you for your continued support, and stay tuned.
🚨 October 2023: Important Sparsify Announcement
Given our new focus on enabling sparse large language models (LLMs) to run competitively on CPUs, Sparsify Alpha is undergoing upgrades to focus on fine-tuning and optimizing LLMs. This means that we will no longer be providing bug fixes, prioritizing support, or building new features and integrations for non-LLM flows including the CV and NLP Sparsify Pathways.
Neural Magic is super excited about these new efforts in building Sparsify into the best LLM fine-tuning and optimization tool on the market over the coming months and we cannot wait to share more soon. Thanks for your continued support!
🚨 July 2023: Sparsify's next generation is now in alpha as of version 1.6.0!
Sparsify enables you to accelerate inference without sacrificing accuracy by applying state-of-the-art pruning, quantization, and distillation algorithms to neural networks with a simple web application and one-command API calls.
Sparsify empowers you to compress models through two components:
- Sparsify Cloud - a web application that allows you to create and manage Sparsify Experiments, explore hyperparameters, predict performance, and compare results across both Experiments and deployment scenarios.
- Sparsify CLI/API - a Python package and GitHub repository that allows you to run Sparsify Experiments locally, sync with the Sparsify Cloud, and integrate them into your workflows.
This quickstart details several pathways you can work through. We encourage you to explore one for Sparsify's full benefits. When you finish the quickstart, sparsifying your models is as easy as:
sparsify.run sparse-transfer --use-case image_classification --data imagenette --optim-level 0.5
First, verify that you have the correct software and hardware to run the Sparsify Alpha.
Software
Sparsify is tested on Python 3.8 and 3.10, ONNX 1.5.0-1.12.0, ONNX opset version 11+, and manylinux-compliant systems. Sparsify is not supported natively on Windows or macOS.
Additionally, installation from PyPI requires pip 20.3+.
Hardware
Sparsify requires a GPU with CUDA + CuDNN in order to sparsify neural networks. We recommend you use a Linux system with a GPU that has a minimum of 16GB of GPU Memory, 128GB of RAM, 4 CPU cores, and is CUDA-enabled. If you are sparsifying a very large model, you may need more RAM than the recommended 128GB. If you encounter issues setting up your training environment, file a GitHub issue.
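The software requirements above can be spot-checked before installing. The following is a minimal sketch and not part of Sparsify: the function names are hypothetical, and it covers only the Python version, operating system, and pip version stated in this section (GPU, CUDA, and memory checks are left out).

```python
import platform
import re
import sys
from importlib import metadata

# Values taken from the requirements listed above.
TESTED_PYTHON = {(3, 8), (3, 10)}
MIN_PIP = (20, 3)


def parse_major_minor(version):
    """Return (major, minor) from a version string such as '23.0.1'."""
    numbers = [int(n) for n in re.findall(r"\d+", version)[:2]]
    while len(numbers) < 2:
        numbers.append(0)
    return tuple(numbers)


def environment_warnings(python_version, system, pip_version):
    """Collect warnings; an empty list means no issue was found by these checks."""
    warnings = []
    if tuple(python_version[:2]) not in TESTED_PYTHON:
        warnings.append("Python version is not one of the tested versions (3.8, 3.10)")
    if system != "Linux":
        warnings.append("Sparsify is tested on manylinux systems; detected " + system)
    if parse_major_minor(pip_version) < MIN_PIP:
        warnings.append("pip 20.3+ is required for installation from PyPI")
    return warnings


if __name__ == "__main__":
    try:
        pip_version = metadata.version("pip")
    except metadata.PackageNotFoundError:
        pip_version = "0"
    for warning in environment_warnings(sys.version_info, platform.system(), pip_version):
        print("WARNING:", warning)
```

An empty result only means these three checks passed; it does not confirm that the GPU and CUDA setup is usable.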
Creating a new one-time account is simple and free.
An account is required to manage your Experiments and API keys.
Visit Neural Magic's Web App Platform and create an account by entering your email, name, and a unique password.
If you already have a Neural Magic Account, sign in with your email.
pip is the preferred method for installing Sparsify.
It is advised to create a fresh virtual environment to avoid dependency issues.
Install with pip using:
pip install sparsify-nightly
Next, with Sparsify installed on your training hardware:
- Authorize the local CLI to access your account by running the sparsify.login command and providing your API key.
- Locate your API key on the homepage of the Sparsify Cloud under the 'Get set up' modal, and copy the command or the API key itself.
- Run the following command:
sparsify.login API_KEY
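If you script the login step, one option is to read the API key from an environment variable instead of hard-coding it. This is a sketch under assumptions: the variable name SPARSIFY_API_KEY and both helper functions are hypothetical, and the only Sparsify interface used is the sparsify.login API_KEY command shown above.

```python
import os
import subprocess


def build_login_command(environ):
    """Return the argument list for `sparsify.login API_KEY`."""
    # SPARSIFY_API_KEY is a hypothetical variable name chosen for this sketch.
    api_key = environ.get("SPARSIFY_API_KEY", "").strip()
    if not api_key:
        raise ValueError("Set SPARSIFY_API_KEY to the key shown in Sparsify Cloud")
    return ["sparsify.login", api_key]


def login(environ=os.environ):
    """Run the login command; requires Sparsify to be installed."""
    subprocess.run(build_login_command(environ), check=True)
```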
Experiments are the core of sparsifying a model. They allow you to apply sparsification algorithms to a dataset and model through three Experiment types, each detailed below: One-Shot, Sparse-Transfer, and Training-Aware.
All Experiments are run locally on your training hardware and can be synced with the cloud for further analysis and comparison, using Sparsify's two components:
- Sparsify Cloud - explore hyperparameters, predict performance, and generate the desired CLI/API command.
- Sparsify CLI/API - run an experiment.
Sparsity | Sparsification Speed | Accuracy |
---|---|---|
++ | +++++ | +++ |
One-Shot Experiments quickly sparsify your model post-training, providing a 3-5x speedup with minimal accuracy loss, ideal for quick model optimization without retraining your model.
To run a One-Shot Experiment for your model, dataset, and use case, use the following command:
sparsify.run one-shot --use-case USE_CASE --model MODEL --data DATASET --optim-level OPTIM_LEVEL
For example, to sparsify a ResNet-50 model on the ImageNet dataset for image classification, run the following commands:
wget https://public.neuralmagic.com/datasets/cv/classification/imagenet_calibration.tar.gz
tar -xzf imagenet_calibration.tar.gz -C ./imagenet_calibration
sparsify.run one-shot --use-case image_classification --model "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/base-none" --data ./imagenet_calibration --optim-level 0.5
Or, to sparsify a BERT model on the SST2 dataset for sentiment analysis, run the following commands:
wget https://public.neuralmagic.com/datasets/nlp/text_classification/sst2_bert_calibration.tar.gz
tar -xzf sst2_bert_calibration.tar.gz
sparsify.run one-shot --use-case text_classification --model "zoo:nlp/sentiment_analysis/bert-base/pytorch/huggingface/sst2/base-none" --data ./sst2_bert_calibration --optim-level 0.5
To dive deeper into One-Shot Experiments, read through the One-Shot Experiment Guide.
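When integrating Experiments into a workflow, it can help to assemble the sparsify.run command programmatically. The sketch below is a hypothetical helper, not a Sparsify API: it uses only the flags shown in this quickstart, and it assumes (from the MODEL versus OPTIONAL_MODEL placeholders) that only One-Shot Experiments need an explicit model.

```python
import shlex

EXPERIMENT_TYPES = ("one-shot", "sparse-transfer", "training-aware")


def build_run_command(experiment_type, use_case, data, optim_level, model=None):
    """Return the argument list for a `sparsify.run` invocation."""
    if experiment_type not in EXPERIMENT_TYPES:
        raise ValueError("Unknown experiment type: " + experiment_type)
    # Assumption: the quickstart marks the model as optional for the other types.
    if experiment_type == "one-shot" and model is None:
        raise ValueError("One-Shot Experiments need a model")
    command = ["sparsify.run", experiment_type, "--use-case", use_case]
    if model is not None:
        command += ["--model", model]
    command += ["--data", str(data), "--optim-level", str(optim_level)]
    return command


if __name__ == "__main__":
    command = build_run_command(
        "one-shot",
        "image_classification",
        "./imagenet_calibration",
        0.5,
        model="zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/base-none",
    )
    # Print a copy-pasteable shell command instead of running it.
    print(shlex.join(command))
```

Returning a list rather than a string lets the same value be passed to subprocess.run without shell quoting concerns.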
Note, One-Shot Experiments currently require the model to be in an ONNX format and the dataset to be in a NumPy format. More details are provided in the One-Shot Experiment Guide.
Sparsity | Sparsification Speed | Accuracy |
---|---|---|
++++ | ++++ | +++++ |
Sparse-Transfer Experiments quickly create a smaller and faster model for your dataset by transferring from a SparseZoo pre-sparsified foundational model, providing a 5-10x speedup with minimal accuracy loss, ideal for quick model optimization without retraining your model.
To run a Sparse-Transfer Experiment for your model (optional), dataset, and use case, run the following command:
sparsify.run sparse-transfer --use-case USE_CASE --model OPTIONAL_MODEL --data DATASET --optim-level OPTIM_LEVEL
For example, to sparse transfer a SparseZoo model to the Imagenette dataset for image classification, run the following command:
sparsify.run sparse-transfer --use-case image_classification --data imagenette --optim-level 0.5
Or, to sparse transfer a SparseZoo model to the SST2 dataset for sentiment analysis, run the following command:
sparsify.run sparse-transfer --use-case text_classification --data sst2 --optim-level 0.5
To dive deeper into Sparse-Transfer Experiments, read through the Sparse-Transfer Experiment Guide.
Note, Sparse-Transfer Experiments require the model to be saved in a PyTorch format corresponding to the underlying integration such as Ultralytics YOLOv5 or Hugging Face Transformers. Datasets must additionally match the expected format of the underlying integration. More details and exact formats are provided in the Sparse-Transfer Experiment Guide.
Sparsity | Sparsification Speed | Accuracy |
---|---|---|
+++++ | ++ | +++++ |
Training-Aware Experiments sparsify your model during training, providing a 6-12x speedup with minimal accuracy loss, ideal for thorough model optimization when the best performance and accuracy are required.
To run a Training-Aware Experiment for your model, dataset, and use case, run the following command:
sparsify.run training-aware --use-case USE_CASE --model OPTIONAL_MODEL --data DATASET --optim-level OPTIM_LEVEL
For example, to sparsify a ResNet-50 model on the Imagenette dataset for image classification, run the following command:
sparsify.run training-aware --use-case image_classification --model "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenette/base-none" --data imagenette --optim-level 0.5
Or, to sparsify a BERT model on the SST2 dataset for sentiment analysis, run the following command:
sparsify.run training-aware --use-case text_classification --model "zoo:nlp/sentiment_analysis/bert-base/pytorch/huggingface/sst2/base-none" --data sst2 --optim-level 0.5
To dive deeper into Training-Aware Experiments, read through the Training-Aware Experiment Guide.
Note that Training-Aware Experiments require the model to be saved in a PyTorch format corresponding to the underlying integration such as Ultralytics YOLOv5 or Hugging Face Transformers. Datasets must additionally match the expected format of the underlying integration. More details and exact formats are provided in the Training-Aware Experiment Guide.
Once you have run your Experiment, the results, logs, and deployment files will be saved under the current working directory in the following format:
[EXPERIMENT_TYPE]_[USE_CASE]_[DATE_TIME]
├── deployment
│   ├── model.onnx
│   └── *supporting files*
├── logs
│   └── *logs*
├── training_artifacts
│   └── *training artifacts*
└── *metrics and results*
You can compare the accuracy by looking through the metrics printed out to the console and the metrics saved in the experiment directory. Additionally, you can use DeepSparse to compare the inference performance on your CPU deployment hardware.
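To pick up the output of the most recent run in a script, you can search the working directory for the layout shown above. This is a minimal sketch with hypothetical helper names; because the exact date-time format of the directory name is not specified here, it relies on file modification time rather than parsing the name.

```python
from pathlib import Path


def latest_experiment_dir(root="."):
    """Return the newest directory that contains deployment/model.onnx, or None."""
    candidates = [
        path
        for path in Path(root).iterdir()
        if path.is_dir() and (path / "deployment" / "model.onnx").is_file()
    ]
    if not candidates:
        return None
    # Newest by modification time; the directory name itself is not parsed.
    return max(candidates, key=lambda path: path.stat().st_mtime)


def summarize(experiment_dir):
    """Report which of the documented subdirectories are present."""
    experiment_dir = Path(experiment_dir)
    return {
        name: (experiment_dir / name).is_dir()
        for name in ("deployment", "logs", "training_artifacts")
    }


if __name__ == "__main__":
    found = latest_experiment_dir(".")
    if found is None:
        print("No experiment directory found")
    else:
        print(found, summarize(found))
```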
Note: In the near future, you will be able to visualize the results in Sparsify Cloud, simulate other scenarios and hyperparameters, compare the results to other Experiments, and package for your deployment scenario.
To run a benchmark on your deployment hardware, use the deepsparse.benchmark command with your original model and the new optimized model.
This will run a number of inferences to simulate a real-world scenario and print out the results.
It's as simple as running the following command:
deepsparse.benchmark --model_path MODEL --scenario SCENARIO
For example, to benchmark a dense ResNet-50 model, run the following command:
deepsparse.benchmark --model_path "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenette/base-none" --scenario sync
This can then be compared to the sparsified ResNet-50 model with the following command:
deepsparse.benchmark --model_path "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_quant-none" --scenario sync
The output will look similar to the following:
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.6.0.20230629 COMMUNITY | (fc8b788a) (release) (optimized) (system=avx512, binary=avx512)
deepsparse.benchmark.benchmark_model INFO deepsparse.engine.Engine:
onnx_file_path: ./model.onnx
batch_size: 1
num_cores: 1
num_streams: 1
scheduler: Scheduler.default
fraction_of_supported_ops: 0.9981
cpu_avx_type: avx512
cpu_vnni: False
Original Model Path: ./model.onnx
Batch Size: 1
Scenario: sync
Throughput (items/sec): 134.5611
Latency Mean (ms/batch): 7.4217
Latency Median (ms/batch): 7.4245
Latency Std (ms/batch): 0.0264
Iterations: 1346
See the DeepSparse Benchmarking User Guide for more information on benchmarking.
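To compare the dense and sparsified runs numerically, you can capture the printed output of each benchmark and extract the metrics. The sketch below assumes the output keeps the "Name: value" line format shown above; the function names are hypothetical and this is not a DeepSparse API.

```python
def parse_benchmark_output(text):
    """Map each 'Name: number' line to a float, skipping non-numeric lines."""
    metrics = {}
    for line in text.splitlines():
        name, separator, value = line.strip().rpartition(":")
        if not separator:
            continue
        try:
            metrics[name.strip()] = float(value)
        except ValueError:
            # Lines such as 'Scenario: sync' or the version banner are ignored.
            continue
    return metrics


def speedup(dense_output, sparse_output, metric="Throughput (items/sec)"):
    """Return the sparse-to-dense ratio for a throughput-style metric."""
    dense = parse_benchmark_output(dense_output)[metric]
    sparse = parse_benchmark_output(sparse_output)[metric]
    return sparse / dense
```

For latency metrics, where lower is better, invert the ratio before interpreting it as a speedup.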
As an optional step to this quickstart, now that you have your optimized model, you are ready for inferencing. To get the most inference performance out of your optimized model, we recommend you deploy on Neural Magic's DeepSparse. DeepSparse is built to get the best performance out of optimized models on CPUs.
DeepSparse Server takes in a task and a model path and enables you to serve models and Pipelines for deployment over HTTP.
You can deploy any ONNX model using DeepSparse Server with the following command:
deepsparse.server --task USE_CASE --model_path MODEL_PATH
Where USE_CASE is the use case of your Experiment and MODEL_PATH is the path to the deployment folder from the Experiment.
For example, to deploy a sparsified ResNet-50 model, run the following command:
deepsparse.server --task image_classification --model_path "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_quant-none"
If you're not ready to deploy, congratulations on completing the quickstart!
- Sparsify Cloud User Guide
- Sparsify Datasets Guide
- Sparsify Models Guide
- One-Shot Experiments Guide
- Sparse-Transfer Experiments Guide
- Training-Aware Experiments Guide
Now that you have explored Sparsify [Alpha], here are other related resources.
Report UI issues and CLI errors, submit bug reports, and provide general feedback about the product to the Sparsify team via the Neural Magic Slack Channel, or via GitHub Issues. Alpha support is provided through those channels.
Sparsify Alpha is a pre-release version of Sparsify that is still in active development. The product is not yet ready for production use; APIs and UIs are subject to change. There may be bugs in the Alpha version, which we hope to have fixed before Beta and then a general Q3 2023 release. The feedback you provide on quality and usability helps us identify issues, fix them, and make Sparsify even better. This information is used internally by Neural Magic solely for that purpose. It is not shared or used in any other way.
That being said, we are excited to share this release and hear what you think. Thank you in advance for your feedback and interest!
Official builds are hosted on PyPI:
- stable: sparsify
- nightly (dev): sparsify-nightly
Additionally, more information can be found via GitHub Releases.
The project is licensed under the Apache License Version 2.0.
We appreciate contributions to the code, examples, integrations, and documentation as well as bug reports and feature requests! Learn how here.
For user help or questions about Sparsify, sign up or log in to our Neural Magic Community Slack. We are growing the community member by member and happy to see you there. Bugs, feature requests, or additional questions can also be posted to our GitHub Issue Queue.
You can get the latest news, webinar and event invites, research papers, and other ML Performance tidbits by subscribing to the Neural Magic community.
For more general questions about Neural Magic, please fill out this form.
Find this project useful in your research or other communications? Please consider citing:
@InProceedings{
pmlr-v119-kurtz20a,
title = {Inducing and Exploiting Activation Sparsity for Fast Inference on Deep Neural Networks},
author = {Kurtz, Mark and Kopinsky, Justin and Gelashvili, Rati and Matveev, Alexander and Carr, John and Goin, Michael and Leiserson, William and Moore, Sage and Nell, Bill and Shavit, Nir and Alistarh, Dan},
booktitle = {Proceedings of the 37th International Conference on Machine Learning},
pages = {5533--5543},
year = {2020},
editor = {Hal Daumé III and Aarti Singh},
volume = {119},
series = {Proceedings of Machine Learning Research},
address = {Virtual},
month = {13--18 Jul},
publisher = {PMLR},
pdf = {http://proceedings.mlr.press/v119/kurtz20a/kurtz20a.pdf},
url = {http://proceedings.mlr.press/v119/kurtz20a.html},
abstract = {Optimizing convolutional neural networks for fast inference has recently become an extremely active area of research. One of the go-to solutions in this context is weight pruning, which aims to reduce computational and memory footprint by removing large subsets of the connections in a neural network. Surprisingly, much less attention has been given to exploiting sparsity in the activation maps, which tend to be naturally sparse in many settings thanks to the structure of rectified linear (ReLU) activation functions. In this paper, we present an in-depth analysis of methods for maximizing the sparsity of the activations in a trained neural network, and show that, when coupled with an efficient sparse-input convolution algorithm, we can leverage this sparsity for significant performance gains. To induce highly sparse activation maps without accuracy loss, we introduce a new regularization technique, coupled with a new threshold-based sparsification method based on a parameterized activation function called Forced-Activation-Threshold Rectified Linear Unit (FATReLU). We examine the impact of our methods on popular image classification models, showing that most architectures can adapt to significantly sparser activation maps without any accuracy loss. Our second contribution is showing that these compression gains can be translated into inference speedups: we provide a new algorithm to enable fast convolution operations over networks with sparse activations, and show that it can enable significant speedups for end-to-end inference on a range of popular models on the large-scale ImageNet image classification task on modern Intel CPUs, with little or no retraining cost.}
}
@misc{
singh2020woodfisher,
title={WoodFisher: Efficient Second-Order Approximation for Neural Network Compression},
author={Sidak Pal Singh and Dan Alistarh},
year={2020},
eprint={2004.14340},
archivePrefix={arXiv},
primaryClass={cs.LG}
}