Could not run 'aten::_foreach_norm.Scalar' with arguments from the 'SparseCUDA' backend #235
Comments
help

The training data and classification labels are both in Chinese, and training with X-Transformer fails. Does X-Transformer not support Chinese?

Should I change bert to a roberta or bert-chinese model?

Hi @runningabcd, this is a known issue with torch 2.0 and is fixed in a PR; the fix will be included in the next release. Downgrading torch to below 2.0 is a temporary fix.

Thank you very much, I will try it.

It works!
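For anyone hitting this before the next release: pinning the install with `pip install "torch<2.0"` is one way to apply the temporary fix (the version pin is standard pip syntax, not a pecos-specific requirement).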
Description
When I train an XTransformer model with pecos, training fails with the following output:
```
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForXMC: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight']
/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:407: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
Constructed training corpus len=679174, training label matrix with shape=(679174, 679174) and nnz=1429299
Constructed training feature matrix with shape=(679174, 1134376) and nnz=1195014
training start >>>>>>>>
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForXMC: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight']
Traceback (most recent call last):
  File "/home/Extreme_Label_Classification/tfidf/train.py", line 38, in <module>
    custom_xtf = XTransformer.train(prob)
  File "/usr/local/lib/python3.10/dist-packages/pecos/xmc/xtransformer/model.py", line 447, in train
    res_dict = TransformerMatcher.train(
  File "/usr/local/lib/python3.10/dist-packages/pecos/xmc/xtransformer/matcher.py", line 1382, in train
    matcher.fine_tune_encoder(prob, val_prob=val_prob, val_csr_codes=val_csr_codes)
  File "/usr/local/lib/python3.10/dist-packages/pecos/xmc/xtransformer/matcher.py", line 1122, in fine_tune_encoder
    torch.nn.utils.clip_grad_norm_(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/utils/clip_grad.py", line 55, in clip_grad_norm_
    norms.extend(torch._foreach_norm(grads, norm_type))
NotImplementedError: Could not run 'aten::_foreach_norm.Scalar' with arguments from the 'SparseCUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::_foreach_norm.Scalar' is only available for these backends: [CPU, CUDA, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradHIP, AutogradXLA, AutogradMPS, AutogradIPU, AutogradXPU, AutogradHPU, AutogradVE, AutogradLazy, AutogradMeta, AutogradMTIA, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, AutogradNestedTensor, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher].
CPU: registered at aten/src/ATen/RegisterCPU.cpp:31034 [kernel]
CUDA: registered at aten/src/ATen/RegisterCUDA.cpp:43986 [kernel]
BackendSelect: fallthrough registered at ../aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Python: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:144 [backend fallback]
FuncTorchDynamicLayerBackMode: registered at ../aten/src/ATen/functorch/DynamicLayer.cpp:491 [backend fallback]
Functionalize: registered at ../aten/src/ATen/FunctionalizeFallbackKernel.cpp:280 [backend fallback]
Named: registered at ../aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Conjugate: registered at ../aten/src/ATen/ConjugateFallback.cpp:17 [backend fallback]
Negative: registered at ../aten/src/ATen/native/NegateFallback.cpp:19 [backend fallback]
ZeroTensor: registered at ../aten/src/ATen/ZeroTensorFallback.cpp:86 [backend fallback]
ADInplaceOrView: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:63 [backend fallback]
AutogradOther: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
AutogradCPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
AutogradCUDA: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
AutogradHIP: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
AutogradXLA: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
AutogradMPS: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
AutogradIPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
AutogradXPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
AutogradHPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
AutogradVE: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
AutogradLazy: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
AutogradMeta: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
AutogradMTIA: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
AutogradPrivateUse1: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
AutogradPrivateUse2: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
AutogradPrivateUse3: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
AutogradNestedTensor: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
Tracer: registered at ../torch/csrc/autograd/generated/TraceType_2.cpp:16726 [kernel]
AutocastCPU: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:487 [backend fallback]
AutocastCUDA: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:354 [backend fallback]
FuncTorchBatched: registered at ../aten/src/ATen/functorch/LegacyBatchingRegistrations.cpp:815 [backend fallback]
FuncTorchVmapMode: fallthrough registered at ../aten/src/ATen/functorch/VmapModeRegistrations.cpp:28 [backend fallback]
Batched: registered at ../aten/src/ATen/LegacyBatchingRegistrations.cpp:1073 [backend fallback]
VmapMode: fallthrough registered at ../aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
FuncTorchGradWrapper: registered at ../aten/src/ATen/functorch/TensorWrapper.cpp:210 [backend fallback]
PythonTLSSnapshot: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:152 [backend fallback]
FuncTorchDynamicLayerFrontMode: registered at ../aten/src/ATen/functorch/DynamicLayer.cpp:487 [backend fallback]
PythonDispatcher: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:148 [backend fallback]
```

The root cause is that torch 2.0's clip_grad_norm_ defaults to a multi-tensor (foreach) path, so the sparse CUDA gradients produced during pecos' encoder fine-tuning reach torch._foreach_norm, which has no SparseCUDA kernel.
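If patching locally is preferable to downgrading, torch 2.0's clip_grad_norm_ accepts a foreach argument, and foreach=False forces the older per-tensor norm path. A minimal sketch of both the failure and the flag (assumptions: a CUDA device, torch 2.0, and nn.Embedding(sparse=True) standing in for whatever produces pecos' sparse gradients; the pecos PR remains the proper fix):

```python
import torch

# A parameter whose gradient is a SparseCUDA tensor, standing in for the
# sparse gradients produced during pecos' encoder fine-tuning (assumption).
emb = torch.nn.Embedding(100, 16, sparse=True).cuda()
emb(torch.tensor([0, 1, 2], device="cuda")).sum().backward()

# The default (foreach) path reproduces the NotImplementedError above:
#   torch.nn.utils.clip_grad_norm_(emb.parameters(), max_norm=1.0)

# foreach=False computes per-tensor norms via torch.norm instead of
# torch._foreach_norm, sidestepping the missing SparseCUDA kernel.
torch.nn.utils.clip_grad_norm_(emb.parameters(), max_norm=1.0, foreach=False)
```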
How to Reproduce?
The training data looks like this (each line pairs comma-separated label IDs with a Chinese text description):
```
4400,1580,5174 教育培训机构.道口财富是一家教育培训机构,由清控控股旗下公司联合上海陆家嘴旗下公司发起设立,为学员提供财富管理课程和创业金融课程。
5156,1188,1459 场景营销平台.北京蜂巢天下信息技术有限公司项目团队组建于2014年,总部位于北京,是基于Beacon网络的场景营销平台。专注于为本地生活服务商户提供基于场景的优惠分发,为用户提供一键接入身边优惠内容。
5156,1459 定制品在线设计及管理平台.时代定制是一个定制品在线设计及业务管理平台,主要服务于印刷和设计类企业、网站、影楼、文印店。
```
Steps to reproduce
1. Install pecos and torch 2.0 on a CUDA machine.
2. Build the TF-IDF feature matrix and label matrix from the training data above.
3. Call XTransformer.train(prob); the crash occurs once encoder fine-tuning reaches gradient clipping (sketched below).
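A hypothetical reconstruction of the failing train.py (line 38 of the traceback); the data loading is not shown in the issue, so X_text, Y, and X_feat are placeholder names for the parsed corpus, the label matrix, and the TF-IDF features:

```python
from pecos.xmc.xtransformer.model import XTransformer
from pecos.xmc.xtransformer.module import MLProblemWithText

# Placeholders (assumptions): X_text is a list of input strings, Y a
# scipy.sparse label matrix, X_feat a scipy.sparse TF-IDF feature matrix.
prob = MLProblemWithText(X_text, Y, X_feat=X_feat)

# Matches the traceback: raises NotImplementedError for
# 'aten::_foreach_norm.Scalar' on torch 2.0 with CUDA.
custom_xtf = XTransformer.train(prob)
```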
Error message or code output
See the traceback above. [docker stats screenshot attached in the original issue]
Environment
Python 3.10 (packages under /usr/local/lib/python3.10/dist-packages), torch 2.0 (per the maintainer's reply), transformers, pecos, CUDA GPU, running inside Docker.