Could not run 'aten::_foreach_norm.Scalar' with arguments from the 'SparseCUDA' backend #235

Closed · runningabcd opened this issue Jun 24, 2023 · 6 comments
Labels: bug (Something isn't working)

Comments


runningabcd commented Jun 24, 2023

Description

When I train an XTransformer model with PECOS, the following training error occurs:

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForXMC: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForXMC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForXMC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:407: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
  warnings.warn(
Constructed training corpus len=679174, training label matrix with shape=(679174, 679174) and nnz=1429299
Constructed training feature matrix with shape=(679174, 1134376) and nnz=1195014
training start >>>>>>>>
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForXMC: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForXMC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForXMC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Traceback (most recent call last):
  File "/home/Extreme_Label_Classification/tfidf/train.py", line 38, in <module>
    custom_xtf = XTransformer.train(prob)
  File "/usr/local/lib/python3.10/dist-packages/pecos/xmc/xtransformer/model.py", line 447, in train
    res_dict = TransformerMatcher.train(
  File "/usr/local/lib/python3.10/dist-packages/pecos/xmc/xtransformer/matcher.py", line 1382, in train
    matcher.fine_tune_encoder(prob, val_prob=val_prob, val_csr_codes=val_csr_codes)
  File "/usr/local/lib/python3.10/dist-packages/pecos/xmc/xtransformer/matcher.py", line 1122, in fine_tune_encoder
    torch.nn.utils.clip_grad_norm_(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/utils/clip_grad.py", line 55, in clip_grad_norm_
    norms.extend(torch._foreach_norm(grads, norm_type))
NotImplementedError: Could not run 'aten::_foreach_norm.Scalar' with arguments from the 'SparseCUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::_foreach_norm.Scalar' is only available for these backends: [CPU, CUDA, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradHIP, AutogradXLA, AutogradMPS, AutogradIPU, AutogradXPU, AutogradHPU, AutogradVE, AutogradLazy, AutogradMeta, AutogradMTIA, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, AutogradNestedTensor, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher].

CPU: registered at aten/src/ATen/RegisterCPU.cpp:31034 [kernel]
CUDA: registered at aten/src/ATen/RegisterCUDA.cpp:43986 [kernel]
BackendSelect: fallthrough registered at ../aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Python: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:144 [backend fallback]
FuncTorchDynamicLayerBackMode: registered at ../aten/src/ATen/functorch/DynamicLayer.cpp:491 [backend fallback]
Functionalize: registered at ../aten/src/ATen/FunctionalizeFallbackKernel.cpp:280 [backend fallback]
Named: registered at ../aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Conjugate: registered at ../aten/src/ATen/ConjugateFallback.cpp:17 [backend fallback]
Negative: registered at ../aten/src/ATen/native/NegateFallback.cpp:19 [backend fallback]
ZeroTensor: registered at ../aten/src/ATen/ZeroTensorFallback.cpp:86 [backend fallback]
ADInplaceOrView: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:63 [backend fallback]
AutogradOther: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
AutogradCPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
AutogradCUDA: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
AutogradHIP: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
AutogradXLA: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
AutogradMPS: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
AutogradIPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
AutogradXPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
AutogradHPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
AutogradVE: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
AutogradLazy: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
AutogradMeta: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
AutogradMTIA: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
AutogradPrivateUse1: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
AutogradPrivateUse2: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
AutogradPrivateUse3: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
AutogradNestedTensor: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:17472 [autograd kernel]
Tracer: registered at ../torch/csrc/autograd/generated/TraceType_2.cpp:16726 [kernel]
AutocastCPU: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:487 [backend fallback]
AutocastCUDA: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:354 [backend fallback]
FuncTorchBatched: registered at ../aten/src/ATen/functorch/LegacyBatchingRegistrations.cpp:815 [backend fallback]
FuncTorchVmapMode: fallthrough registered at ../aten/src/ATen/functorch/VmapModeRegistrations.cpp:28 [backend fallback]
Batched: registered at ../aten/src/ATen/LegacyBatchingRegistrations.cpp:1073 [backend fallback]
VmapMode: fallthrough registered at ../aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
FuncTorchGradWrapper: registered at ../aten/src/ATen/functorch/TensorWrapper.cpp:210 [backend fallback]
PythonTLSSnapshot: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:152 [backend fallback]
FuncTorchDynamicLayerFrontMode: registered at ../aten/src/ATen/functorch/DynamicLayer.cpp:487 [backend fallback]
PythonDispatcher: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:148 [backend fallback]
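For context, the traceback shows that some gradients are sparse CUDA tensors and that clip_grad_norm_ routes them through torch._foreach_norm, which has no SparseCUDA kernel. A minimal sketch of the same failure mode, assuming torch 2.0.x with a CUDA device (this is an illustration, not the PECOS code itself):

import torch

# Sketch, assuming torch 2.0.x + CUDA: an Embedding with sparse=True produces a
# sparse gradient, and clip_grad_norm_ then calls torch._foreach_norm on it,
# which is not implemented for the SparseCUDA backend.
emb = torch.nn.Embedding(100, 16, sparse=True).cuda()
loss = emb(torch.tensor([1, 2, 3], device="cuda")).sum()
loss.backward()  # emb.weight.grad is a sparse CUDA tensor
torch.nn.utils.clip_grad_norm_(emb.parameters(), max_norm=1.0)  # raises NotImplementedError here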

How to Reproduce?

The training data looks like this (one example per line, Chinese text):
4400,1580,5174 教育培训机构.道口财富是一家教育培训机构,由清控控股旗下公司联合上海陆家嘴旗下公司发起设立,为学员提供财富管理课程和创业金融课程。
5156,1188,1459 场景营销平台.北京蜂巢天下信息技术有限公司项目团队组建于2014年,总部位于北京,是基于Beacon网络的场景营销平台。专注于为本地生活服务商户提供基于场景的优惠分发,为用户提供一键接入身边优惠内容。
5156,1459 定制品在线设计及管理平台.时代定制是一个定制品在线设计及业务管理平台,主要服务于印刷和设计类企业、网站、影楼、文印店。

Steps to reproduce

from pecos.utils.featurization.text.preprocess import Preprocessor
from pecos.xmc.xtransformer.model import XTransformer
from pecos.xmc.xtransformer.module import MLProblemWithText

import os

parsed_result = Preprocessor.load_data_from_file(
    "./training-data.txt",
    "./output-labels.txt",
)
Y = parsed_result["label_matrix"]
X_txt = parsed_result["corpus"]

print(f"Constructed training corpus len={len(X_txt)}, training label matrix with shape={Y.shape} and nnz={Y.nnz}")

vectorizer_config = {
    "type": "tfidf",
    "kwargs": {
        "base_vect_configs": [
            {
                "ngram_range": [1, 2],
                "max_df_ratio": 0.98,
                "analyzer": "word",
            },
        ],
    },
}

tfidf_model = Preprocessor.train(X_txt, vectorizer_config)
X_feat = tfidf_model.predict(X_txt)

print(f"Constructed training feature matrix with shape={X_feat.shape} and nnz={X_feat.nnz}")

prob = MLProblemWithText(X_txt, Y, X_feat=X_feat)
custom_xtf = XTransformer.train(prob)

custom_model_dir = "multi_labels_model_dir"
os.makedirs(custom_model_dir, exist_ok=True)

tfidf_model.save(f"{custom_model_dir}/tfidf_model")
custom_xtf.save(f"{custom_model_dir}/xrt_model")

# custom_xtf = XTransformer.load(f"{custom_model_dir}/xrt_model")
# tfidf_model = Preprocessor.load(f"{custom_model_dir}/tfidf_model")

Error message or code output

Traceback (most recent call last):
  File "/home/Extreme_Label_Classification/tfidf/train.py", line 38, in <module>
    custom_xtf = XTransformer.train(prob)
  File "/usr/local/lib/python3.10/dist-packages/pecos/xmc/xtransformer/model.py", line 447, in train
    res_dict = TransformerMatcher.train(
  File "/usr/local/lib/python3.10/dist-packages/pecos/xmc/xtransformer/matcher.py", line 1382, in train
    matcher.fine_tune_encoder(prob, val_prob=val_prob, val_csr_codes=val_csr_codes)
  File "/usr/local/lib/python3.10/dist-packages/pecos/xmc/xtransformer/matcher.py", line 1122, in fine_tune_encoder
    torch.nn.utils.clip_grad_norm_(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/utils/clip_grad.py", line 55, in clip_grad_norm_
    norms.extend(torch._foreach_norm(grads, norm_type))
NotImplementedError: Could not run 'aten::_foreach_norm.Scalar' with arguments from the 'SparseCUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::_foreach_norm.Scalar' is only available for these backends: [CPU, CUDA, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradHIP, AutogradXLA, AutogradMPS, AutogradIPU, AutogradXPU, AutogradHPU, AutogradVE, AutogradLazy, AutogradMeta, AutogradMTIA, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, AutogradNestedTensor, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher].

Docker stats screenshot:

[Screenshot: 2023-06-20 18:13:35]

Environment

  • Operating system: Ubuntu 22.04 (Docker)
  • Python version: 3.10
  • PECOS version: 1.0.0
  • GPU: NVIDIA-SMI 515.48.07, Driver Version 515.48.07, CUDA Version 11.7
runningabcd added the bug (Something isn't working) label on Jun 24, 2023
@runningabcd (Author)

help

@runningabcd (Author)

The training data and classification labels are both in Chinese, and training fails with X-Transformer. Does X-Transformer not support Chinese?

@runningabcd (Author)

> The training data and classification labels are both in Chinese, and training fails with X-Transformer. Does X-Transformer not support Chinese?

Should I switch from BERT to a RoBERTa or Chinese BERT model?

@jiong-zhang (Contributor)

Hi @runningabcd, this is a known issue with torch 2.0 and has already been fixed in a PR; the fix will be included in the next release. Downgrading torch to below 2.0 is a temporary workaround.
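As a rough sketch of that temporary workaround (the exact version pin below is illustrative, not taken from the PR):

# Pin torch below 2.0 in the training environment, picking a build that matches
# your CUDA version, e.g.:
#   pip install "torch<2.0"
# Then confirm the downgrade took effect before re-running training:
import torch
print(torch.__version__)  # should report a 1.x release after the downgrade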

@runningabcd (Author)

> Hi @runningabcd, this is a known issue with torch 2.0 and has already been fixed in a PR; the fix will be included in the next release. Downgrading torch to below 2.0 is a temporary workaround.

Thank you very much, I will try it.

@runningabcd (Author)

> Hi @runningabcd, this is a known issue with torch 2.0 and has already been fixed in a PR; the fix will be included in the next release. Downgrading torch to below 2.0 is a temporary workaround.

It works!
Thanks again.
