Remove the Bagua integration #19445

Merged · 2 commits merged into master on Feb 12, 2024
Conversation

awaelchli (Contributor) commented on Feb 11, 2024

What does this PR do?

Reasons for the removal:

  • The external integration repo has been archived and unmaintained for more than six months.
  • The bagua package itself has been unmaintained since March 2023.
  • There are no wheels available for recent CUDA versions: the latest CUDA with bagua wheels was 11.7, and the latest supported by the lightning-bagua integration was 11.6.
  • I installed bagua-cuda116, lightning 2.0, the latest lightning-bagua, and pytorch 2.0 with CUDA 11.6, and ran the boring model with Trainer(strategy="bagua", accelerator="gpu", devices=4), which results in a connection error (see the reproduction sketch after the traceback below):
Traceback (most recent call last):
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 42, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 92, in launch
    return function(*args, **kwargs)
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 559, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 911, in _run
    self.strategy.setup(self)
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/lightning_bagua/strategy.py", line 226, in setup
    self._configure_bagua_model(trainer)
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/lightning_bagua/strategy.py", line 238, in _configure_bagua_model
    self.model = self._setup_model(model)
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/lightning_bagua/strategy.py", line 250, in _setup_model
    return BaguaDistributedDataParallel(
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/bagua/torch_api/data_parallel/distributed.py", line 148, in __init__
    self.inner = BaguaDistributedDataParallel(
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/bagua/torch_api/data_parallel/bagua_distributed.py", line 153, in __init__
    self._bagua_init_algorithm()
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/bagua/torch_api/data_parallel/bagua_distributed.py", line 400, in _bagua_init_algorithm
    self._bagua_autotune_register_tensors()
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/bagua/torch_api/data_parallel/bagua_distributed.py", line 368, in _bagua_autotune_register_tensors
    rsp = self._bagua_autotune_client.register_tensors(
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/bagua/service/autotune_service.py", line 318, in wrap
    raise e
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/bagua/service/autotune_service.py", line 314, in wrap
    result = request_func(*args, **kwargs)
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/bagua/service/autotune_service.py", line 378, in register_tensors
    rsp = self.session.post(
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/requests/sessions.py", line 637, in post
    return self.request("POST", url, data=data, json=json, **kwargs)
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/requests/adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='127.0.0.1', port=60785): Max retries exceeded with url: /api/v1/register_tensors (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f61b569ae90>: Failed to establish a new connection: [Errno 111] Connection refused'))
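
For reference, a minimal sketch of the multi-GPU run described above. It assumes lightning-bagua is installed so that the "bagua" strategy string is registered; the BoringModel/RandomDataset names are assumed to mirror Lightning's bug-report template (examples/pytorch/bug_report/bug_report_model.py) and may differ slightly from it.

```python
import torch
from torch.utils.data import DataLoader, Dataset
import lightning as L


class RandomDataset(Dataset):
    # Tiny synthetic dataset, as in Lightning's bug-report template.
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        # Return a scalar loss; the value is irrelevant for the repro.
        return self.layer(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    model = BoringModel()
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    # strategy="bagua" is only available with the lightning-bagua plugin installed.
    trainer = L.Trainer(strategy="bagua", accelerator="gpu", devices=4, max_epochs=1)
    trainer.fit(model, train_dataloaders=train_data)
```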

With single-GPU operation, you instead run into a compatibility issue with NumPy:

Traceback (most recent call last):
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/flask/app.py", line 1463, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/flask/app.py", line 872, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/flask/app.py", line 870, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/flask/app.py", line 855, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/bagua/service/autotune_service.py", line 164, in register_tensors
    self.model_dict[model_name] = AutotuneServiceTaskManager(
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/bagua/service/autotune_service.py", line 39, in __init__
    self.inner = AutotuneTaskManager(task_name, is_output_autotune_log)
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/bagua/service/autotune_task_manager.py", line 48, in __init__
    self.bayesian_optimizer = BayesianOptimizer(
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/bagua/service/bayesian_optimizer.py", line 51, in __init__
    self.bayesian_optimizer = skopt.Optimizer(
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/skopt/optimizer/optimizer.py", line 279, in __init__
    self._initial_samples = self._initial_point_generator.generate(
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/skopt/sampler/halton.py", line 102, in generate
    out = space.inverse_transform(np.transpose(out))
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/skopt/space/space.py", line 999, in inverse_transform
    columns.append(dim.inverse_transform(Xt[:, start]))
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/skopt/space/space.py", line 528, in inverse_transform
    inv_transform = super(Integer, self).inverse_transform(Xt)
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/skopt/space/space.py", line 168, in inverse_transform
    return self.transformer.inverse_transform(Xt)
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/skopt/space/transformers.py", line 309, in inverse_transform
    X = transformer.inverse_transform(X)
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/skopt/space/transformers.py", line 275, in inverse_transform
    return np.round(X_orig).astype(np.int)
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/numpy/__init__.py", line 324, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'int'.
`np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations. Did you mean: 'inf'?
Traceback (most recent call last):
  File "/home/adrian/repositories/lightning/examples/pytorch/bug_report/bug_report_model.py", line 68, in <module>
    run()
  File "/home/adrian/repositories/lightning/examples/pytorch/bug_report/bug_report_model.py", line 63, in run
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 520, in fit
    call._call_and_handle_interrupt(
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 42, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 92, in launch
    return function(*args, **kwargs)
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 559, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 911, in _run
    self.strategy.setup(self)
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/lightning_bagua/strategy.py", line 226, in setup
    self._configure_bagua_model(trainer)
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/lightning_bagua/strategy.py", line 238, in _configure_bagua_model
    self.model = self._setup_model(model)
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/lightning_bagua/strategy.py", line 250, in _setup_model
    return BaguaDistributedDataParallel(
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/bagua/torch_api/data_parallel/distributed.py", line 148, in __init__
    self.inner = BaguaDistributedDataParallel(
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/bagua/torch_api/data_parallel/bagua_distributed.py", line 153, in __init__
    self._bagua_init_algorithm()
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/bagua/torch_api/data_parallel/bagua_distributed.py", line 400, in _bagua_init_algorithm
    self._bagua_autotune_register_tensors()
  File "/home/adrian/.conda/envs/lightning-bagua/lib/python3.10/site-packages/bagua/torch_api/data_parallel/bagua_distributed.py", line 372, in _bagua_autotune_register_tensors
    assert rsp.status_code == 200, "Unexpected rsp={}".format(rsp)
AssertionError: Unexpected rsp=<Response [500]>
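
As an aside, the single-GPU failure is the removal of the np.int alias in NumPy 1.24 (deprecated since 1.20): the skopt code pulled in by bagua's autotune service still calls .astype(np.int). The snippet below is a standalone sketch of the incompatibility and the replacements suggested by NumPy's own error message; it is not code from either package.

```python
import numpy as np

x = np.array([1.2, 2.7, 3.5])

# Old-style call, as in skopt/space/transformers.py, fails on NumPy >= 1.24:
#   np.round(x).astype(np.int)
#   AttributeError: module 'numpy' has no attribute 'int'

# Working replacements, per NumPy's deprecation guidance:
rounded = np.round(x).astype(int)          # builtin int (same behavior as np.int)
rounded_64 = np.round(x).astype(np.int64)  # or an explicit precision
print(rounded, rounded_64)
```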

This clearly demonstrates that the integration is no longer usable today, even with older versions of its dependencies.


📚 Documentation preview 📚: https://pytorch-lightning--19445.org.readthedocs.build/en/19445/

cc @Borda @carmocca

@github-actions bot added the ci (Continuous Integration), pl (Generic label for PyTorch Lightning package), and dependencies (Pull requests that update a dependency file) labels on Feb 11, 2024
github-actions bot commented on Feb 11, 2024

⚡ Required checks status: All passing 🟢

Groups summary

🟢 pytorch_lightning: Tests workflow
Check ID Status
pl-cpu (macOS-11, lightning, 3.8, 1.13, oldest) success
pl-cpu (macOS-11, lightning, 3.10, 1.13) success
pl-cpu (macOS-11, lightning, 3.10, 2.1) success
pl-cpu (ubuntu-20.04, lightning, 3.8, 1.13, oldest) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 1.13) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 2.1) success
pl-cpu (windows-2022, lightning, 3.8, 1.13, oldest) success
pl-cpu (windows-2022, lightning, 3.10, 1.13) success
pl-cpu (windows-2022, lightning, 3.10, 2.1) success
pl-cpu (macOS-11, pytorch, 3.8, 1.13) success
pl-cpu (ubuntu-20.04, pytorch, 3.8, 1.13) success
pl-cpu (windows-2022, pytorch, 3.8, 1.13) success
pl-cpu (macOS-12, pytorch, 3.11, 2.0) success
pl-cpu (macOS-12, pytorch, 3.11, 2.1) success
pl-cpu (ubuntu-22.04, pytorch, 3.11, 2.0) success
pl-cpu (ubuntu-22.04, pytorch, 3.11, 2.1) success
pl-cpu (windows-2022, pytorch, 3.11, 2.0) success
pl-cpu (windows-2022, pytorch, 3.11, 2.1) success

These checks are required after the changes to src/lightning/pytorch/trainer/connectors/accelerator_connector.py, src/lightning/pytorch/utilities/imports.py, tests/tests_pytorch/trainer/connectors/test_accelerator_connector.py.

🟢 pytorch_lightning: Azure GPU
Check ID Status
pytorch-lightning (GPUs) (testing Lightning | latest) success
pytorch-lightning (GPUs) (testing PyTorch | latest) success

These checks are required after the changes to .azure/gpu-tests-pytorch.yml, src/lightning/pytorch/trainer/connectors/accelerator_connector.py, src/lightning/pytorch/utilities/imports.py, tests/tests_pytorch/trainer/connectors/test_accelerator_connector.py.

🟢 pytorch_lightning: Benchmarks
Check ID Status
lightning.Benchmarks success

These checks are required after the changes to src/lightning/pytorch/trainer/connectors/accelerator_connector.py, src/lightning/pytorch/utilities/imports.py.

🟢 pytorch_lightning: Docs
Check ID Status
docs-make (pytorch, doctest) success
docs-make (pytorch, html) success

These checks are required after the changes to src/lightning/pytorch/trainer/connectors/accelerator_connector.py, src/lightning/pytorch/utilities/imports.py.

🟢 mypy
Check ID Status
mypy success

These checks are required after the changes to requirements/_integrations/strategies.txt, src/lightning/pytorch/trainer/connectors/accelerator_connector.py, src/lightning/pytorch/utilities/imports.py.

🟢 install
Check ID Status
install-pkg (ubuntu-22.04, app, 3.8) success
install-pkg (ubuntu-22.04, app, 3.11) success
install-pkg (ubuntu-22.04, fabric, 3.8) success
install-pkg (ubuntu-22.04, fabric, 3.11) success
install-pkg (ubuntu-22.04, pytorch, 3.8) success
install-pkg (ubuntu-22.04, pytorch, 3.11) success
install-pkg (ubuntu-22.04, lightning, 3.8) success
install-pkg (ubuntu-22.04, lightning, 3.11) success
install-pkg (ubuntu-22.04, notset, 3.8) success
install-pkg (ubuntu-22.04, notset, 3.11) success
install-pkg (macOS-12, app, 3.8) success
install-pkg (macOS-12, app, 3.11) success
install-pkg (macOS-12, fabric, 3.8) success
install-pkg (macOS-12, fabric, 3.11) success
install-pkg (macOS-12, pytorch, 3.8) success
install-pkg (macOS-12, pytorch, 3.11) success
install-pkg (macOS-12, lightning, 3.8) success
install-pkg (macOS-12, lightning, 3.11) success
install-pkg (macOS-12, notset, 3.8) success
install-pkg (macOS-12, notset, 3.11) success
install-pkg (windows-2022, app, 3.8) success
install-pkg (windows-2022, app, 3.11) success
install-pkg (windows-2022, fabric, 3.8) success
install-pkg (windows-2022, fabric, 3.11) success
install-pkg (windows-2022, pytorch, 3.8) success
install-pkg (windows-2022, pytorch, 3.11) success
install-pkg (windows-2022, lightning, 3.8) success
install-pkg (windows-2022, lightning, 3.11) success
install-pkg (windows-2022, notset, 3.8) success
install-pkg (windows-2022, notset, 3.11) success

These checks are required after the changes to src/lightning/pytorch/trainer/connectors/accelerator_connector.py, src/lightning/pytorch/utilities/imports.py, requirements/_integrations/strategies.txt.


Thank you for your contribution! 💜

Note
This comment is automatically generated and updates for 60 minutes every 180 seconds. If you have any other questions, contact carmocca for help.

@awaelchli added this to the 2.3 milestone on Feb 11, 2024
codecov bot commented on Feb 11, 2024

Codecov Report

Merging #19445 (11cdffc) into master (47c8f4c) will decrease coverage by 34%.
The diff coverage is n/a.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #19445      +/-   ##
==========================================
- Coverage      82%      48%     -34%     
==========================================
  Files         452      444       -8     
  Lines       38118    37954     -164     
==========================================
- Hits        31258    18385   -12873     
- Misses       6860    19569   +12709     

@mergify bot added the ready (PRs ready to be merged) label on Feb 11, 2024
@carmocca merged commit 8d4768f into master on Feb 12, 2024 (90 checks passed)
@carmocca deleted the remove/bagua branch on February 12, 2024 at 19:58