Skip to content

Add support of tp_size != WORLD_SIZE to bloom#428

Merged
regisss merged 2 commits into
huggingface:mainfrom
NirSonnenschein:inference_tp_size_issue
Oct 23, 2023
Merged

Add support of tp_size != WORLD_SIZE to bloom#428
regisss merged 2 commits into
huggingface:mainfrom
NirSonnenschein:inference_tp_size_issue

Conversation

@NirSonnenschein
Copy link
Copy Markdown
Contributor

Add suport for cases where tp_size != world size
we encountered a case during develoment of DS chat step 3 where the tp size for inference needs to be different than the world size. The code in modeling_bloom.py assumes that tp size is always equal to world size in inference and this change allows to configure it differently.

What does this PR do?

Add suport for cases where tp_size != world size
we encountered a case during develoment of DS chat step 3 where the tp size for inference needs to be different than the world size. The code in modeling_bloom.py assumes that tp size is always equal to world size in inference and this change allows to configure it differently.

@NirSonnenschein NirSonnenschein requested a review from a user September 27, 2023 07:32
@vivekgoe vivekgoe added the run-test Run CI for PRs from external contributors label Sep 27, 2023
@HuggingFaceDocBuilderDev
Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

Copy link
Copy Markdown
Collaborator

@regisss regisss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting use case, does it bring any speedup to use a TP size different from the number of devices?

I left a couple of comments because we should not manage this in adapt_transformers_to_gaudi.

Besides, I guess we would need to modify this line in the text-generation example right?

ds_inference_kwargs["tensor_parallel"] = {"tp_size": world_size}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not manage it in this file but rather in modeling_bloom.py

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this issue only in bloom ? I guess it will also be for LLaMA . Should we do generic fix or a pointed ?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather have it defined in the modeling of all the models that are compatible with DeepSpeed-inference. It may be a bit redundant but that way we'll have better control over the scope of these changes.



class GaudiBloomForCausalLM(BloomForCausalLM):
inference_tp_size = None
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather use a TP_SIZE env variable than adding a class attribute here

@NirSonnenschein
Copy link
Copy Markdown
Contributor Author

Interesting use case, does it bring any speedup to use a TP size different from the number of devices?

I left a couple of comments because we should not manage this in adapt_transformers_to_gaudi.

Besides, I guess we would need to modify this line in the text-generation example right?

ds_inference_kwargs["tensor_parallel"] = {"tp_size": world_size}

Thanks @regisss. To clarify: we are using optimum-habana to improve performance (which it does) but the issue here is a functional one and not performance related. when using the code as is we see tensor size mismatches which result in errors when running using more than one card. the DS chat code doesn't expect this generation inference to run in tensor parallel (the DS chat model combines both inference and training of different models as part of its run).

regarding the line from text generation, I'm not sure as we did not modify it as part of our attempts to fix our issue, but this general assumption may not be true in all cases.
ds_inference_kwargs["tensor_parallel"] = {"tp_size": world_size}

regarding the current design to set the value using the adapt_transformers_to_gaudi,
these are some of the considerations raised in our internal discussions:

  1. we view the general assumption that is in the code which assumes tp_size = WORLD_SIZE is not always correct (particularly not in our case). however, this way of working is currently used by the existing working flows, and we didn't want to change this, just add support for our use case.

  2. our current workaround for this issue does use env variables, this has some innate downsides and a big upside:
    a. they are process global and there may be a flow where both the new scenario and the classic are both required which could
    complicate things in future.
    b. they can (and often are) set in an entirely different frame as the code reading them which can lead to weird issues if they
    are changed mid run or in a place not expected (e.g. threads).
    c. adding a global env variable like this sets up a new external configuration and needs to be very specific or it will be expected
    to affect other parts of Habana optimum / other models which we had no intention of modifying (e.g. if a TP_SIZE env
    variable was supported it might be expected to allow changing the TP size in other cases which is not our intention).

upside: no APIs are changed, and changes can be made in modeling_bloom only making it the most focused solution.

  1. adapt_transformers_to_gaudi is used as it is the main setup API called and is a "synchronous" way to configure this option without altering the current flows (we want all flows other than the special case to run as normal). once stored it can be used by the affected flows. this allows for a slightly more synchronous approach which is still fairly contained to the area that needs the change.
    we originally considered propagating this parameter normally through the function calls, but this would add it to some functions along the way to the alibi generation function where it may not make sense, and would require support for it in other uses cases which may complicate the code. hence the attempt to make the fix as focused as possible.

@regisss
Copy link
Copy Markdown
Collaborator

regisss commented Sep 28, 2023

@NirSonnenschein Ah okay, I understand better the use case!

Thanks for the summary of the pros and cons for both solutions. I mostly agree with you, except that I don't think adapt_transformers_to_gaudi is the place to enable this. Would overriding the __init__ method of BloomForCausalLM in GaudiBloomForCausalLM be a good solution for you? Something like:

def __init__(self, config: BloomConfig, inference_tp_size: int=None):
    super().__init__(config)
    self.inference_tp_size = inference_tp_size

Then, you would just need to give the argument inference_tp_size=N to the from_pretrained method when initializing the model. WDYT?

@NirSonnenschein NirSonnenschein force-pushed the inference_tp_size_issue branch from 32324d0 to 404257e Compare October 18, 2023 08:13
Add suport for cases where tp_size != world size
we encountered a case during develoment of DS chat step 3
where the tp size for inference needs to be different than
the world size. The  code in modeling_bloom.py assumes
that tp size is always equal to world size in inference and this
change allows to configure it differently.
@NirSonnenschein NirSonnenschein force-pushed the inference_tp_size_issue branch from 404257e to f5690e0 Compare October 18, 2023 08:17
@NirSonnenschein
Copy link
Copy Markdown
Contributor Author

@regisss
sorry for the delay, updated branch with the following changes:

  1. configuration is no longer done through adapt_transformers_to_gaudi and it is not modified
  2. value set via GaudiBloomForCausalLM directly
  3. enforced tp_size can be set to either world size or 1 to remove edge cases which may not be supported correctly

Copy link
Copy Markdown
Collaborator

@regisss regisss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I just spotted a typo in a method name, let's correct it and then it will be good to merge :)

Comment thread optimum/habana/transformers/models/bloom/modeling_bloom.py Outdated
@NirSonnenschein
Copy link
Copy Markdown
Contributor Author

thanks @regisss,
hopefully this can now move forward.

@regisss
Copy link
Copy Markdown
Collaborator

regisss commented Oct 23, 2023

@NirSonnenschein Can you also run the style formatter from the root of the repo as follows please?

pip install --upgrade black ruff
make style

Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>
@NirSonnenschein NirSonnenschein force-pushed the inference_tp_size_issue branch from 0f45d7f to 636f1a8 Compare October 23, 2023 16:10
@NirSonnenschein
Copy link
Copy Markdown
Contributor Author

@regisss thanks, it wanted a space after the comma, fixed.

@regisss regisss added run-test Run CI for PRs from external contributors and removed run-test Run CI for PRs from external contributors labels Oct 23, 2023
@regisss regisss merged commit 64b1499 into huggingface:main Oct 23, 2023
vivekgoe pushed a commit to vivekgoe/optimum-habana that referenced this pull request Oct 25, 2023
gplutop7 pushed a commit to HabanaAI/optimum-habana-fork that referenced this pull request Oct 15, 2025
Co-authored-by: tianyuan211 <77777807+tianyuan211@users.noreply.github.com>
Co-authored-by: Yuan Tian <tian.yuan@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

run-test Run CI for PRs from external contributors

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants