
Text-Generation Pipeline Example #526

Merged
regisss merged 13 commits into huggingface:main from sjagtap1803:textgen_pipeline on Jan 4, 2024

Conversation

@sjagtap1803
Contributor

What does this PR do?

Adds a custom text-generation pipeline to examples/text-generation. This pipeline supports single-card as well as multi-card runs (DeepSpeed).

Please let me know if additional scripts/documentation are required.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@sjagtap1803 sjagtap1803 requested a review from regisss as a code owner November 14, 2023 20:04
Collaborator

@regisss regisss left a comment

Thanks for adding this!
I left several comments.
We should also move the two test files to the tests folder. And we could create a text-generation subfolder there with these two files, test_text_generation_example.py and test_encoder_decoder_text_summarization.py.

Comment on lines +18 to +145
def get_repo_root(model_name_or_path, local_rank=-1, token=None):
    """
    Downloads the specified model checkpoint and returns the repository where it was downloaded.
    """
    if Path(model_name_or_path).is_dir():
        # If it is a local model, no need to download anything
        return model_name_or_path
    else:
        # Checks if online or not
        if is_offline_mode():
            if local_rank == 0:
                print("Offline mode: forcing local_files_only=True")

        # Only download PyTorch weights by default
        allow_patterns = ["*.bin"]

        # Download only on first process
        if local_rank in [-1, 0]:
            cache_dir = snapshot_download(
                model_name_or_path,
                local_files_only=is_offline_mode(),
                cache_dir=os.getenv("TRANSFORMERS_CACHE", None),
                allow_patterns=allow_patterns,
                max_workers=16,
                token=token,
            )
            if local_rank == -1:
                # If there is only one process, then the method is finished
                return cache_dir

        # Make all processes wait so that other processes can get the checkpoint directly from cache
        torch.distributed.barrier()

        return snapshot_download(
            model_name_or_path,
            local_files_only=is_offline_mode(),
            cache_dir=os.getenv("TRANSFORMERS_CACHE", None),
            allow_patterns=allow_patterns,
            token=token,
        )


def get_checkpoint_files(model_name_or_path, local_rank):
    """
    Gets the list of files for the specified model checkpoint.
    """
    cached_repo_dir = get_repo_root(model_name_or_path, local_rank)

    # Extensions: .bin | .pt
    # Creates a list of paths from all downloaded files in cache dir
    file_list = [str(entry) for entry in Path(cached_repo_dir).rglob("*.[bp][it][n]") if entry.is_file()]
    return file_list


def write_checkpoints_json(model_name_or_path, local_rank, checkpoints_json):
    """
    Dumps metadata into a JSON file for DeepSpeed-inference.
    """
    checkpoint_files = get_checkpoint_files(model_name_or_path, local_rank)
    if local_rank == 0:
        data = {"type": "ds_model", "checkpoints": checkpoint_files, "version": 1.0}
        with open(checkpoints_json, "w") as fp:
            json.dump(data, fp)


def model_on_meta(config):
    """
    Checks if load the model to meta.
    """
    return config.model_type in ["bloom", "llama"]


def get_optimized_model_name(config):
    from optimum.habana.transformers.generation import MODELS_OPTIMIZED_WITH_STATIC_SHAPES

    for model_type in MODELS_OPTIMIZED_WITH_STATIC_SHAPES:
        if model_type == config.model_type:
            return model_type

    return None


def model_is_optimized(config):
    """
    Checks if the given config belongs to a model in optimum/habana/transformers/models, which has a
    new input token_idx.
    """
    return get_optimized_model_name(config) is not None


def get_ds_injection_policy(config):
    """
    Defines injection policies for model parallelism via DeepSpeed.
    """
    model_type = get_optimized_model_name(config)
    policy = {}
    if model_type:
        if model_type == "bloom":
            from transformers.models.bloom.modeling_bloom import BloomBlock

            policy = {BloomBlock: ("self_attention.dense", "mlp.dense_4h_to_h")}

        if model_type == "opt":
            from transformers.models.opt.modeling_opt import OPTDecoderLayer

            policy = {OPTDecoderLayer: ("self_attn.out_proj", ".fc2")}

        if model_type == "gpt2":
            from transformers.models.gpt2.modeling_gpt2 import GPT2MLP

            policy = {GPT2MLP: ("attn.c_proj", "mlp.c_proj")}

        if model_type == "gptj":
            from transformers.models.gptj.modeling_gptj import GPTJBlock

            policy = {GPTJBlock: ("attn.out_proj", "mlp.fc_out")}

        if model_type == "gpt_neox":
            from transformers.models.gpt_neox.modeling_gpt_neox import GPTNeoXLayer

            policy = {GPTNeoXLayer: ("attention.dense", "mlp.dense_4h_to_h")}

        if model_type == "llama":
            from transformers.models.llama.modeling_llama import LlamaDecoderLayer

            policy = {LlamaDecoderLayer: ("self_attn.o_proj", "mlp.down_proj")}

    return policy
Collaborator

Contributor Author

This change has been made.

self.generation_config.force_words_ids = None

# Define stopping criteria based on eos token id
self.stopping_criteria = StoppingCriteriaList([CustomStoppingCriteria(self.generation_config.eos_token_id)])
Collaborator

I'm not sure we need this custom stopping criterion. Why not set ignore_eos to False in the generation config?

Contributor Author

Yes, the stopping criteria has been removed.

        return self.stop_token_id in input_ids[0]


class GaudiTextGenerationPipeline:
Collaborator

Contributor Author

This change has been made.


self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

model_dtype = torch.bfloat16
Collaborator

We should be able to specify the dtype as an argument

Contributor Author

Could you please provide an example of this argument? Wondering if we could use a boolean for bf16 or a string for different dtypes...

Collaborator

Contributor Author

Thanks for sharing this link. I added a use_bf16 argument to the pipeline constructor.
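To make the two options discussed in this thread concrete, here is a minimal sketch of the boolean variant. It is only an illustration: the `--bf16` flag name and the `parse_dtype_name` helper are assumptions, not code from this PR. The helper returns a dtype name rather than a `torch.dtype` so that it runs without PyTorch installed.

```python
import argparse


def parse_dtype_name(argv=None):
    """Return the name of the dtype the model should be loaded in."""
    parser = argparse.ArgumentParser()
    # Hypothetical flag: a boolean switch that selects bfloat16 when present.
    parser.add_argument("--bf16", action="store_true", help="Load the model in bfloat16.")
    args = parser.parse_args(argv)
    # Fall back to full precision when the flag is absent.
    return "bfloat16" if args.bf16 else "float32"


print(parse_dtype_name(["--bf16"]))
print(parse_dtype_name([]))
```

With PyTorch available, `getattr(torch, parse_dtype_name())` would turn the name into the matching `torch.dtype`. A string argument with a fixed set of choices would be the alternative if more than two dtypes had to be supported.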

            else:
                get_repo_root(model_name_or_path, local_rank=self.local_rank)
                # placement on hpu if meta tensors are not supported
                with deepspeed.OnDevice(dtype=model_dtype, device="hpu"):
Collaborator

Note that with device="hpu", models that cannot be loaded on the meta device may trigger an out-of-memory error if they are spread across several devices.

Contributor Author

Thanks for pointing this out. I have changed it to device="cpu".

Comment thread examples/text-generation/textgen_pipeline/pipeline.py Outdated
                return cache_dir

        # Make all processes wait so that other processes can get the checkpoint directly from cache
        torch.distributed.barrier()
Collaborator

This must fail when not using DeepSpeed, no? torch.distributed is not initialized in that case.

Contributor Author

Not applicable anymore as methods are imported from checkpoint_utils.py
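For reference, the failure mode raised in this thread can be avoided by guarding the barrier. The sketch below is an illustration, not the code that was merged: the helper name `barrier_if_distributed` is hypothetical, while `is_available()`, `is_initialized()` and `barrier()` are existing `torch.distributed` functions. The import is wrapped so the snippet also runs where PyTorch is not installed.

```python
try:
    import torch.distributed as dist
except ImportError:
    # PyTorch is not installed, so there is nothing to synchronize.
    dist = None


def barrier_if_distributed():
    """Run a barrier only if a process group exists; return True if it ran."""
    if dist is not None and dist.is_available() and dist.is_initialized():
        dist.barrier()
        return True
    # Single-process run: calling barrier() here would raise an error.
    return False


print(barrier_if_distributed())
```

In a single-card run no process group has been initialized, so the helper skips the barrier instead of raising.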


# Used for padding input to fixed length
self.tokenizer.padding_side = "left"
self.max_padding_length = kwargs.get("max_padding_length", self.model.config.max_position_embeddings)
Collaborator

Suggested change
- self.max_padding_length = kwargs.get("max_padding_length", self.model.config.max_position_embeddings)
+ self.max_padding_length = kwargs.get("max_padding_length", self.tokenizer.model_max_length)

Contributor Author

Tried your suggestion but the tokenizer threw an error.

        Function to compile computation graphs and synchronize hpus.
        """
        for _ in range(3):
            self("Here is my prompt")
Collaborator

We will still get compilations if the prompt doesn't have exactly the same size as this toy one, no?

Contributor Author

Yes, that's correct. The input prompt is left-padded so the computation graph does not change.
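The effect of left-padding on input shapes can be illustrated with a small pure-Python sketch. It is a stand-in for what the tokenizer does when `padding_side = "left"` and a fixed maximum length are set, not the pipeline's own code: the `left_pad` helper, the token ids and the target length of 8 are all made up for the example.

```python
def left_pad(token_ids, target_length, pad_token_id):
    """Left-pad (or left-truncate) a list of token ids to exactly target_length."""
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    if len(token_ids) >= target_length:
        # Keep the most recent tokens when the prompt is too long.
        return token_ids[-target_length:]
    padding = [pad_token_id] * (target_length - len(token_ids))
    return padding + token_ids


# Two prompts of different lengths end up with the same shape.
short = left_pad([11, 12, 13], 8, pad_token_id=0)
longer = left_pad([21, 22, 23, 24, 25], 8, pad_token_id=0)
print(len(short), len(longer))
```

Because every prompt is brought to the same length, the warmup prompt and later prompts present identically shaped inputs.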

    @classmethod
    def setUpClass(self):
        """Overrides setUpClass from unittest to create artifacts for testing"""
        self.base_command = ["python", "../../gaudi_spawn.py", "--use_deepspeed", "--world_size"]
Collaborator

"../.." will depend on where the test is launched from. I would prefer to have an absolute path here. You can for example use pathlib as follows:

Path(__file__).parent.parent.resolve()

Contributor Author

Will work on this next.
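A possible shape for that change, following the pathlib suggestion above, is sketched here. The `build_base_command` helper is hypothetical, and the assumption that `gaudi_spawn.py` sits two directory levels above the test file simply mirrors the suggested snippet; the real layout may differ.

```python
from pathlib import Path


def build_base_command(test_file, world_size):
    """Build the launcher command from a path anchored at the test file."""
    # Assumed layout: the launcher lives in the parent of the test file's folder.
    launcher = Path(test_file).resolve().parent.parent / "gaudi_spawn.py"
    return ["python", str(launcher), "--use_deepspeed", "--world_size", str(world_size)]


# Inside a test module one would pass __file__; a literal path is used here.
command = build_base_command("/tmp/examples/tests/test_pipeline.py", 8)
print(command)
```

Since the launcher path is resolved from the test file rather than from the current working directory, the command no longer depends on where the tests are launched from.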

@sjagtap1803
Contributor Author

sjagtap1803 commented Nov 15, 2023

Thanks for reviewing my code. I have replied to all your comments.
Regarding the tests directory, should I create a new directory text-generation, move my test files there and make other required changes? I believe the absolute paths in test_text_generation_example.py and test_encoder_decoder_text_summarization.py will also have to be changed.

@libinta libinta added the run-test Run CI for PRs from external contributors label Dec 5, 2023
Collaborator

@regisss regisss left a comment

I left a couple of comments. The main one is about using the methods defined in https://github.com/huggingface/optimum-habana/blob/main/examples/text-generation/utils.py to avoid duplicated code with the text-generation example.
I think the main method of the example is a good base to start with:

    An end-to-end text-generation pipeline that can be used to initialize LangChain classes. It supports both single-hpu and multi-hpu inference.
    """

    def __init__(self, model_name_or_path=None, use_bf16=True, **kwargs):
Collaborator

Suggested change
- def __init__(self, model_name_or_path=None, use_bf16=True, **kwargs):
+ def __init__(self, model_name_or_path=None, bf16=True, **kwargs):

Let's call this arg bf16 to stay consistent with other parts of the codebase

Comment on lines +32 to +137
        self.use_deepspeed = "deepspeed" in os.environ["_"]

        if self.use_deepspeed:
            world_size, _, self.local_rank = initialize_distributed_hpu()

            import deepspeed

            # Initialize Deepspeed processes
            deepspeed.init_distributed(dist_backend="hccl")

        self.task = "text-generation"
        self.device = "hpu"

        # Tweak generation so that it runs faster on Gaudi
        adapt_transformers_to_gaudi()
        set_seed(27)

        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

        if self.use_deepspeed or use_bf16:
            model_dtype = torch.bfloat16
        else:
            model_dtype = torch.float

        if self.use_deepspeed:
            config = AutoConfig.from_pretrained(model_name_or_path)
            is_optimized = model_is_optimized(config)
            load_to_meta = model_on_meta(config)

            if load_to_meta:
                # Construct model with fake meta tensors, later will be replaced on devices during ds-inference ckpt load
                with deepspeed.OnDevice(dtype=model_dtype, device="meta"):
                    model = AutoModelForCausalLM.from_config(config, torch_dtype=model_dtype)
            else:
                get_repo_root(model_name_or_path, local_rank=self.local_rank)
                # placement on cpu if meta tensors are not supported
                with deepspeed.OnDevice(dtype=model_dtype, device="cpu"):
                    model = AutoModelForCausalLM.from_pretrained(model_name_or_path, torch_dtype=model_dtype)
            model = model.eval()

            # Initialize the model
            ds_inference_kwargs = {"dtype": model_dtype}
            ds_inference_kwargs["tensor_parallel"] = {"tp_size": world_size}
            ds_inference_kwargs["enable_cuda_graph"] = True

            if load_to_meta:
                # model loaded to meta is managed differently
                checkpoints_json = "checkpoints.json"
                write_checkpoints_json(model_name_or_path, self.local_rank, checkpoints_json)

            # Make sure all devices/nodes have access to the model checkpoints
            torch.distributed.barrier()

            ds_inference_kwargs["injection_policy"] = get_ds_injection_policy(config)
            if load_to_meta:
                ds_inference_kwargs["checkpoint"] = checkpoints_json

            model = deepspeed.init_inference(model, **ds_inference_kwargs)
            model = model.module
        else:
            get_repo_root(model_name_or_path)
            model = AutoModelForCausalLM.from_pretrained(model_name_or_path, torch_dtype=model_dtype)
            model = model.eval().to(self.device)
            is_optimized = model_is_optimized(model.config)
            model = wrap_in_hpu_graph(model)

        self.model = model

        # Used for padding input to fixed length
        self.tokenizer.padding_side = "left"
        self.max_padding_length = kwargs.get("max_padding_length", self.model.config.max_position_embeddings)

        # Define config params for llama models
        if self.model.config.model_type == "llama":
            self.model.generation_config.pad_token_id = 0
            self.model.generation_config.bos_token_id = 1
            self.model.generation_config.eos_token_id = 2
            self.tokenizer.bos_token_id = self.model.generation_config.bos_token_id
            self.tokenizer.eos_token_id = self.model.generation_config.eos_token_id
            self.tokenizer.pad_token_id = self.model.generation_config.pad_token_id
            self.tokenizer.pad_token = self.tokenizer.decode(self.tokenizer.pad_token_id)
            self.tokenizer.eos_token = self.tokenizer.decode(self.tokenizer.eos_token_id)
            self.tokenizer.bos_token = self.tokenizer.decode(self.tokenizer.bos_token_id)

        # Applicable to models that do not have pad tokens
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
            self.model.generation_config.pad_token_id = self.model.generation_config.eos_token_id

        # Edit generation configuration based on input arguments
        self.generation_config = copy.deepcopy(self.model.generation_config)
        self.generation_config.max_new_tokens = kwargs.get("max_new_tokens", 100)
        self.generation_config.use_cache = kwargs.get("use_kv_cache", True)
        self.generation_config.static_shapes = is_optimized
        self.generation_config.do_sample = kwargs.get("do_sample", False)
        self.generation_config.num_beams = kwargs.get("num_beams", 1)
        self.generation_config.temperature = kwargs.get("temperature", 1.0)
        self.generation_config.top_p = kwargs.get("top_p", 1.0)
        self.generation_config.repetition_penalty = kwargs.get("repetition_penalty", 1.0)
        self.generation_config.num_return_sequences = kwargs.get("num_return_sequences", 1)
        self.generation_config.bad_words_ids = None
        self.generation_config.force_words_ids = None
        self.generation_config.ignore_eos = False

        if self.use_deepspeed:
            torch.distributed.barrier()
Collaborator

Since run_generation.py was recently refactored, let's try to use the methods defined in https://github.com/huggingface/optimum-habana/blob/main/examples/text-generation/utils.py here.
That way we avoid duplicated code.

@sjagtap1803
Contributor Author

> I left a couple of comments. The main one is about using the methods defined in https://github.com/huggingface/optimum-habana/blob/main/examples/text-generation/utils.py to avoid duplicated code with the text-generation example. I think the main method of the example is a good base to start with:

I refactored the pipeline scripts to incorporate code from https://github.com/huggingface/optimum-habana/blob/main/examples/text-generation/utils.py wherever possible. Please let me know if you have any questions, comments or suggestions.

Collaborator

@libinta libinta left a comment

can you add a README.md with examples?

@sjagtap1803
Contributor Author

> can you add a README.md with examples?

Yes, I added a README with instructions and example commands.

@sjagtap1803 sjagtap1803 requested a review from libinta December 19, 2023 22:31
Comment thread examples/text-generation/textgen_pipeline/pipeline.py
Comment thread examples/text-generation/textgen_pipeline/pipeline.py
Comment thread examples/text-generation/textgen_pipeline/pipeline.py
@sjagtap1803 sjagtap1803 requested a review from ssarkar2 December 20, 2023 15:25
@regisss regisss added run-test Run CI for PRs from external contributors and removed run-test Run CI for PRs from external contributors labels Dec 22, 2023
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment thread examples/text-generation/textgen_pipeline/README.md
Comment thread examples/text-generation/textgen_pipeline/README.md
Comment thread examples/text-generation/textgen_pipeline/pipeline.py Outdated
Comment thread examples/text-generation/textgen_pipeline/run_pipeline.py Outdated
@regisss regisss added run-test Run CI for PRs from external contributors and removed run-test Run CI for PRs from external contributors labels Jan 4, 2024
Collaborator

@regisss regisss left a comment

LGTM!
I just spotted one typo and I'll merge after it is corrected 🙂

Comment thread examples/text-generation/text-generation-pipeline/README.md Outdated
@regisss regisss added run-test Run CI for PRs from external contributors and removed run-test Run CI for PRs from external contributors labels Jan 4, 2024
@regisss regisss merged commit e419599 into huggingface:main Jan 4, 2024
jychen21 pushed a commit to jychen21/optimum-habana that referenced this pull request Feb 27, 2024
Labels

run-test Run CI for PRs from external contributors

5 participants