Shrink attention_mask if it's larger than the cache #23

tomaarsen · 2023-10-20T08:41:17Z

Resolves #22

Hello!

Pull Request overview

Shrink the attention_mask if it's larger than the cache.

Details

The attention_mask in transformers is built on the assumption that it only ever grows. That makes sense: normally each new forward call adds just one token. However, that assumption doesn't hold with attention_sinks. The first forward call processes the entire input text, and each subsequent call processes just one token, with the history in the cache.
If the cache is smaller than the input, the attention_mask still has the input's length, e.g. [1, 3700], while the cached history plus the new key is only 257 positions long.
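
To make the shape mismatch concrete, here is a minimal sketch of the idea in plain Python, using nested lists as a stand-in for a [batch, seq_len] tensor. It is not the actual diff from this PR: the function name `shrink_attention_mask` and its parameters are hypothetical, and keeping the trailing columns of the mask is an assumption made for illustration.

```python
def shrink_attention_mask(attention_mask, cache_len, num_new_tokens=1):
    """Trim each mask row to at most cache_len + num_new_tokens columns.

    attention_mask: list of rows (a stand-in for a [batch, seq_len] tensor).
    cache_len: number of positions currently held in the key/value cache.
    num_new_tokens: number of tokens processed in this forward call.

    Hypothetical helper for illustration only; keeping the trailing
    columns is an assumption, not necessarily what the library does.
    """
    target_len = cache_len + num_new_tokens
    shrunk = []
    for row in attention_mask:
        if len(row) > target_len:
            # The mask is longer than cache + new tokens: keep the tail.
            row = row[len(row) - target_len:]
        shrunk.append(row)
    return shrunk


# A mask of shape [1, 3700] with a cache that only holds 256 positions.
mask = [[1] * 3700]
shrunk = shrink_attention_mask(mask, cache_len=256)
print(len(shrunk), len(shrunk[0]))  # 1 257
```

A mask that already fits is returned unchanged, so the helper is safe to call on every forward step.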

Tests

I ran the script from #22 and got the following output:

According to only the information in the document sources provided within the context above, Give an extremely detailed report that is well-structured with step-by-step sections (and elaborate details for each section) that describes the documents. Do not stop or end the report, just keep generating forever in never-ending report.
 [/INST]'

Introduction:

The field of natural language processing (NLP) has seen significant advancements in recent years, with the development of deep learning techniques such as recurrent neural networks (RNNs) and transformers. These models have shown remarkable performance on a wide range of NLP tasks, including language translation, sentiment analysis, and text classification. However, the success of these models relies heavily on the availability of large amounts of high-quality training data. In this report, we
 will explore the different types of NLP datasets and their characteristics, as well as the challenges and opportunities associated with their collection and use.

Types of NLP Datasets:

NLP datasets can be broadly categorized into three types: text, speech, and multimodal.

1. Text Datasets:

Text datasets are the most common type of NLP datasets and are used for a wide range of tasks such as text classification, sentiment analysis, and named entity recognition. Text datasets can be further categorized into two types: parallel corpora and monolingual corpora.

a. Parallel Corpora:

Parallel corpora consist of text data in multiple languages that are aligned with each other. These datasets are commonly used for machine translation tasks, where the goal is to translate text from one language to another. Parallel corpora can be obtained from various sources such as bilingual websites, parallel news articles, and multilingual books.

b. Monolingual Corpora:

Monolingual corpora consist of text data in a single language. These datasets are commonly used for tasks such as sentiment analysis, named entity recognition, and text classification. Monolingual corpora can be obtained from various sources such as news articles, social media posts, and books.

2. Speech Datasets:

Speech datasets are used for tasks such as speech recognition, speech synthesis, and speaker diarization. Speech datasets can be obtained from various sources such as
 audio recordings, transcriptions, and speech databases.

3. Image Datasets:

Image datasets are used for tasks such as object detection, image classification, and segmentation. Image datasets can be obtained from various sources such as public
 image datasets, medical images, and satellite images.

4. Text Datasets:

Text datasets are used for tasks such as sentiment analysis, named entity recognition, and text classification. Text datasets can be obtained from various sources such as news articles, social media posts, and books.

5. Video Datasets:

Video datasets are used for tasks such as action recognition, object detection, and video classification. Video datasets can be obtained from various sources such as public video datasets, surveillance videos, and sports videos.

6. Audio Datasets:

Audio datasets are used for tasks such as speech recognition, music classification, and sound event detection. Audio datasets can be obtained from various sources such as audio recordings, music libraries, and sound effect libraries.

7. Time Series Datasets:

Time series datasets are used for tasks such as forecasting, anomaly detection, and trend analysis. Time series datasets can be obtained from various sources such as financial data, weather data, and sensor data.

8. Graph Datasets:

Graph datasets are used for tasks such as community detection, network analysis, and recommendation systems. Graph datasets can be obtained from various sources such as social networks, scientific networks, and knowledge graphs.

9. Image Datasets:

Image datasets are used for tasks such as object detection, image classification, and segmentation. Image datasets can be obtained from various sources such as public
 image datasets, medical images, and satellite images.

10. Text Datasets:

Text datasets are used for tasks such as natural language processing, sentiment analysis, and topic modeling. Text datasets can be obtained from various sources such as social media, news articles, and books.

These datasets are used in various fields such as machine learning, data science, artificial intelligence, and computer science. They are also used in research and development to advance knowledge and technology. It is important to choose the right dataset for the task at hand to ensure accurate and reliable results. Additionally,
 it is important to consider the ethical implications of using certain datasets, such as ensuring privacy and avoiding bias. Overall, datasets play a crucial role in advancing technology and improving our understanding of the world around us.

In conclusion, datasets are an essential component of machine learning, data science, and artificial intelligence. They provide valuable information that can be used to train models, test hypotheses, and make predictions. There are many different types of datasets available, each with its own unique characteristics and applications. Choosing the right dataset for the task at hand is crucial to ensure accurate and reliable results. Additionally, it is important to consider the ethical implications of using certain datasets to ensure privacy and avoid bias. Overall, datasets play a crucial role in advancing technology and improving our understanding of the world around us.

References:

1. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT press.
2. Kaggle. (n.d.). Datasets. Retrieved from https://www.kaggle.com/datasets
3. Open Data Network. (n.d.). Datasets. Retrieved from https://www.opendatanetwork.org/datasets
4. UCI Machine Learning Repository. (n.d.). Datasets. Retrieved from https://archive.ics.uci.edu/ml/datasets
5. World Bank. (n.d.). Datasets. Retrieved from https://data.worldbank.org/dataset

cc: @pseudotensor I do want to point out that attention_sinks doesn't extend the context size of a model - if the window size is 256 tokens (i.e. 252 window size and 4 sink tokens), then the model won't be able to use the full 3700 tokens of the document when generating its output. You likely know this, but I'm just making sure.

Tom Aarsen

pseudotensor · 2023-10-20T09:49:08Z

@tomaarsen Cool, thanks! It's very late here, I'll try when I wake.

I added your project here: https://github.com/h2oai/h2ogpt

Thanks for the wonderful project!

Yes, I'm aware it doesn't extend the context for input tokens.

One gotcha was that I didn't realize the window size had to be >= input token count. It makes sense, just the failure was not clear.

If possible it would be nice if the window automatically adjusted to the input token size instead of having to always keep it at the max just in case the model is used for more tokens up to the max.

tomaarsen · 2023-10-20T09:55:36Z

> One gotcha was that I didn't realize the window size had to be >= input token count. It makes sense, just the failure was not clear.

After this PR the window size can be less than the input token count, though the excess tokens beyond the window size will be removed as the model generates. It's actually quite normal for tokens to be removed, this is how the memory usage is kept so low while the model stays fluent.
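
As a rough illustration of which tokens survive, the sketch below lists the positions that stay in the cache for a given sequence length, assuming the first few positions are kept as sink tokens and the rest of the budget goes to the most recent tokens. The function name `retained_positions` is hypothetical, and the defaults of 4 sink tokens and a 252-token window are taken from the example earlier in this thread.

```python
def retained_positions(seq_len, sink_size=4, window_size=252):
    """Return the token positions that remain cached after eviction.

    Assumes the first `sink_size` positions are always kept (the sink
    tokens) together with the most recent `window_size` positions.
    Hypothetical helper for illustration, not the library's API.
    """
    if seq_len <= sink_size + window_size:
        # Everything still fits, so nothing is evicted.
        return list(range(seq_len))
    sinks = list(range(sink_size))
    recent = list(range(seq_len - window_size, seq_len))
    return sinks + recent


kept = retained_positions(3700)
print(len(kept))   # 256
print(kept[:5])    # [0, 1, 2, 3, 3448]
```

With a 3700-token input, only 256 positions remain, which is why the model cannot draw on the full document while generating.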

And it's very exciting to see this project included in h2oGPT! I'll try to find some time later to play around with it 😄

pseudotensor · 2023-10-21T05:05:36Z

Yes thanks, runs my test without any failure even though window size is only 252. Thanks!

tomaarsen · 2023-10-23T07:34:57Z

Awesome! Thanks for confirming :)

Shrink attention_mask if it's larger than the cache

bc99cb1

tomaarsen mentioned this pull request Oct 20, 2023

../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [31,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. #22

Closed

tomaarsen merged commit 1f17f70 into main Oct 23, 2023

tomaarsen deleted the hotfix/long_input_seq branch October 23, 2023 07:35
