
Harrison/tiktoken spec #964

Merged
merged 2 commits into master from harrison/tiktoken-spec on Feb 10, 2023
Conversation

hwchase17
Contributor

No description provided.

jamescalam and others added 2 commits February 9, 2023 23:16
This PR allows the `allowed_special` and `disallowed_special` parameters to be used (see issue #923). Their defaults are `set()` and `"all"` respectively, [as in the tiktoken
source](https://github.com/openai/tiktoken/blob/main/tiktoken/core.py#L74).

This is needed because tiktoken raises an error when a GPT special token appears in text being encoded (see issue #923); passing these special-token parameters is the only way to get around it.

Also added the same functionality for the `TokenTextSplitter`, so now
this will work:

```python
from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter.from_tiktoken_encoder(
    encoding_name=encoder_name,  # name of a tiktoken encoding, e.g. "gpt2"
    chunk_size=300,
    chunk_overlap=50,
)
text_splitter.split_text(
    some_text,  # any string, possibly containing special tokens
    disallowed_special=(),
)
```
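For illustration, here is a minimal pure-Python sketch of the semantics these parameters express. This is not tiktoken's implementation; `check_special` and `SPECIAL_TOKENS` are hypothetical names, and the sketch only models the pre-encoding check described above (with defaults, any special token in the input raises; `disallowed_special=()` or an `allowed_special` entry suppresses the error):

```python
# Hypothetical sketch of the special-token check performed before encoding.
# Names here are illustrative, not tiktoken's actual API.
SPECIAL_TOKENS = {"<|endoftext|>"}

def check_special(text, allowed_special=frozenset(), disallowed_special="all"):
    """Raise ValueError if a disallowed special token appears in text."""
    if disallowed_special == "all":
        # By default every special token is disallowed unless explicitly allowed.
        disallowed = SPECIAL_TOKENS - set(allowed_special)
    else:
        disallowed = set(disallowed_special)
    for token in disallowed:
        if token in text:
            raise ValueError(f"Encountered disallowed special token {token!r}")

# With the defaults, text containing a special token raises:
try:
    check_special("hello <|endoftext|> world")
    raised = False
except ValueError:
    raised = True

# Passing disallowed_special=() or allowing the token avoids the error:
check_special("hello <|endoftext|> world", disallowed_special=())
check_special("hello <|endoftext|> world", allowed_special={"<|endoftext|>"})
```

The sketch mirrors the defaults quoted from the tiktoken source (`allowed_special=set()`, `disallowed_special="all"`), which is why plain text encodes fine while text containing `<|endoftext|>` needs one of the two overrides.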
@hwchase17 hwchase17 merged commit ba54d36 into master Feb 10, 2023
@hwchase17 hwchase17 deleted the harrison/tiktoken-spec branch February 10, 2023 07:30
@blob42 blob42 mentioned this pull request Feb 21, 2023
zachschillaci27 pushed a commit to zachschillaci27/langchain that referenced this pull request Mar 8, 2023
Co-authored-by: James Briggs <[email protected]>
Co-authored-by: Harrison Chase <[email protected]>