
Harrison/tiktoken spec #964

Merged
merged 2 commits into master from harrison/tiktoken-spec on Feb 10, 2023
Conversation

hwchase17
Contributor

No description provided.

jamescalam and others added 2 commits February 9, 2023 23:16
This PR allows the `allowed_special` and `disallowed_special` parameters to be used (see issue #923). Their defaults are `set()` and `"all"` respectively, [as in the tiktoken
source](https://github.com/openai/tiktoken/blob/main/tiktoken/core.py#L74).

This is needed because tiktoken raises an error when a GPT special token appears in text being encoded (see issue #923); passing these special-token parameters is the only way to get around it.

Also added the same functionality for the `TokenTextSplitter`, so now
this will work:

```python
from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter.from_tiktoken_encoder(
    encoding_name=encoder_name,  # name of a tiktoken encoding, e.g. "gpt2"
    chunk_size=300,
    chunk_overlap=50,
)
text_splitter.split_text(
    some_text,  # any string, possibly containing special tokens
    disallowed_special=(),
)
```
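For illustration, here is a minimal pure-Python sketch of the semantics these parameters express. This is not tiktoken's implementation; `check_special` and `SPECIAL_TOKENS` are hypothetical names, and the sketch only models the pre-encoding check described above (with defaults, any special token in the input raises; `disallowed_special=()` or an `allowed_special` entry suppresses the error):

```python
# Hypothetical sketch of the special-token check performed before encoding.
# Names here are illustrative, not tiktoken's actual API.
SPECIAL_TOKENS = {"<|endoftext|>"}

def check_special(text, allowed_special=frozenset(), disallowed_special="all"):
    """Raise ValueError if a disallowed special token appears in text."""
    if disallowed_special == "all":
        # By default every special token is disallowed unless explicitly allowed.
        disallowed = SPECIAL_TOKENS - set(allowed_special)
    else:
        disallowed = set(disallowed_special)
    for token in disallowed:
        if token in text:
            raise ValueError(f"Encountered disallowed special token {token!r}")

# With the defaults, text containing a special token raises:
try:
    check_special("hello <|endoftext|> world")
    raised = False
except ValueError:
    raised = True

# Passing disallowed_special=() or allowing the token avoids the error:
check_special("hello <|endoftext|> world", disallowed_special=())
check_special("hello <|endoftext|> world", allowed_special={"<|endoftext|>"})
```

The sketch mirrors the defaults quoted from the tiktoken source (`allowed_special=set()`, `disallowed_special="all"`), which is why plain text encodes fine while text containing `<|endoftext|>` needs one of the two overrides.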
@hwchase17 hwchase17 merged commit ba54d36 into master Feb 10, 2023
@hwchase17 hwchase17 deleted the harrison/tiktoken-spec branch February 10, 2023 07:30
@blob42 blob42 mentioned this pull request Feb 21, 2023
zachschillaci27 pushed a commit to zachschillaci27/langchain that referenced this pull request Mar 8, 2023
Co-authored-by: James Briggs <[email protected]>
Co-authored-by: Harrison Chase <[email protected]>