Commit c1eb2f7

retr0reg authored and qnixsynapse committed
vocab : prevent tokenizer overflow (ggml-org#14301)
* vocab : prevent stack overflow in tokenize
* vocab : return error instead of aborting on oversized token count
* vocab : INT32_MIN from llama_tokenize on overflow
1 parent 65c3447 commit c1eb2f7

File tree

2 files changed

+6
-0
lines changed


include/llama.h

Lines changed: 1 addition & 0 deletions
@@ -1088,6 +1088,7 @@ extern "C" {
     /// @param tokens The tokens pointer must be large enough to hold the resulting tokens.
     /// @return Returns the number of tokens on success, no more than n_tokens_max
     /// @return Returns a negative number on failure - the number of tokens that would have been returned
+    /// @return Returns INT32_MIN on overflow (e.g., tokenization result size exceeds int32_t limit)
     /// @param add_special Allow to add BOS and EOS tokens if model is configured to do so.
     /// @param parse_special Allow tokenizing special and/or control tokens which otherwise are not exposed and treated
     ///                      as plaintext. Does not insert a leading space.
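The header now documents three return conventions for llama_tokenize: a positive token count on success, a negated required count when the caller's buffer is too small, and INT32_MIN on overflow. A minimal caller-side sketch of handling all three cases — `fake_tokenize` and `classify` are hypothetical names standing in for the real API so the example is self-contained:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <string>

// Hypothetical stub standing in for llama_tokenize: it follows the same
// return convention documented above, pretending the input always needs
// 7 tokens. Not part of llama.cpp.
int32_t fake_tokenize(int32_t n_tokens_max) {
    const int32_t needed = 7;
    if (n_tokens_max < needed) {
        return -needed; // buffer too small: caller should retry with -result slots
    }
    return needed;      // success: number of tokens written
}

// Classify a result according to the three documented cases.
std::string classify(int32_t res) {
    if (res == std::numeric_limits<int32_t>::min()) {
        return "overflow"; // token count does not fit in int32_t at all
    }
    if (res < 0) {
        return "resize";   // grow the buffer to -res tokens and call again
    }
    return "ok";           // res tokens were produced
}
```

Note that the INT32_MIN case must be checked before the generic `res < 0` branch: negating INT32_MIN is undefined behavior for a 32-bit signed integer, so it cannot be treated as an ordinary "required size" result.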

src/llama-vocab.cpp

Lines changed: 5 additions & 0 deletions
@@ -3074,6 +3074,11 @@ int32_t llama_vocab::tokenize(
                     bool add_special,
                     bool parse_special) const {
     auto res = tokenize(std::string(text, text_len), add_special, parse_special);
+    if (res.size() >= static_cast<size_t>(std::numeric_limits<int32_t>::max())) {
+        LLAMA_LOG_ERROR("%s: tokenization result size %zu exceeds int32_t limit\n", __func__, res.size());
+        return std::numeric_limits<int32_t>::min();
+    }
+
     if (n_tokens_max < (int) res.size()) {
         // LLAMA_LOG_ERROR("%s: too many tokens\n", __func__);
         return -((int) res.size());

0 commit comments