I'm afraid that it's somewhat challenging to present a minimal working example of the problem, but I hope the following snippet offers some useful context:
import asyncstdlib
import lmql

@lmql.query
def annotate(prefix, words, annotator, initial, terminal):
    """lmql
    annotator.digest(prefix)
    "{prefix}"
    async for i, (word, sep) in asyncstdlib.enumerate(words, start=1):
        annotator.digest(word, sep)
        tags, max_tokens = annotator.get_tags()
        # bail out before the next tag could push us past the model's context window
        if annotator.digested + max_tokens >= annotator.max_digestable:
            break
        elif i == 2 and initial:
            annotator.progress_region()
        elif i == len(words) and terminal:
            annotator.progress_region()
        "{word}[@annotator.postprocess TAG]{sep}" where TAG in tags
    """
I am using LMQL for a span annotation task in which the generative model does not need to add any text except for a handful of meta-tokens marking the opening and closing of spans. LMQL gives me big savings here, since the vast majority of the text passes through unchanged and the added tokens always come from a very narrow set.

However, I run into the following problem: even when the passage I submit to the LMQL server is shorter than the model's context length, unless it occupies <=50% of the context there are edge cases in which I get a CUDA indexing error because LMQL supplies the model with too large a prompt. This restriction feels wasteful and unnecessarily strict, since during training I know I am always leaving enough spare context for the correct span annotations to fit. I want the prompt learning technique I am using (which adjusts prefix in the code above) to adapt to using only as much of the context as it actually needs to make the LMQL model reproduce the correct spans.

Concretely, I would like LMQL to check dynamically whether the context has been filled and exit gracefully when it has, rather than passing on the raw CUDA error. I expect this to be a frequent problem for LMQL users working with long documents in this setting, and more generally for anyone who actually runs up against context size limits. They can work around it the way I did, but that leads to nasty and pointless-seeming code duplication: the user-side LMQL code has to keep a duplicate tokenizer around just to keep checking when to abort. It also seems like this behavior would be almost trivial to implement at the lowest level (simply never call generate with too large a context). It would then hold by default that an LMQL query cannot overload the context of the model doing the generation, which is a nice guarantee for LMQL itself to offer rather than something every user has to reengineer.
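To make the request concrete, here is a pseudocode-level sketch of the kind of guard I have in mind. None of these names (checked_generate, PromptTooLongError, context_size) correspond to actual LMQL internals; it only illustrates the "don't call generate with too large a context" idea:

```python
class PromptTooLongError(Exception):
    """Raised instead of letting the backend hit a raw CUDA indexing error."""

def checked_generate(model, input_ids, max_new_tokens, context_size):
    # refuse to call generate() with a prompt that cannot fit, and surface a
    # clear, catchable error instead of an opaque device-side failure
    if len(input_ids) + max_new_tokens > context_size:
        raise PromptTooLongError(
            f"prompt of {len(input_ids)} tokens plus {max_new_tokens} new tokens "
            f"exceeds the model's context size of {context_size}"
        )
    return model.generate(input_ids, max_new_tokens=max_new_tokens)
```

A query that hits this condition could then terminate gracefully (or the error could be exposed to the query program), instead of the user having to track the token budget themselves.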