Token-triggered processors #1407
Comments
Nice! I feel like it's better to define an optional start and end token, since it could be unintuitive to have only a start trigger. Furthermore, users could also reuse a regex pattern, e.g. a simple character counter that reuses a pattern until the LLM outputs a terminating token.
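As a rough illustration of that second idea (both the repeated pattern and the terminating token below are made up):

```python
# Made-up example: repeat a small "letter: count" pattern any number of times,
# until the model emits the terminating token "<|done|>".
counting_pattern = r"(?:[a-z]: [0-9]+\n)*<\|done\|>"
```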
It would be better, albeit more verbose, to give users a simple interface to build custom logits processors. Here's a first idea:

```python
build_processor(cond_fns, process_fns, init_state)
```

where the condition and processing functions have the signatures `cond_fns_i(states[i], token_ids)` and `process_fns_i(states[i], token_ids, logits)`. This is one possibility.
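A minimal sketch of how such a `build_processor` could be wired up, assuming each `process_fn` returns an updated state along with the modified logits (that detail is not specified above):

```python
# Minimal sketch; assumes each process_fn returns (new_state, new_logits),
# which is an assumption on top of the signatures given above.
def build_processor(cond_fns, process_fns, init_state):
    states = list(init_state)

    def processor(token_ids, logits):
        for i, (cond_fn, process_fn) in enumerate(zip(cond_fns, process_fns)):
            # Only apply the i-th processing function when its condition holds.
            if cond_fn(states[i], token_ids):
                states[i], logits = process_fn(states[i], token_ids, logits)
        return logits

    return processor
```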
**Option 1: Default processors + dictionary mapping**

I like the idea of using the native processor interface. If we think the processor interface is reasonably stable, we could do something similar to what I did in #1408. In this case we'd do

```python
class Foo(BaseModel):
    a: int

class Bar(BaseModel):
    b: int

# Create processors from schemas
processor_1 = JSONLogitsProcessor(Foo, tokenizer)
processor_2 = JSONLogitsProcessor(Bar, tokenizer)
processor_3 = RegexLogitsProcessor(r"[0-9]{4}", tokenizer)

processor = MultiLogitsProcessor(
    processor_1,                               # default processor
    {'foo': processor_2, 'bar': processor_3},  # toggleable processors
)
```

Alternatively, users should be able to provide an exact token ID that triggers another processor, though this is a terrible developer experience IMO.

**Option 2: Wrapped processors**

Another implementation might follow @903124's suggestion with optional stops by wrapping each processor in a `TriggeredLogitsProcessor`:

```python
# Create processor from schema
processor_1 = TriggeredLogitsProcessor(
    JSONLogitsProcessor(Foo, tokenizer),
    start='<think>',
    stop='</think>',  # optional stop string
)

processor = MultiLogitsProcessor(
    processor_1,
    ...  # other TriggeredLogitsProcessors
)
```

This would be unstructured by default, I imagine, if a processor falls outside its start/stop substring range. Then the `MultiLogitsProcessor` would handle switching between them.

Of these, I think I prefer option 2. I think it could also be reasonably well wrapped by the interface @rlouf described.
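For concreteness, here is one way a `TriggeredLogitsProcessor` like the one above could work internally. This is only a sketch: the class does not exist in outlines, the wrapper here takes the tokenizer explicitly so it can decode the prefix (which differs slightly from the call above), and the `(input_ids, logits)` calling convention is an assumption.

```python
from typing import Optional

import torch


class TriggeredLogitsProcessor:
    """Sketch: delegate to a wrapped processor only between `start` and `stop`."""

    def __init__(self, processor, tokenizer, start: str, stop: Optional[str] = None):
        self.processor = processor  # any callable (input_ids, logits) -> logits
        self.tokenizer = tokenizer
        self.start = start
        self.stop = stop
        self.active = False

    def __call__(self, input_ids: torch.Tensor, logits: torch.Tensor) -> torch.Tensor:
        # Decode the generation so far and toggle on/off based on the substrings.
        text = self.tokenizer.decode(input_ids[0])
        if not self.active and text.endswith(self.start):
            self.active = True
        elif self.active and self.stop is not None and text.endswith(self.stop):
            self.active = False

        if not self.active:
            return logits  # unstructured outside the triggered region
        return self.processor(input_ids, logits)
```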
You can implement all of these with the interface I mentioned above.
I think I have a better solution. First, as far as debugging goes, I'm not sure your solution is necessarily better than the regex alternative. Then, it is quite complex. I think a much better solution would be to let users write e.g.

```python
from outlines.types import text

structure = text + "<think>" + text + "</think>" + regex(r"(yes|no)")
```

Here there is only one regex that should be compiled. We could also decompose `model(prompt, structure)` into

```python
result = model(prompt, until="<think>")
prompt += result
result = model(prompt, until="</think>")
prompt += result
result = model(prompt, regex(r"(yes|no)"))
```

It is always possible to add a debug mode:

```python
model(prompt, structure, debug=True)
```

or maybe better, have another class of generators:

```python
generator = REPL(model, structure)
generator(prompt)
```
I've been thinking about it this way recently: using the strawberry example above, a user can define a factory function and then call it.
I think this is a good interface.

Correct me if I'm wrong, but this gives us a way to compose structured and unstructured generation sequentially. In which case, building this interface doesn't seem particularly difficult and strikes me as having low mental overhead. It'd be nice to have

```python
structure = text + "<think>" + text + json_block('python') + text + "</think>" + regex(r"(yes|no)")
```

However, one issue here is that it does not seem obvious to me how to flexibly handle tokens that you may or may not need. As a (very reductive) example, suppose I want to force the model to write "cat" after the sequence "my favorite pet is a ". If I want this from an unstructured block, I'd have to write my own regex to explicitly include this condition (see the sketch at the end of this comment). This is simple for my example, but complicated nested structures that trigger different types of structure could strain the compiler quite a bit.

Suppose, for example, you did not know that you needed a LaTeX block, but if that block appears in the course of unstructured text, then you want to apply a LaTeX grammar.
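To make the "cat" example concrete, the hand-written regex might look something like this (illustrative only, written against a plain-regex interface rather than anything that exists in outlines):

```python
# Illustrative hand-written pattern for the reductive "cat" example: anything,
# then the trigger phrase with the forced completion, then anything again.
# Even this only approximately captures the intent (it doesn't constrain every
# occurrence of the phrase), which is part of the point.
pattern = r"[\s\S]*my favorite pet is a cat[\s\S]*"
```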
There are certain cases where outlines users require unstructured text until a particular token is reached, such as `<|beginfunctioncall|>`, code blocks, JSON blocks, etc. Currently, these are often handled with regular expressions that can be quite messy.
DeepSeek's thinking tokens are a great example. DeepSeek R1's response begins with `<think>`, then includes some amount of unstructured text, followed by `</think>`. Thinking tokens are the primary reason for R1's performance.

The issue with the current solution (specifying a regular expression) is that the regular expression is quite complicated and difficult to debug:
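For instance, even a minimal version of that pattern already mixes free-form and constrained pieces (illustrative only; the real regex depends on the output schema):

```python
# Illustrative only: a free-form thinking block followed by a constrained answer.
pattern = r"<think>[\s\S]*</think>\s*(yes|no)"
```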
This includes both a thinking block and the structured output at the end.
Regular expression compilation times can explode if you want to include things like thought controls that limit the thinking response, allowed/disallowed words, or structured text in the thinking block. For example, if you force the thinking block to have a length between 10 and 50:
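Something along these lines (again illustrative; the bound here is in characters):

```python
# Illustrative: a 10-50 character thinking block plus the constrained answer.
# The bounded repetition has to be unrolled during compilation, which is
# where compile times start to grow quickly.
pattern = r"<think>[\s\S]{10,50}</think>\s*(yes|no)"
```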
Compiling a single, smaller regular expression for only the thinking block could take some load off of our compiler, and simplify the user interface for complicated and mixed forms of structure.
Ideally, I would prefer to be able to structure the thinking block and the output block separately, i.e. have separate logit processors for different sections of text. You could also jump to any number of different logit processors based on tokens you hit.
Consider the following interface, where I define a default pattern `pattern_1`. When a `b` token (or substring?) is hit, I want to switch to `pattern_2`; or, in the case of DeepSeek R1 (roughly), switch patterns at the `<think>`/`</think>` tags.
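A sketch of what such an interface might look like, reusing the `text`/`regex` combinators proposed elsewhere in this thread; `MultiPatternGenerator` and its `triggers` argument are purely hypothetical names:

```python
# Hypothetical interface for discussion; none of these names exist in outlines.
# Generic case: use pattern_1 by default, switch to pattern_2 once "b" is generated.
generator = MultiPatternGenerator(model, default=pattern_1, triggers={"b": pattern_2})

# DeepSeek R1 case (roughly): free text inside <think>...</think>,
# then a constrained yes/no answer after the closing tag.
generator = MultiPatternGenerator(
    model,
    default=text,
    triggers={
        "<think>": text,
        "</think>": regex(r"(yes|no)"),
    },
)
result = generator(prompt)
```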
I think it'd be a nice convenience feature, and it might cut down on compilation times for very complicated grammars. It would introduce some challenges in how to communicate patterns to inference servers like vLLM, which does not have a way to receive multiple patterns for constrained decoding.
Thoughts?