Allow to select tokenizers via configuration #912
Conversation
- allow to configure tokenizer dependencies
- auto detect tokenizer when using OpenAI
- fix a few DI missing params
Hi @dluc! I have a question about this PR. In the OpenAIConfig.cs file there is this new property: kernel-memory/extensions/OpenAI/OpenAI/OpenAIConfig.cs, lines 55 to 59 in 53db61a.
The corresponding property in AzureOpenAIConfig.cs is the following: kernel-memory/extensions/AzureOpenAI/AzureOpenAI/AzureOpenAIConfig.cs, lines 66 to 70 in 53db61a.
So, in the first case the comment says "Leave empty for autodetect", and in fact the default value of the property is …
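The contrast being asked about looks roughly like the sketch below. The property names, comments, and default values are illustrative only, not copied from the linked files:

```csharp
// Illustrative sketch of the two properties being compared; see the linked
// OpenAIConfig.cs and AzureOpenAIConfig.cs for the real definitions.
public class OpenAIConfigSketch
{
    // The model name is known, so an empty value can mean
    // "autodetect the tokenizer from the model name".
    public string TextTokenizer { get; set; } = string.Empty;
}

public class AzureOpenAIConfigSketch
{
    // Only a deployment name is available, so a concrete default
    // (cl100k) is used instead of autodetection.
    public string TextTokenizer { get; set; } = "cl100k";
}
```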
Hi @marcominerva, the comments are intentional, no typos. When working with OpenAI we can determine the appropriate tokenizer automatically by checking the model name. The only exception I'm aware of is fine-tuned models, where we have to rely on the developer to configure the tokenizer accordingly (maybe we should mention this in the comment). With Azure, however, we only have a "deployment name" to work with, so the configuration defaults are designed to minimize errors and avoid warnings in the logs. Since all OpenAI embedding models use cl100k, I considered it a reasonable default. For gpt-4o text models, o200k is technically used, but the differences are minimal. While not 100% precise, this approach should avoid significant issues. If this default causes problems, I'm open to suggestions.

I did consider autodetecting the tokenizer by calling the API, but that felt overly complex and somewhat expensive; perhaps a separate tool for this task could be explored. That said, the lack of precision affects other models as well, like Llama and Anthropic, where we're also using OpenAI tokenizers. Given this broader context, I think this implementation is reasonable, but I'm happy to hear other perspectives.
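As a concrete illustration of the autodetection described above, a minimal sketch under those assumptions could look like this (the PR's actual mapping, names, and return type may differ):

```csharp
// Sketch of model-name based tokenizer selection (not the PR's actual code).
internal static class TokenizerDetection
{
    public static string DetectTokenizer(string? modelName)
    {
        // Nothing to inspect (e.g. only an Azure deployment name is known):
        // fall back to cl100k, which all current OpenAI embedding models use.
        if (string.IsNullOrWhiteSpace(modelName))
        {
            return "cl100k_base";
        }

        var name = modelName.Trim().ToLowerInvariant();

        // gpt-4o text models technically use o200k, although the difference
        // from cl100k is small for token-counting purposes.
        if (name.StartsWith("gpt-4o"))
        {
            return "o200k_base";
        }

        // Everything else (gpt-3.5/gpt-4 chat models, text-embedding-*) uses
        // cl100k. Fine-tuned models cannot be detected reliably and should be
        // configured explicitly by the developer.
        return "cl100k_base";
    }
}
```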
Thank you for the clarification 👍
Motivation and Context (Why the change? What's the scenario?)
Allow tokenizers to be configured like any other dependency, removing several warnings that appeared in the logs and using the most common defaults.
When using OpenAI, try to autodetect the correct tokenizer from the model name, reducing the required configuration.
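For illustration, usage could look roughly like the sketch below. The TextTokenizer property name is an assumption based on the discussion above, and the model names and API key handling are placeholders:

```csharp
using System;
using Microsoft.KernelMemory;

// Sketch only: "TextTokenizer" is an assumed property name; other values are
// placeholders.
var config = new OpenAIConfig
{
    APIKey = Environment.GetEnvironmentVariable("OPENAI_API_KEY") ?? string.Empty,
    TextModel = "gpt-4o-mini",
    EmbeddingModel = "text-embedding-3-small",

    // Leave empty to autodetect the tokenizer from the model name;
    // set it explicitly for fine-tuned models.
    TextTokenizer = string.Empty
};

var memory = new KernelMemoryBuilder()
    .WithOpenAI(config)
    .Build<MemoryServerless>();
```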
High level description (Approach, Design)