
Conversation

@dluc
Collaborator

@dluc dluc commented Nov 27, 2024

Motivation and Context (Why the change? What's the scenario?)

Allow tokenizers to be configured like any other dependency, removing several warnings that appeared in the logs and adopting the most common defaults.
When using OpenAI, try to autodetect the correct tokenizer from the model name, reducing the configuration required.

High level description (Approach, Design)

  • Move all tokenizers to the new Tiktoken package; this also removes unnecessary references to OpenAI code.
  • Allow tokenizer dependencies to be configured via new config entries (see the sketch after this list).
  • Autodetect the tokenizer when using OpenAI, using the model name and internal tiktoken references.
  • Fix a few DI params that were missing or not passed correctly.
  • Revisit all the default tokenizers.
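As a rough sketch of how the new config entries can be used (the tokenizer property names are the ones discussed in the comments below; the other members and values are illustrative):

// Sketch only: with OpenAI the tokenizer can be left empty and is
// autodetected from the model name; "p50k", "cl100k" and "o200k"
// can still be set explicitly when needed.
var openAIConfig = new OpenAIConfig
{
    TextModel = "gpt-4o",             // illustrative model name
    TextModelTokenizer = string.Empty // empty => autodetect from the model name
};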

@dluc dluc merged commit 664d30f into microsoft:main Nov 27, 2024
5 checks passed
@dluc dluc deleted the tokenizersimprov branch November 27, 2024 20:37
@marcominerva
Contributor

Hi @dluc!

I have a question about this PR. In the OpenAIConfig.cs file there is this new property:

/// <summary>
/// Name of the tokenizer used to count tokens.
/// Supported values: "p50k", "cl100k", "o200k". Leave it empty for autodetect.
/// </summary>
public string TextModelTokenizer { get; set; } = string.Empty;

The corresponding property in AzureOpenAIConfig.cs is the following:

/// <summary>
/// Name of the tokenizer used to count tokens.
/// Supported values: "p50k", "cl100k", "o200k". Leave it empty if unsure.
/// </summary>
public string Tokenizer { get; set; } = "cl100k";

So, in the first case the comment says Leave it empty for autodetect, and indeed the default value of the property is string.Empty. On the other hand, in AzureOpenAIConfig.cs it says Leave it empty if unsure, but the default value is cl100k. Is this intended, or is it a typo?

@dluc
Collaborator Author

dluc commented Nov 29, 2024


Hi @marcominerva, the comments are intentional, no typos.

When working with OpenAI, we can determine the appropriate tokenizer automatically by checking the model name. The only exception I'm aware of is fine-tuned models, where we rely on the developer to configure the tokenizer accordingly (maybe we should mention this in the comment).
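To illustrate the idea (this is not the exact Kernel Memory code, just a sketch of prefix-based detection):

// Illustrative only: map an OpenAI model name to a tiktoken encoding name.
private static string DetectEncoding(string modelName)
{
    if (modelName.StartsWith("gpt-4o", StringComparison.OrdinalIgnoreCase)
        || modelName.StartsWith("o1", StringComparison.OrdinalIgnoreCase))
    {
        return "o200k";
    }

    if (modelName.StartsWith("gpt-4", StringComparison.OrdinalIgnoreCase)
        || modelName.StartsWith("gpt-3.5", StringComparison.OrdinalIgnoreCase)
        || modelName.StartsWith("text-embedding-", StringComparison.OrdinalIgnoreCase))
    {
        return "cl100k";
    }

    // Older completion models (e.g. text-davinci-003) used p50k; fine-tuned
    // models may not match any prefix, which is why explicit config remains available.
    return "p50k";
}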

With Azure, however, we only have a "deployment name" to work with, so the configuration defaults are designed to minimize errors and avoid warnings in the logs. Since all OpenAI embedding models use cl100k, I considered it a reasonable default. For gpt-4o text models, o200k is technically used, but the differences are minimal. While not 100% precise, this approach should avoid significant issues.
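If a deployment is known to serve a gpt-4o model, the encoding can be set explicitly rather than relying on the default, along these lines (a sketch based on the AzureOpenAIConfig snippet quoted above):

var azureConfig = new AzureOpenAIConfig
{
    // The deployment name alone doesn't reveal the underlying model,
    // so state the encoding explicitly when it is known.
    Tokenizer = "o200k"
};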

If this default causes problems, I'm open to suggestions. I did consider autodetecting the tokenizer by calling the API, but that felt overly complex and somewhat expensive. Perhaps a separate tool for this task could be explored. That said, the lack of precision affects other models as well, like Llama and Anthropic, where we're also using OpenAI tokenizers. Given this broader context, I think this implementation is reasonable, but I'm happy to hear other perspectives.

@marcominerva
Contributor

Thank you for the clarification 👍
