
Conversation

@dluc
Collaborator

@dluc dluc commented Nov 27, 2024

Motivation and Context (Why the change? What's the scenario?)

Allow tokenizers to be configured like any other dependency, removing several warnings that appeared in the logs and adopting the most common defaults.
When using OpenAI, try to autodetect the correct tokenizer from the model name, reducing the configuration required.

High level description (Approach, Design)

  • Move all tokenizers to the new Tiktoken package; this also removes unnecessary references to OpenAI code.
  • Allow tokenizer dependencies to be configured via new config entries (see the sketch after this list).
  • Autodetect the tokenizer when using OpenAI, using the model name and internal tiktoken references.
  • Fix a few DI params that were missing or not passed correctly.
  • Revisit all the default tokenizers.
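As a rough sketch of how the new config entries can be used (the tokenizer property names are the ones discussed in the comments below; the other members and values are illustrative):

// Sketch only: with OpenAI the tokenizer can be left empty and is
// autodetected from the model name; "p50k", "cl100k" and "o200k"
// can still be set explicitly when needed.
var openAIConfig = new OpenAIConfig
{
    TextModel = "gpt-4o",             // illustrative model name
    TextModelTokenizer = string.Empty // empty => autodetect from the model name
};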

@dluc dluc merged commit 664d30f into microsoft:main Nov 27, 2024
5 checks passed
@dluc dluc deleted the tokenizersimprov branch November 27, 2024 20:37
@marcominerva
Contributor

Hi @dluc!

I have a question about this PR. In the OpenAIConfig.cs file there is this new property:

/// <summary>
/// Name of the tokenizer used to count tokens.
/// Supported values: "p50k", "cl100k", "o200k". Leave it empty for autodetect.
/// </summary>
public string TextModelTokenizer { get; set; } = string.Empty;

The corresponding property in AzureOpenAIConfig.cs is the following:

/// <summary>
/// Name of the tokenizer used to count tokens.
/// Supported values: "p50k", "cl100k", "o200k". Leave it empty if unsure.
/// </summary>
public string Tokenizer { get; set; } = "cl100k";

So, in the first case the comment says Leave it empty for autodetect, and indeed the default value of the property is string.Empty. On the other hand, in AzureOpenAIConfig.cs it says Leave it empty if unsure, but the default value is cl100k. Is this intended, or is it a typo?

@dluc
Collaborator Author

dluc commented Nov 29, 2024


Hi @marcominerva, the comments are intentional, no typos.

When working with OpenAI, we can determine the appropriate tokenizer automatically by checking the model name. The only exception I'm aware of is fine-tuned models, where we rely on the developer to configure the tokenizer accordingly (maybe we should mention this in the comment).
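To illustrate the idea (this is not the exact Kernel Memory code, just a sketch of prefix-based detection):

// Illustrative only: map an OpenAI model name to a tiktoken encoding name.
private static string DetectEncoding(string modelName)
{
    if (modelName.StartsWith("gpt-4o", StringComparison.OrdinalIgnoreCase)
        || modelName.StartsWith("o1", StringComparison.OrdinalIgnoreCase))
    {
        return "o200k";
    }

    if (modelName.StartsWith("gpt-4", StringComparison.OrdinalIgnoreCase)
        || modelName.StartsWith("gpt-3.5", StringComparison.OrdinalIgnoreCase)
        || modelName.StartsWith("text-embedding-", StringComparison.OrdinalIgnoreCase))
    {
        return "cl100k";
    }

    // Older completion models (e.g. text-davinci-003) used p50k; fine-tuned
    // models may not match any prefix, which is why explicit config remains available.
    return "p50k";
}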

With Azure, however, we only have a "deployment name" to work with, so the configuration defaults are designed to minimize errors and avoid warnings in the logs. Since all OpenAI embedding models use cl100k, I considered it a reasonable default. For gpt-4o text models, o200k is technically used, but the differences are minimal. While not 100% precise, this approach should avoid significant issues.
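If a deployment is known to serve a gpt-4o model, the encoding can be set explicitly rather than relying on the default, along these lines (a sketch based on the AzureOpenAIConfig snippet quoted above):

var azureConfig = new AzureOpenAIConfig
{
    // The deployment name alone doesn't reveal the underlying model,
    // so state the encoding explicitly when it is known.
    Tokenizer = "o200k"
};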

If this default causes problems, I'm open to suggestions. I did consider autodetecting the tokenizer by calling the API, but that felt overly complex and somewhat expensive. Perhaps a separate tool for this task could be explored. That said, the lack of precision affects other models as well, like Llama and Anthropic, where we're also using OpenAI tokenizers. Given this broader context, I think this implementation is reasonable, but I'm happy to hear other perspectives.

@marcominerva
Contributor

Thank you for the clarification 👍
