Skip to content

Conversation

@sroussey
Copy link
Collaborator

@sroussey sroussey commented Jan 3, 2026

No description provided.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR represents a work-in-progress rework of the document processing system, transitioning from a fragment-based approach to a hierarchical node-based architecture with schema-driven design and provenance tracking.

  • Replaced fragment-based Document class with a schema-based hierarchical document structure using discriminated union node types (Document, Section, Paragraph, Sentence, Topic)
  • Introduced comprehensive JSON schemas for runtime validation and TypeScript type generation
  • Added provenance tracking and variant management for supporting multiple processing configurations per document

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
packages/ai/src/task/index.ts Removed export of deprecated DocumentSplitterTask
packages/ai/src/task/DocumentSplitterTask.ts Deleted legacy document splitter task implementation
packages/ai/src/source/index.ts Deleted index file for source module (exports moved to common.ts)
packages/ai/src/source/StructuralParser.ts Added new parser for converting markdown/text into hierarchical DocumentNode trees
packages/ai/src/source/ProvenanceUtils.ts Added utilities for extracting variant provenance and generating config IDs
packages/ai/src/source/MasterDocument.ts Deleted legacy MasterDocument class
packages/ai/src/source/DocumentSchema.ts Added comprehensive schemas and types for hierarchical document structure, chunks, and metadata
packages/ai/src/source/DocumentNode.ts Added node utility functions for ID generation, traversal, and token estimation
packages/ai/src/source/DocumentConverterText.ts Deleted legacy text converter
packages/ai/src/source/DocumentConverterMarkdown.ts Deleted legacy markdown converter
packages/ai/src/source/DocumentConverter.ts Deleted abstract converter base class
packages/ai/src/source/Document.ts Refactored Document class to manage hierarchical structure with multiple processing variants
packages/ai/src/common.ts Updated exports to reflect new document architecture
packages/ai/README.md Removed documentation for deprecated DocumentSplitterTask

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

title: string,
format?: "markdown" | "text"
): Promise<DocumentRootNode> {
if (format === "markdown" || (!format && text.includes("#"))) {
Copy link

Copilot AI Jan 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The auto-detection logic for markdown format is overly simplistic. Checking if text includes "#" will produce false positives for any text containing a hashtag or pound sign. Consider using a more robust detection method, such as checking for markdown header patterns (e.g., lines starting with # followed by whitespace) or requiring an explicit format parameter.

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot open a new pull request to apply changes based on this feedback

Comment on lines +165 to +171
* Check if two configIds represent the same variant
*/
export function isSameVariant(configId1: string, configId2: string): boolean {
return configId1 === configId2;
}

/**
Copy link

Copilot AI Jan 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function isSameVariant simply compares two strings for equality. This trivial implementation provides no value beyond the standard equality operator and makes the codebase less readable. Consider removing this function and using direct string comparison where needed, or add actual variant comparison logic if there are edge cases to handle.

Suggested change
* Check if two configIds represent the same variant
*/
export function isSameVariant(configId1: string, configId2: string): boolean {
return configId1 === configId2;
}
/**

Copilot uses AI. Check for mistakes.
Comment on lines +156 to +161
export function mergeProvenance(...provenances: Provenance[]): Provenance {
const merged: Record<string, unknown> = {};
for (const prov of provenances) {
Object.assign(merged, prov);
}
return merged as Provenance;
Copy link

Copilot AI Jan 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mergeProvenance function uses Object.assign which performs a shallow merge and will overwrite earlier values with later ones. This behavior may not be appropriate for all provenance fields. Consider documenting the merge strategy (e.g., "later provenance values override earlier ones") or implementing a more sophisticated merge strategy if nested fields need to be preserved.

Copilot uses AI. Check for mistakes.
Comment on lines 56 to 66
provenance:
"embeddingModel" in provenance && typeof provenance.embeddingModel === "string"
? (provenance as VariantProvenance)
: ({
embeddingModel: (provenance as Provenance).embeddingModel || "",
chunkerStrategy: (provenance as Provenance).chunkerStrategy || "",
maxTokens: (provenance as Provenance).maxTokens || 512,
overlap: (provenance as Provenance).overlap || 0,
summaryModel: (provenance as Provenance).summaryModel as string | undefined,
nerModel: (provenance as Provenance).nerModel as string | undefined,
} as VariantProvenance),
Copy link

Copilot AI Jan 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type assertion to access provenance fields is fragile and bypasses type safety. The fallback object construction assumes certain fields exist in the Provenance type without proper type guards. Consider using proper type narrowing or creating a helper function that safely extracts these fields with proper error handling for missing required fields.

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot open a new pull request to apply changes based on this feedback

}
}
// Default for required field
return 512; // Default maxTokens
Copy link

Copilot AI Jan 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The magic number 512 is used as a default for maxTokens without explanation. Consider extracting this to a named constant (e.g., DEFAULT_MAX_TOKENS) at the module level to improve maintainability and make the default value explicit.

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot open a new pull request to apply changes based on this feedback

Comment on lines +71 to +73
* Approximate token counting (v1)
*/
export function estimateTokens(text: string): number {
Copy link

Copilot AI Jan 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The token estimation formula Math.ceil(text.length / 4) is a rough approximation that may not be accurate for many tokenization schemes. The actual token count can vary significantly based on the language, character set, and tokenizer used. Consider adding a comment noting this is an approximation and may need adjustment, or using an actual tokenizer library for more accurate results.

Suggested change
* Approximate token counting (v1)
*/
export function estimateTokens(text: string): number {
* Approximate token counting (v1).
*
* NOTE: This is a rough heuristic based on character length and does not reflect
* any specific model's tokenizer. It may significantly under- or over-estimate
* the true token count depending on language, encoding, and tokenizer behavior.
* For use cases that require accurate token accounting, replace this with an
* implementation that uses the appropriate tokenizer library.
*/
export function estimateTokens(text: string): number {
// Rough approximation: average ~4 characters per token for many English texts.
// Adjust or replace as needed for your target models and languages.

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

@sroussey sroussey Jan 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot Add a TODO to the comment so i can come back to this later.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Contributor

Copilot AI commented Jan 4, 2026

@sroussey I've opened a new pull request, #157, to work on those changes. Once the pull request is ready, I'll request review from you.

Copy link
Contributor

Copilot AI commented Jan 4, 2026

@sroussey I've opened a new pull request, #158, to work on those changes. Once the pull request is ready, I'll request review from you.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Contributor

Copilot AI commented Jan 4, 2026

@sroussey I've opened a new pull request, #159, to work on those changes. Once the pull request is ready, I'll request review from you.

sroussey and others added 5 commits January 4, 2026 00:48
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Initial plan

* Improve markdown auto-detection with robust pattern matching

Co-authored-by: sroussey <127349+sroussey@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: sroussey <127349+sroussey@users.noreply.github.com>
…ument.addVariant (#158)

* Initial plan

* Use extractConfigFields for type-safe provenance handling

Co-authored-by: sroussey <127349+sroussey@users.noreply.github.com>

* Add comprehensive tests for type-safe provenance handling

Co-authored-by: sroussey <127349+sroussey@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: sroussey <127349+sroussey@users.noreply.github.com>
* Initial plan

* Extract magic number 512 to DEFAULT_MAX_TOKENS constant

Co-authored-by: sroussey <127349+sroussey@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: sroussey <127349+sroussey@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@sroussey sroussey merged commit afbbf2e into rag Jan 4, 2026
1 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants