[WIP] rework document #153
Conversation
Pull request overview
This PR represents a work-in-progress rework of the document processing system, transitioning from a fragment-based approach to a hierarchical node-based architecture with schema-driven design and provenance tracking.
- Replaced the fragment-based Document class with a schema-based hierarchical document structure using discriminated-union node types (Document, Section, Paragraph, Sentence, Topic); see the sketch after this list
- Introduced comprehensive JSON schemas for runtime validation and TypeScript type generation
- Added provenance tracking and variant management for supporting multiple processing configurations per document
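For orientation, the sketch below shows one way such a discriminated union of node types could be shaped. DocumentNode and DocumentRootNode appear in the changed files; the field names and the remaining type names are assumptions for illustration, not the actual definitions in DocumentSchema.ts.

```ts
// Hypothetical sketch of a hierarchical, discriminated-union node model.
// The real definitions live in packages/ai/src/source/DocumentSchema.ts.
interface BaseNode {
  id: string;
  children?: DocumentNode[];
}

interface DocumentRootNode extends BaseNode { type: "document"; title: string; }
interface SectionNode extends BaseNode { type: "section"; heading: string; }
interface ParagraphNode extends BaseNode { type: "paragraph"; text: string; }
interface SentenceNode extends BaseNode { type: "sentence"; text: string; }
interface TopicNode extends BaseNode { type: "topic"; label: string; }

type DocumentNode =
  | DocumentRootNode
  | SectionNode
  | ParagraphNode
  | SentenceNode
  | TopicNode;

// The `type` tag is the discriminant: switching on it narrows the node type.
function nodeText(node: DocumentNode): string {
  switch (node.type) {
    case "paragraph":
    case "sentence":
      return node.text;
    default:
      return "";
  }
}
```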
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/ai/src/task/index.ts | Removed export of deprecated DocumentSplitterTask |
| packages/ai/src/task/DocumentSplitterTask.ts | Deleted legacy document splitter task implementation |
| packages/ai/src/source/index.ts | Deleted index file for source module (exports moved to common.ts) |
| packages/ai/src/source/StructuralParser.ts | Added new parser for converting markdown/text into hierarchical DocumentNode trees |
| packages/ai/src/source/ProvenanceUtils.ts | Added utilities for extracting variant provenance and generating config IDs |
| packages/ai/src/source/MasterDocument.ts | Deleted legacy MasterDocument class |
| packages/ai/src/source/DocumentSchema.ts | Added comprehensive schemas and types for hierarchical document structure, chunks, and metadata |
| packages/ai/src/source/DocumentNode.ts | Added node utility functions for ID generation, traversal, and token estimation |
| packages/ai/src/source/DocumentConverterText.ts | Deleted legacy text converter |
| packages/ai/src/source/DocumentConverterMarkdown.ts | Deleted legacy markdown converter |
| packages/ai/src/source/DocumentConverter.ts | Deleted abstract converter base class |
| packages/ai/src/source/Document.ts | Refactored Document class to manage hierarchical structure with multiple processing variants |
| packages/ai/src/common.ts | Updated exports to reflect new document architecture |
| packages/ai/README.md | Removed documentation for deprecated DocumentSplitterTask |
|   title: string,
|   format?: "markdown" | "text"
| ): Promise<DocumentRootNode> {
|   if (format === "markdown" || (!format && text.includes("#"))) {
Copilot AI · Jan 3, 2026
The auto-detection logic for markdown format is overly simplistic. Checking if text includes "#" will produce false positives for any text containing a hashtag or pound sign. Consider using a more robust detection method, such as checking for markdown header patterns (e.g., lines starting with # followed by whitespace) or requiring an explicit format parameter.
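A minimal sketch of that kind of pattern-based check, assuming a standalone helper; the name looksLikeMarkdown and the exact patterns are illustrative, not the code that landed in the follow-up commit:

```ts
// Heuristic: look for structural markdown syntax rather than a bare "#" anywhere.
function looksLikeMarkdown(text: string): boolean {
  const patterns = [
    /^#{1,6}\s+\S/m,        // ATX headings such as "# Title" or "## Section"
    /^\s{0,3}[-*+]\s+\S/m,  // bullet list items
    /^\s{0,3}\d+\.\s+\S/m,  // numbered list items
    /^```/m,                // fenced code blocks
  ];
  return patterns.some((re) => re.test(text));
}

// Usage in the parser would then read:
// if (format === "markdown" || (!format && looksLikeMarkdown(text))) { ... }
```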
@copilot open a new pull request to apply changes based on this feedback
|  * Check if two configIds represent the same variant
|  */
| export function isSameVariant(configId1: string, configId2: string): boolean {
|   return configId1 === configId2;
| }
|
| /**
Copilot AI · Jan 3, 2026
The function isSameVariant simply compares two strings for equality. This trivial implementation provides no value beyond the standard equality operator and makes the codebase less readable. Consider removing this function and using direct string comparison where needed, or add actual variant comparison logic if there are edge cases to handle.
Suggested change (the function is removed):
|  * Check if two configIds represent the same variant
|  */
| export function isSameVariant(configId1: string, configId2: string): boolean {
|   return configId1 === configId2;
| }
| /**
| export function mergeProvenance(...provenances: Provenance[]): Provenance {
|   const merged: Record<string, unknown> = {};
|   for (const prov of provenances) {
|     Object.assign(merged, prov);
|   }
|   return merged as Provenance;
Copilot AI · Jan 3, 2026
The mergeProvenance function uses Object.assign which performs a shallow merge and will overwrite earlier values with later ones. This behavior may not be appropriate for all provenance fields. Consider documenting the merge strategy (e.g., "later provenance values override earlier ones") or implementing a more sophisticated merge strategy if nested fields need to be preserved.
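A hedged sketch of what documenting the strategy and preserving nested fields could look like; isPlainObject is a hypothetical helper, the Provenance alias is a stand-in for the project type, and the one-level nested merge is only one possible policy:

```ts
// Stand-in for the project's Provenance type from DocumentSchema.ts:
type Provenance = Record<string, unknown>;

/**
 * Merge provenance records. On key conflicts, later records override earlier ones;
 * plain-object values are merged one level deep so nested fields are not dropped.
 * Sketch only: real provenance fields may need per-field rules instead.
 */
export function mergeProvenance(...provenances: Provenance[]): Provenance {
  const merged: Record<string, unknown> = {};
  for (const prov of provenances) {
    for (const [key, value] of Object.entries(prov)) {
      const existing = merged[key];
      merged[key] =
        isPlainObject(existing) && isPlainObject(value)
          ? { ...existing, ...value } // keep earlier nested fields, override on conflict
          : value;                    // primitives, arrays, etc.: later value wins
    }
  }
  return merged as Provenance;
}

function isPlainObject(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}
```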
packages/ai/src/source/Document.ts (Outdated)
| provenance:
|   "embeddingModel" in provenance && typeof provenance.embeddingModel === "string"
|     ? (provenance as VariantProvenance)
|     : ({
|         embeddingModel: (provenance as Provenance).embeddingModel || "",
|         chunkerStrategy: (provenance as Provenance).chunkerStrategy || "",
|         maxTokens: (provenance as Provenance).maxTokens || 512,
|         overlap: (provenance as Provenance).overlap || 0,
|         summaryModel: (provenance as Provenance).summaryModel as string | undefined,
|         nerModel: (provenance as Provenance).nerModel as string | undefined,
|       } as VariantProvenance),
Copilot AI · Jan 3, 2026
The type assertion to access provenance fields is fragile and bypasses type safety. The fallback object construction assumes certain fields exist in the Provenance type without proper type guards. Consider using proper type narrowing or creating a helper function that safely extracts these fields with proper error handling for missing required fields.
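A sketch of the kind of guarded extraction the comment describes. The follow-up commit mentioned later in this thread introduces an extractConfigFields helper; the version below is an assumed shape, with stand-in type aliases, not the actual code.

```ts
// Stand-ins for the project's types from DocumentSchema.ts:
type Provenance = Record<string, unknown>;
type VariantProvenance = {
  embeddingModel: string;
  chunkerStrategy: string;
  maxTokens: number;
  overlap: number;
  summaryModel?: string;
  nerModel?: string;
};

/** Type guard: does this provenance already carry the full variant config? */
function isVariantProvenance(p: Provenance): p is VariantProvenance {
  return (
    typeof p.embeddingModel === "string" &&
    typeof p.chunkerStrategy === "string" &&
    typeof p.maxTokens === "number"
  );
}

/** Safely extract variant config fields, with explicit defaults for required ones. */
function extractConfigFields(p: Provenance): VariantProvenance {
  if (isVariantProvenance(p)) return p;
  return {
    embeddingModel: typeof p.embeddingModel === "string" ? p.embeddingModel : "",
    chunkerStrategy: typeof p.chunkerStrategy === "string" ? p.chunkerStrategy : "",
    maxTokens: typeof p.maxTokens === "number" ? p.maxTokens : 512,
    overlap: typeof p.overlap === "number" ? p.overlap : 0,
    summaryModel: typeof p.summaryModel === "string" ? p.summaryModel : undefined,
    nerModel: typeof p.nerModel === "string" ? p.nerModel : undefined,
  };
}
```

With a helper along these lines, the ternary above could collapse to `provenance: extractConfigFields(provenance)`.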
@copilot open a new pull request to apply changes based on this feedback
| }
| }
| // Default for required field
| return 512; // Default maxTokens
Copilot AI · Jan 3, 2026
The magic number 512 is used as a default for maxTokens without explanation. Consider extracting this to a named constant (e.g., DEFAULT_MAX_TOKENS) at the module level to improve maintainability and make the default value explicit.
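A minimal sketch of the suggested extraction; the constant name matches the comment (and the follow-up commit noted at the end of this thread), while its exact placement is assumed:

```ts
// Module-level default so the fallback is named rather than a bare literal.
// 512 mirrors the value currently hard-coded in the return statement above.
export const DEFAULT_MAX_TOKENS = 512;

// ...and at the call site, in place of the literal:
// return DEFAULT_MAX_TOKENS;
```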
@copilot open a new pull request to apply changes based on this feedback
|  * Approximate token counting (v1)
|  */
| export function estimateTokens(text: string): number {
Copilot AI · Jan 3, 2026
The token estimation formula Math.ceil(text.length / 4) is a rough approximation that may not be accurate for many tokenization schemes. The actual token count can vary significantly based on the language, character set, and tokenizer used. Consider adding a comment noting this is an approximation and may need adjustment, or using an actual tokenizer library for more accurate results.
Suggested change (replacement for the lines quoted above):
|  * Approximate token counting (v1).
|  *
|  * NOTE: This is a rough heuristic based on character length and does not reflect
|  * any specific model's tokenizer. It may significantly under- or over-estimate
|  * the true token count depending on language, encoding, and tokenizer behavior.
|  * For use cases that require accurate token accounting, replace this with an
|  * implementation that uses the appropriate tokenizer library.
|  */
| export function estimateTokens(text: string): number {
|   // Rough approximation: average ~4 characters per token for many English texts.
|   // Adjust or replace as needed for your target models and languages.
@copilot Add a TODO to the comment so I can come back to this later.
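For reference, a sketch of the helper with the suggested note and the requested TODO folded in; the body uses the Math.ceil(text.length / 4) heuristic quoted in the review comment, and everything else is assumed:

```ts
/**
 * Approximate token counting (v1).
 *
 * NOTE: This is a rough heuristic based on character length and does not reflect
 * any specific model's tokenizer. It may significantly under- or over-estimate
 * the true token count depending on language, encoding, and tokenizer behavior.
 *
 * TODO: Revisit this and replace it with a real tokenizer (or per-model ratios)
 * when accurate token accounting is needed.
 */
export function estimateTokens(text: string): number {
  // Rough approximation: average ~4 characters per token for many English texts.
  return Math.ceil(text.length / 4);
}
```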
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Initial plan
* Improve markdown auto-detection with robust pattern matching
Co-authored-by: sroussey <127349+sroussey@users.noreply.github.com>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: sroussey <127349+sroussey@users.noreply.github.com>
…ument.addVariant (#158)
* Initial plan
* Use extractConfigFields for type-safe provenance handling
Co-authored-by: sroussey <127349+sroussey@users.noreply.github.com>
* Add comprehensive tests for type-safe provenance handling
Co-authored-by: sroussey <127349+sroussey@users.noreply.github.com>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: sroussey <127349+sroussey@users.noreply.github.com>
* Initial plan
* Extract magic number 512 to DEFAULT_MAX_TOKENS constant
Co-authored-by: sroussey <127349+sroussey@users.noreply.github.com>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: sroussey <127349+sroussey@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
No description provided.