[WIP] rework document #153
Conversation
Pull request overview
This PR represents a work-in-progress rework of the document processing system, transitioning from a fragment-based approach to a hierarchical node-based architecture with schema-driven design and provenance tracking.
- Replaced the fragment-based Document class with a schema-based hierarchical document structure using discriminated-union node types (Document, Section, Paragraph, Sentence, Topic); see the sketch after this list
- Introduced comprehensive JSON schemas for runtime validation and TypeScript type generation
- Added provenance tracking and variant management for supporting multiple processing configurations per document
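For orientation, the sketch below shows one way such a discriminated union of node types could be shaped. DocumentNode and DocumentRootNode appear in the changed files; the field names and the remaining type names are assumptions for illustration, not the actual definitions in DocumentSchema.ts.

```ts
// Hypothetical sketch of a hierarchical, discriminated-union node model.
// The real definitions live in packages/ai/src/source/DocumentSchema.ts.
interface BaseNode {
  id: string;
  children?: DocumentNode[];
}

interface DocumentRootNode extends BaseNode { type: "document"; title: string; }
interface SectionNode extends BaseNode { type: "section"; heading: string; }
interface ParagraphNode extends BaseNode { type: "paragraph"; text: string; }
interface SentenceNode extends BaseNode { type: "sentence"; text: string; }
interface TopicNode extends BaseNode { type: "topic"; label: string; }

type DocumentNode =
  | DocumentRootNode
  | SectionNode
  | ParagraphNode
  | SentenceNode
  | TopicNode;

// The `type` tag is the discriminant: switching on it narrows the node type.
function nodeText(node: DocumentNode): string {
  switch (node.type) {
    case "paragraph":
    case "sentence":
      return node.text;
    default:
      return "";
  }
}
```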
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/ai/src/task/index.ts | Removed export of deprecated DocumentSplitterTask |
| packages/ai/src/task/DocumentSplitterTask.ts | Deleted legacy document splitter task implementation |
| packages/ai/src/source/index.ts | Deleted index file for source module (exports moved to common.ts) |
| packages/ai/src/source/StructuralParser.ts | Added new parser for converting markdown/text into hierarchical DocumentNode trees |
| packages/ai/src/source/ProvenanceUtils.ts | Added utilities for extracting variant provenance and generating config IDs |
| packages/ai/src/source/MasterDocument.ts | Deleted legacy MasterDocument class |
| packages/ai/src/source/DocumentSchema.ts | Added comprehensive schemas and types for hierarchical document structure, chunks, and metadata |
| packages/ai/src/source/DocumentNode.ts | Added node utility functions for ID generation, traversal, and token estimation |
| packages/ai/src/source/DocumentConverterText.ts | Deleted legacy text converter |
| packages/ai/src/source/DocumentConverterMarkdown.ts | Deleted legacy markdown converter |
| packages/ai/src/source/DocumentConverter.ts | Deleted abstract converter base class |
| packages/ai/src/source/Document.ts | Refactored Document class to manage hierarchical structure with multiple processing variants |
| packages/ai/src/common.ts | Updated exports to reflect new document architecture |
| packages/ai/README.md | Removed documentation for deprecated DocumentSplitterTask |
|   title: string,
|   format?: "markdown" | "text"
| ): Promise<DocumentRootNode> {
|   if (format === "markdown" || (!format && text.includes("#"))) {
Copilot AI · Jan 3, 2026
The auto-detection logic for markdown format is overly simplistic. Checking if text includes "#" will produce false positives for any text containing a hashtag or pound sign. Consider using a more robust detection method, such as checking for markdown header patterns (e.g., lines starting with # followed by whitespace) or requiring an explicit format parameter.
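A minimal sketch of that kind of pattern-based check, assuming a standalone helper; the name looksLikeMarkdown and the exact patterns are illustrative, not the code that landed in the follow-up commit:

```ts
// Heuristic: look for structural markdown syntax rather than a bare "#" anywhere.
function looksLikeMarkdown(text: string): boolean {
  const patterns = [
    /^#{1,6}\s+\S/m,        // ATX headings such as "# Title" or "## Section"
    /^\s{0,3}[-*+]\s+\S/m,  // bullet list items
    /^\s{0,3}\d+\.\s+\S/m,  // numbered list items
    /^```/m,                // fenced code blocks
  ];
  return patterns.some((re) => re.test(text));
}

// Usage in the parser would then read:
// if (format === "markdown" || (!format && looksLikeMarkdown(text))) { ... }
```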
@copilot open a new pull request to apply changes based on this feedback
|  * Check if two configIds represent the same variant
|  */
| export function isSameVariant(configId1: string, configId2: string): boolean {
|   return configId1 === configId2;
| }
|
| /**
Copilot AI · Jan 3, 2026
The function isSameVariant simply compares two strings for equality. This trivial implementation provides no value beyond the standard equality operator and makes the codebase less readable. Consider removing this function and using direct string comparison where needed, or add actual variant comparison logic if there are edge cases to handle.
Suggested change (the function is removed):
|  * Check if two configIds represent the same variant
|  */
| export function isSameVariant(configId1: string, configId2: string): boolean {
|   return configId1 === configId2;
| }
| /**
| export function mergeProvenance(...provenances: Provenance[]): Provenance {
|   const merged: Record<string, unknown> = {};
|   for (const prov of provenances) {
|     Object.assign(merged, prov);
|   }
|   return merged as Provenance;
Copilot AI · Jan 3, 2026
The mergeProvenance function uses Object.assign which performs a shallow merge and will overwrite earlier values with later ones. This behavior may not be appropriate for all provenance fields. Consider documenting the merge strategy (e.g., "later provenance values override earlier ones") or implementing a more sophisticated merge strategy if nested fields need to be preserved.
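A hedged sketch of what documenting the strategy and preserving nested fields could look like; isPlainObject is a hypothetical helper, the Provenance alias is a stand-in for the project type, and the one-level nested merge is only one possible policy:

```ts
// Stand-in for the project's Provenance type from DocumentSchema.ts:
type Provenance = Record<string, unknown>;

/**
 * Merge provenance records. On key conflicts, later records override earlier ones;
 * plain-object values are merged one level deep so nested fields are not dropped.
 * Sketch only: real provenance fields may need per-field rules instead.
 */
export function mergeProvenance(...provenances: Provenance[]): Provenance {
  const merged: Record<string, unknown> = {};
  for (const prov of provenances) {
    for (const [key, value] of Object.entries(prov)) {
      const existing = merged[key];
      merged[key] =
        isPlainObject(existing) && isPlainObject(value)
          ? { ...existing, ...value } // keep earlier nested fields, override on conflict
          : value;                    // primitives, arrays, etc.: later value wins
    }
  }
  return merged as Provenance;
}

function isPlainObject(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}
```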
packages/ai/src/source/Document.ts (Outdated)
| provenance:
|   "embeddingModel" in provenance && typeof provenance.embeddingModel === "string"
|     ? (provenance as VariantProvenance)
|     : ({
|         embeddingModel: (provenance as Provenance).embeddingModel || "",
|         chunkerStrategy: (provenance as Provenance).chunkerStrategy || "",
|         maxTokens: (provenance as Provenance).maxTokens || 512,
|         overlap: (provenance as Provenance).overlap || 0,
|         summaryModel: (provenance as Provenance).summaryModel as string | undefined,
|         nerModel: (provenance as Provenance).nerModel as string | undefined,
|       } as VariantProvenance),
Copilot AI · Jan 3, 2026
The type assertion to access provenance fields is fragile and bypasses type safety. The fallback object construction assumes certain fields exist in the Provenance type without proper type guards. Consider using proper type narrowing or creating a helper function that safely extracts these fields with proper error handling for missing required fields.
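A sketch of the kind of guarded extraction the comment describes. The follow-up commit mentioned later in this thread introduces an extractConfigFields helper; the version below is an assumed shape, with stand-in type aliases, not the actual code.

```ts
// Stand-ins for the project's types from DocumentSchema.ts:
type Provenance = Record<string, unknown>;
type VariantProvenance = {
  embeddingModel: string;
  chunkerStrategy: string;
  maxTokens: number;
  overlap: number;
  summaryModel?: string;
  nerModel?: string;
};

/** Type guard: does this provenance already carry the full variant config? */
function isVariantProvenance(p: Provenance): p is VariantProvenance {
  return (
    typeof p.embeddingModel === "string" &&
    typeof p.chunkerStrategy === "string" &&
    typeof p.maxTokens === "number"
  );
}

/** Safely extract variant config fields, with explicit defaults for required ones. */
function extractConfigFields(p: Provenance): VariantProvenance {
  if (isVariantProvenance(p)) return p;
  return {
    embeddingModel: typeof p.embeddingModel === "string" ? p.embeddingModel : "",
    chunkerStrategy: typeof p.chunkerStrategy === "string" ? p.chunkerStrategy : "",
    maxTokens: typeof p.maxTokens === "number" ? p.maxTokens : 512,
    overlap: typeof p.overlap === "number" ? p.overlap : 0,
    summaryModel: typeof p.summaryModel === "string" ? p.summaryModel : undefined,
    nerModel: typeof p.nerModel === "string" ? p.nerModel : undefined,
  };
}
```

With a helper along these lines, the ternary above could collapse to `provenance: extractConfigFields(provenance)`.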
@copilot open a new pull request to apply changes based on this feedback
| }
| }
| // Default for required field
| return 512; // Default maxTokens
Copilot AI · Jan 3, 2026
The magic number 512 is used as a default for maxTokens without explanation. Consider extracting this to a named constant (e.g., DEFAULT_MAX_TOKENS) at the module level to improve maintainability and make the default value explicit.
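A minimal sketch of the suggested extraction; the constant name matches the comment (and the follow-up commit noted at the end of this thread), while its exact placement is assumed:

```ts
// Module-level default so the fallback is named rather than a bare literal.
// 512 mirrors the value currently hard-coded in the return statement above.
export const DEFAULT_MAX_TOKENS = 512;

// ...and at the call site, in place of the literal:
// return DEFAULT_MAX_TOKENS;
```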
@copilot open a new pull request to apply changes based on this feedback
|  * Approximate token counting (v1)
|  */
| export function estimateTokens(text: string): number {
Copilot AI · Jan 3, 2026
The token estimation formula Math.ceil(text.length / 4) is a rough approximation that may not be accurate for many tokenization schemes. The actual token count can vary significantly based on the language, character set, and tokenizer used. Consider adding a comment noting this is an approximation and may need adjustment, or using an actual tokenizer library for more accurate results.
Suggested change (replacement for the lines quoted above):
|  * Approximate token counting (v1).
|  *
|  * NOTE: This is a rough heuristic based on character length and does not reflect
|  * any specific model's tokenizer. It may significantly under- or over-estimate
|  * the true token count depending on language, encoding, and tokenizer behavior.
|  * For use cases that require accurate token accounting, replace this with an
|  * implementation that uses the appropriate tokenizer library.
|  */
| export function estimateTokens(text: string): number {
|   // Rough approximation: average ~4 characters per token for many English texts.
|   // Adjust or replace as needed for your target models and languages.
@copilot Add a TODO to the comment so I can come back to this later.
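For reference, a sketch of the helper with the suggested note and the requested TODO folded in; the body uses the Math.ceil(text.length / 4) heuristic quoted in the review comment, and everything else is assumed:

```ts
/**
 * Approximate token counting (v1).
 *
 * NOTE: This is a rough heuristic based on character length and does not reflect
 * any specific model's tokenizer. It may significantly under- or over-estimate
 * the true token count depending on language, encoding, and tokenizer behavior.
 *
 * TODO: Revisit this and replace it with a real tokenizer (or per-model ratios)
 * when accurate token accounting is needed.
 */
export function estimateTokens(text: string): number {
  // Rough approximation: average ~4 characters per token for many English texts.
  return Math.ceil(text.length / 4);
}
```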
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Initial plan
* Improve markdown auto-detection with robust pattern matching
Co-authored-by: sroussey <127349+sroussey@users.noreply.github.com>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: sroussey <127349+sroussey@users.noreply.github.com>
…ument.addVariant (#158)
* Initial plan
* Use extractConfigFields for type-safe provenance handling
Co-authored-by: sroussey <127349+sroussey@users.noreply.github.com>
* Add comprehensive tests for type-safe provenance handling
Co-authored-by: sroussey <127349+sroussey@users.noreply.github.com>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: sroussey <127349+sroussey@users.noreply.github.com>
* Initial plan
* Extract magic number 512 to DEFAULT_MAX_TOKENS constant
Co-authored-by: sroussey <127349+sroussey@users.noreply.github.com>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: sroussey <127349+sroussey@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
No description provided.