Fix code chunking to preserve semantic units and prevent line-by-line fragmentation #21

Copilot · 2025-08-19T01:19:28Z

Problem

The chunking service was splitting code into individual lines instead of meaningful semantic units, resulting in:

Broken syntax: Individual lines like "public static void main(String[] args) {" emitted as separate chunks
Loss of context: Method logic scattered across multiple fragments
Poor search quality: Users receiving incomplete, non-executable code snippets
Degraded RAG performance: Vector embeddings representing fragmented code concepts

Example of problematic chunking:

Chunk 1: "public static void main(String[] args) {"
Chunk 2: "int sum = numbers.stream()"
Chunk 3: ".mapToInt(Integer::intValue)"
Chunk 4: ".sum();"
Chunk 5: "System.out.println("Sum: " + sum);"
Chunk 6: "}"
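To make the failure mode concrete, here is a minimal sketch of a splitter that behaves like the output above. The class name NaiveLineChunker is hypothetical and this is not the project's original code; it only assumes that each non-blank line was emitted as its own chunk.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the problematic behavior; not the project's code.
final class NaiveLineChunker {

    /** Emits every non-blank line (trimmed) as its own chunk. */
    static List<String> chunk(String text) {
        List<String> chunks = new ArrayList<>();
        // "\\R" matches any line break sequence (\n, \r\n, ...).
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                chunks.add(trimmed);
            }
        }
        return chunks;
    }
}
```

Fed the six-line main method from the example, a splitter of this shape returns six chunks, none of which is syntactically complete on its own.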

Solution

Enhanced the ChunkingService with intelligent code detection and semantic-aware chunking:

1. Smart Code Detection

Recognizes code by file extensions (.java, .py, .js, .ts, .cpp, etc.)
Analyzes content patterns (function declarations, class definitions, control structures)
Falls back to programming-character density analysis when no other signal matches
Supports explicit language metadata from parsing
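The detection steps above could be sketched roughly as follows. This is not the actual ChunkingService code: the class name CodeDetectorSketch, the method looksLikeCode, the regular expression, the character set, and the 5% density threshold are all assumptions chosen for illustration.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch of layered code detection; names and thresholds are assumptions.
final class CodeDetectorSketch {

    // Extensions named in the PR description (the real list is longer: "etc.").
    private static final Set<String> CODE_EXTENSIONS =
            Set.of(".java", ".py", ".js", ".ts", ".cpp");

    // Assumed patterns: class/function declarations and control structures.
    private static final Pattern CODE_PATTERN = Pattern.compile(
            "\\b(class|interface)\\s+\\w+[^\\n]*\\{"
            + "|\\b(def|function)\\s+\\w+\\s*\\("
            + "|\\b(public|private|protected)\\s+(static\\s+)?[\\w<>\\[\\]]+\\s+\\w+\\s*\\("
            + "|\\b(if|for|while)\\s*\\(");

    // Assumed set of "programming characters" and density cutoff.
    private static final String CODE_CHARS = "{}();=<>[]";
    private static final double DENSITY_THRESHOLD = 0.05;

    static boolean looksLikeCode(String fileName, String language, String content) {
        // 1. Explicit language metadata from parsing wins.
        if (language != null && !language.isBlank()) {
            return true;
        }
        // 2. File extension.
        if (fileName != null) {
            String lower = fileName.toLowerCase(Locale.ROOT);
            for (String ext : CODE_EXTENSIONS) {
                if (lower.endsWith(ext)) {
                    return true;
                }
            }
        }
        if (content == null || content.isBlank()) {
            return false;
        }
        // 3. Content patterns.
        if (CODE_PATTERN.matcher(content).find()) {
            return true;
        }
        // 4. Fallback: share of characters that are typical of source code.
        return density(content) >= DENSITY_THRESHOLD;
    }

    static double density(String content) {
        long hits = content.chars().filter(c -> CODE_CHARS.indexOf(c) >= 0).count();
        return (double) hits / content.length();
    }
}
```

Ordering the checks from cheapest and most reliable (metadata, extension) to most heuristic (density) keeps false positives on prose low, though any fixed threshold will misclassify some inputs.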

2. Semantic-Aware Chunking

Preserves complete methods and classes when possible
Splits by logical boundaries (method/class endings) instead of arbitrary word boundaries
Uses a larger maximum chunk size (2000 chars) for code to maintain context
Enforces minimum chunk sizes (100 chars) to prevent tiny fragments
Maintains syntactic completeness where feasible
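One way to realize these rules is to track brace depth and cut only where a top-level block closes. The sketch below is an assumption about the approach, not the PR's implementation: BraceAwareChunkerSketch is a hypothetical name, only the 2000 and 100 character limits come from the description, and the brace counting is deliberately naive (it does not skip braces inside strings or comments, and it only suits brace-delimited languages).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of boundary-aware chunking; not the project's actual code.
final class BraceAwareChunkerSketch {

    static final int MAX_CHUNK_CHARS = 2000; // limit stated in the PR description
    static final int MIN_CHUNK_CHARS = 100;  // limit stated in the PR description

    static List<String> chunk(String code) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0;

        for (String line : code.split("\\R")) {
            // Forced split at a line boundary if this line would exceed the maximum.
            // (A single line longer than the maximum still becomes one oversized chunk.)
            if (current.length() > 0
                    && current.length() + 1 + line.length() > MAX_CHUNK_CHARS) {
                chunks.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append('\n');
            }
            current.append(line);

            // Naive brace tracking: strings and comments are not excluded.
            boolean closedBlock = false;
            for (int i = 0; i < line.length(); i++) {
                char c = line.charAt(i);
                if (c == '{') {
                    depth++;
                } else if (c == '}') {
                    depth = Math.max(0, depth - 1);
                    closedBlock = true;
                }
            }

            // Logical boundary: a top-level block just ended and the chunk is big enough.
            if (closedBlock && depth == 0 && current.length() >= MIN_CHUNK_CHARS) {
                chunks.add(current.toString());
                current.setLength(0);
            }
        }

        // Flush the remainder, merging a tiny tail into the previous chunk when it fits.
        if (current.length() > 0) {
            String tail = current.toString();
            int last = chunks.size() - 1;
            if (tail.length() < MIN_CHUNK_CHARS && last >= 0
                    && chunks.get(last).length() + 1 + tail.length() <= MAX_CHUNK_CHARS) {
                chunks.set(last, chunks.get(last) + "\n" + tail);
            } else {
                chunks.add(tail);
            }
        }
        return chunks;
    }
}
```

With this shape, short adjacent methods accumulate into one chunk until the minimum is met, and a method larger than the maximum is split at line boundaries rather than mid-statement. A production version would more likely use a real parser to find method and class boundaries.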

3. Improved Results

Before:

6 fragments with broken syntax

After:

Complete method preserved in single chunk:
"public static void main(String[] args) {
    int sum = numbers.stream()
        .mapToInt(Integer::intValue)
        .sum();
    System.out.println("Sum: " + sum);
}"

Key Benefits

🔍 Better Search Quality: Vector embeddings now represent complete code concepts
👨‍💻 Improved UX: Developers receive executable, complete code examples
🏗️ Semantic Preservation: Code chunks maintain logical structure and context
🔄 Backward Compatible: Non-code content continues using original chunking algorithm

Testing

Added comprehensive test suite covering:

Line-by-line code element handling
Large code method boundary splitting
Mixed code and documentation content
Pattern-based code detection without explicit metadata
Regression tests ensuring non-code content remains unchanged

Fixes #20.




Initial plan

b9f8383

Copilot AI assigned Copilot and Yoo-SH Aug 19, 2025

Copilot started work on behalf of Yoo-SH August 19, 2025 01:19

Copilot AI and others added 3 commits August 19, 2025 01:24

Initial Java version fix for build compatibility

06ca037

Co-authored-by: Yoo-SH <132035249+Yoo-SH@users.noreply.github.com>

Implement code-aware chunking to preserve semantic units and fix line…

738ef24

…-by-line splitting Co-authored-by: Yoo-SH <132035249+Yoo-SH@users.noreply.github.com>

Add demonstration test and finalize code chunking enhancement

e855455

Co-authored-by: Yoo-SH <132035249+Yoo-SH@users.noreply.github.com>

Copilot AI changed the title from "[WIP] Code chunking breaks line-by-line causing semantic loss and poor search quality" to "Fix code chunking to preserve semantic units and prevent line-by-line fragmentation" on Aug 19, 2025

Copilot AI requested a review from Yoo-SH August 19, 2025 01:36

Copilot finished work on behalf of Yoo-SH August 19, 2025 01:36
