Copilot AI commented Aug 19, 2025

Problem

The chunking service was splitting code into individual lines instead of meaningful semantic units, which resulted in:

  • Broken syntax: Individual lines like "public static void main(String[] args) {" as separate chunks
  • Loss of context: Method logic scattered across multiple fragments
  • Poor search quality: Users getting incomplete, non-executable code snippets
  • Degraded RAG performance: Vector embeddings representing fragmented code concepts

Example of problematic chunking:

Chunk 1: "public static void main(String[] args) {"
Chunk 2: "int sum = numbers.stream()"
Chunk 3: ".mapToInt(Integer::intValue)"
Chunk 4: ".sum();"
Chunk 5: "System.out.println("Sum: " + sum);"
Chunk 6: "}"

Solution

Enhanced the ChunkingService with intelligent code detection and semantic-aware chunking:

1. Smart Code Detection

  • Recognizes code by file extensions (.java, .py, .js, .ts, .cpp, etc.)
  • Analyzes content patterns (function declarations, class definitions, control structures)
  • Falls back to programming-character density analysis when neither signal above is conclusive
  • Supports explicit language metadata from parsing
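The detection steps above could be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: the class name CodeDetector, the method looksLikeCode, the extension list, the regex patterns, the character set, and the 0.10 density threshold are all assumptions, as is the order in which the signals are checked.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the ChunkingService code from this PR.
class CodeDetector {
    // Assumed subset of the extensions that count as code.
    private static final Set<String> CODE_EXTENSIONS = Set.of("java", "py", "js", "ts", "cpp");

    // Assumed content patterns: class/interface declarations, def/function
    // declarations, and "name(...) {" blocks (methods and control structures).
    private static final Pattern CODE_PATTERN = Pattern.compile(
            "\\b(?:class|interface)\\s+\\w+[^\\n{]*\\{"
            + "|\\b(?:def|function)\\s+\\w+\\s*\\("
            + "|\\w+\\s*\\([^)]*\\)\\s*\\{");

    // Assumed set of "programming characters" and density threshold.
    private static final String CODE_CHARS = "{}()[];=<>";
    private static final double DENSITY_THRESHOLD = 0.10;

    static boolean looksLikeCode(String fileName, String languageHint, String content) {
        // 1. Explicit language metadata (assumed here: any non-blank hint means code).
        if (languageHint != null && !languageHint.isBlank()) {
            return true;
        }
        // 2. File extension.
        if (fileName != null) {
            int dot = fileName.lastIndexOf('.');
            if (dot >= 0) {
                String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
                if (CODE_EXTENSIONS.contains(ext)) {
                    return true;
                }
            }
        }
        if (content == null || content.isBlank()) {
            return false;
        }
        // 3. Content patterns.
        if (CODE_PATTERN.matcher(content).find()) {
            return true;
        }
        // 4. Fallback: share of programming characters in the text.
        long codeChars = content.chars().filter(c -> CODE_CHARS.indexOf(c) >= 0).count();
        return (double) codeChars / content.length() >= DENSITY_THRESHOLD;
    }
}
```

With these assumed values, prose containing many parentheses could be misclassified as code, so a real threshold would need tuning against representative documents.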

2. Semantic-Aware Chunking

  • Preserves complete methods and classes when possible
  • Splits by logical boundaries (method/class endings) instead of arbitrary word boundaries
  • Uses larger max chunk size (2000 chars) for code to maintain context
  • Enforces minimum chunk sizes (100 chars) to prevent tiny fragments
  • Maintains syntactic completeness where feasible
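One way to approximate the boundary-based splitting described above is to track brace depth and cut only when the depth returns to zero. The sketch below is a simplified illustration under that assumption, not the implementation in this PR: the class name CodeChunker and its chunk methods are hypothetical, and only the 2000 and 100 character limits come from the description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch for illustration; not the ChunkingService code from this PR.
class CodeChunker {
    static final int MAX_CHARS = 2000; // max chunk size for code, per the description
    static final int MIN_CHARS = 100;  // min chunk size, per the description

    static List<String> chunk(String code) {
        return chunk(code, MAX_CHARS, MIN_CHARS);
    }

    static List<String> chunk(String code, int maxChars, int minChars) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0; // naive brace depth; ignores braces inside strings and comments

        for (String line : code.split("\n")) {
            // Flush first if adding this line would push the chunk past the cap.
            // A single line longer than maxChars is still kept whole.
            if (current.length() > 0 && current.length() + 1 + line.length() > maxChars) {
                chunks.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append('\n');
            }
            current.append(line);

            for (char c : line.toCharArray()) {
                if (c == '{') {
                    depth++;
                } else if (c == '}') {
                    depth = Math.max(0, depth - 1);
                }
            }

            // Back at top level: emit the chunk once it is large enough,
            // otherwise keep accumulating so tiny fragments are merged.
            if (depth == 0 && current.length() >= minChars) {
                chunks.add(current.toString());
                current.setLength(0);
            }
        }
        // Whatever remains becomes the last chunk; it may be shorter than minChars.
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }
}
```

Brace counting only suits brace-delimited languages such as Java, JavaScript, or C++. Indentation-based languages like Python would need a different boundary signal, and a production version would also have to skip braces that appear in string literals and comments.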

3. Improved Results

Before:

6 fragments with broken syntax

After:

Complete method preserved in single chunk:
"public static void main(String[] args) {
    int sum = numbers.stream()
        .mapToInt(Integer::intValue)
        .sum();
    System.out.println("Sum: " + sum);
}"

Key Benefits

  • 🔍 Better Search Quality: Vector embeddings now represent complete code concepts
  • 👨‍💻 Improved UX: Developers receive complete, syntactically intact code examples
  • 🏗️ Semantic Preservation: Code chunks maintain logical structure and context
  • 🔄 Backward Compatible: Non-code content continues using original chunking algorithm

Testing

Added a test suite covering:

  • Line-by-line code element handling
  • Large code method boundary splitting
  • Mixed code and documentation content
  • Pattern-based code detection without explicit metadata
  • Regression tests ensuring non-code content remains unchanged

Fixes #20.


Copilot AI and others added 3 commits August 19, 2025 01:24
Co-authored-by: Yoo-SH <132035249+Yoo-SH@users.noreply.github.com>
…-by-line splitting

Co-authored-by: Yoo-SH <132035249+Yoo-SH@users.noreply.github.com>
Co-authored-by: Yoo-SH <132035249+Yoo-SH@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Code chunking breaks line-by-line causing semantic loss and poor search quality' labels: bug assignees:" to "Fix code chunking to preserve semantic units and prevent line-by-line fragmentation" Aug 19, 2025
Copilot AI requested a review from Yoo-SH August 19, 2025 01:36
Development

Successfully merging this pull request may close these issues.

Code chunking breaks line-by-line causing semantic loss and poor search quality' labels: bug assignees:
