Skip to content

Conversation

@cardmagic
Copy link
Owner

Summary

Replace the rb-gsl dependency with a self-contained C extension that implements Vector, Matrix, and Jacobi SVD operations. This eliminates the need for users to install external libraries (like libgsl) while providing significant performance improvements.

  • Add native C extension (~850 lines) with Vector, Matrix, and SVD implementations
  • Port existing Ruby Jacobi SVD algorithm to C for consistent results
  • Auto-detect backend: native extension → pure Ruby fallback
  • Remove all GSL-related code and dependencies
  • Update benchmarks and documentation

Performance

Documents build_index Overall Speedup
5 7x 2.6x
10 25x 4.6x
15 112x 14.5x
20 385x 48.7x
Detailed benchmark (20 documents)
Operation            Pure Ruby     Native C      Speedup
----------------------------------------------------------
build_index            0.5540       0.0014       384.5x
classify               0.0190       0.0060         3.2x
search                 0.0145       0.0037         3.9x
find_related           0.0098       0.0011         8.6x
----------------------------------------------------------
TOTAL                  0.5973       0.0123        48.7x

Test Plan

  • All 113 tests pass with native C extension
  • All 91 tests pass with pure Ruby fallback (NATIVE_VECTOR=true)
  • Benchmark comparison shows expected speedups
  • No compiler warnings in C code
  • No Ruby warnings during test runs

Files Changed

New:

  • ext/classifier/ - C extension source files
  • test/linalg/native_ext_test.rb - Unit tests for native extension

Modified:

  • classifier.gemspec - Add extension configuration
  • lib/classifier/lsi.rb - Backend detection logic
  • README.md / CLAUDE.md - Documentation updates

Removed:

  • lib/classifier/extensions/vector_serialize.rb - GSL serialization (no longer needed)

Closes #87

Replace the rb-gsl dependency with a self-contained C extension that
implements Vector, Matrix, and Jacobi SVD operations. This eliminates
the need for users to install external libraries while providing
significant performance improvements.

The native extension provides 5-50x speedup over pure Ruby, with the
SVD-heavy build_index operation showing up to 384x improvement on
larger document sets. The implementation ports the existing Ruby
Jacobi SVD algorithm to C, ensuring consistent results.

Key changes:
- Add ext/classifier/ with ~850 lines of C code
- Implement Classifier::Linalg::Vector and Matrix classes
- Port Jacobi SVD from Ruby to C
- Auto-detect backend: native extension > pure Ruby fallback
- Remove GSL-related code and dependencies
- Update benchmarks to compare native C vs pure Ruby

Closes #87
Apply RuboCop autocorrections and add necessary inline disables:
- Use %i symbol array syntax in Rakefile
- Add empty lines before assertion methods per Minitest style
- Convert float assert_equal to assert_in_delta for precision
- Disable Style/GlobalVars for $CFLAGS (required for mkmf)
- Disable Style/MapIntoArray in test intentionally testing each
@cardmagic cardmagic requested a review from Copilot December 28, 2025 15:50
@cardmagic cardmagic self-assigned this Dec 28, 2025
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR replaces the optional rb-gsl dependency with a self-contained native C extension that implements Vector, Matrix, and Jacobi SVD operations for LSI acceleration. The implementation provides significant performance improvements (5-50x speedup) while eliminating the need for users to install external libraries like libgsl.

Key Changes:

  • Added ~850 lines of C code implementing linear algebra operations with automatic fallback to pure Ruby
  • Replaced GSL-specific code with backend detection system supporting both native and pure Ruby implementations
  • Updated all documentation, benchmarks, and tests to reflect the new architecture

Reviewed changes

Copilot reviewed 17 out of 19 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
ext/classifier/classifier_ext.c Main entry point for native C extension
ext/classifier/linalg.h Header file defining structures and function declarations
ext/classifier/vector.c Vector implementation with operations like normalize, dot product, sum
ext/classifier/matrix.c Matrix implementation with operations like transpose, multiply, column access
ext/classifier/svd.c Jacobi SVD decomposition algorithm ported from Ruby
ext/classifier/extconf.rb Build configuration for native extension
lib/classifier/lsi.rb Backend detection logic replacing GSL checks
lib/classifier/lsi/content_node.rb Updated to use backend-agnostic vector/matrix classes
lib/classifier/extensions/vector.rb Added normalize method override for Ruby stdlib Vector
lib/classifier/extensions/vector_serialize.rb Removed GSL serialization code (no longer needed)
test/linalg/native_ext_test.rb Comprehensive unit tests for native extension
test/lsi/lsi_test.rb Relaxed test assertions to handle minor numerical differences between backends
classifier.gemspec Added extension configuration and rake-compiler dependency
Rakefile Added compile task for building native extension
benchmark/lsi_benchmark.rb Updated benchmark script to compare native C vs pure Ruby
README.md Updated installation instructions and performance benchmarks
CLAUDE.md Updated documentation to reflect new native extension architecture

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


parts.each do |part|
assert all_texts.any? { |t| t.include?(part.gsub('This text ', '').split.first) },
"Summary part '#{part}' should be from test texts"
Copy link

Copilot AI Dec 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected spelling of 'recieve' to 'receive'.

Copilot uses AI. Check for mistakes.
double sum_diff = 0.0;
for (size_t i = 0; i < size; i++) {
double diff = fabs(MAT_AT(q, i, i) - MAT_AT(prev_q, i, i));
if (diff > 0.001) {
Copy link

Copilot AI Dec 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The convergence threshold value 0.001 is duplicated here and at line 12 as SVD_CONVERGENCE_THRESHOLD. Use the constant instead of the magic number for consistency.

Suggested change
if (diff > 0.001) {
if (diff > SVD_CONVERGENCE_THRESHOLD) {

Copilot uses AI. Check for mistakes.
# Cache magnitude since Vector is immutable after creation
# Note: We intentionally override the matrix gem's normalize method
# to provide a more robust implementation that handles zero vectors
undef_method :normalize if method_defined?(:normalize)
Copy link

Copilot AI Dec 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment states "We intentionally override the matrix gem's normalize method" but the code uses undef_method which removes the method entirely before redefining it. Consider clarifying that this undefines the existing method first to avoid conflicts, not overrides it directly.

Copilot uses AI. Check for mistakes.
- Use SVD_CONVERGENCE_THRESHOLD constant instead of magic number
- Clarify comment about undef_method vs override behavior
@cardmagic cardmagic merged commit acbc053 into master Dec 28, 2025
5 checks passed
@cardmagic cardmagic deleted the native-linalg-extension branch December 28, 2025 15:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Explore minimal GSL C extension to replace rb-gsl dependency

2 participants