Add zero-dependency native C extension for LSI acceleration #89
Replace the rb-gsl dependency with a self-contained C extension that implements Vector, Matrix, and Jacobi SVD operations. This eliminates the need for users to install external libraries while providing significant performance improvements.
The native extension provides 5-50x speedup over pure Ruby, with the SVD-heavy build_index operation showing up to 384x improvement on larger document sets. The implementation ports the existing Ruby Jacobi SVD algorithm to C, ensuring consistent results.
Key changes:
- Add ext/classifier/ with ~850 lines of C code
- Implement Classifier::Linalg::Vector and Matrix classes
- Port Jacobi SVD from Ruby to C
- Auto-detect backend: native extension > pure Ruby fallback
- Remove GSL-related code and dependencies
- Update benchmarks to compare native C vs pure Ruby
Closes #87
Apply RuboCop autocorrections and add necessary inline disables:
- Use %i symbol array syntax in Rakefile
- Add empty lines before assertion methods per Minitest style
- Convert float assert_equal to assert_in_delta for precision
- Disable Style/GlobalVars for $CFLAGS (required for mkmf)
- Disable Style/MapIntoArray in test intentionally testing each
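To illustrate why the commit swaps assert_equal for assert_in_delta on floats, here is a minimal sketch in plain Ruby. The helper `within_delta?` is hypothetical and only mirrors the kind of tolerance check that assert_in_delta performs; it is not code from this PR.

```ruby
# Hypothetical helper: true when two floats differ by at most `delta`.
# This mirrors the tolerance comparison behind Minitest's assert_in_delta.
def within_delta?(expected, actual, delta = 0.001)
  (expected - actual).abs <= delta
end

sum = 0.1 + 0.2

# Exact equality fails because of binary floating-point rounding.
exact_match = (sum == 0.3)

# A tolerance-based comparison accepts the tiny rounding error.
close_enough = within_delta?(0.3, sum, 1e-9)

puts "exact: #{exact_match}, within delta: #{close_enough}"
```

The same reasoning applies when two backends produce results that agree only up to rounding.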
Pull request overview
This PR replaces the optional rb-gsl dependency with a self-contained native C extension that implements Vector, Matrix, and Jacobi SVD operations for LSI acceleration. The implementation provides significant performance improvements (5-50x speedup) while eliminating the need for users to install external libraries like libgsl.
Key Changes:
- Added ~850 lines of C code implementing linear algebra operations with automatic fallback to pure Ruby
- Replaced GSL-specific code with backend detection system supporting both native and pure Ruby implementations
- Updated all documentation, benchmarks, and tests to reflect the new architecture
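The backend fallback described above could look roughly like the following sketch. It is an illustration only: the module name `BackendSketch`, the `detect` method, and the require path `classifier/classifier_ext` are assumptions, not the code in lib/classifier/lsi.rb.

```ruby
# Hypothetical sketch of native-vs-pure-Ruby backend detection.
module BackendSketch
  # Try to load the compiled extension; report :ruby when it is unavailable.
  # The default require path is an assumption for illustration.
  def self.detect(ext_name = 'classifier/classifier_ext')
    require ext_name
    :native
  rescue LoadError
    :ruby
  end
end

# A name that cannot be loaded exercises the fallback branch.
backend = BackendSketch.detect('no_such_native_extension_for_demo')
puts "selected backend: #{backend}"
```

Rescuing LoadError keeps installation working on systems where the extension failed to compile, at the cost of silently running the slower path.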
Reviewed changes
Copilot reviewed 17 out of 19 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| ext/classifier/classifier_ext.c | Main entry point for native C extension |
| ext/classifier/linalg.h | Header file defining structures and function declarations |
| ext/classifier/vector.c | Vector implementation with operations like normalize, dot product, sum |
| ext/classifier/matrix.c | Matrix implementation with operations like transpose, multiply, column access |
| ext/classifier/svd.c | Jacobi SVD decomposition algorithm ported from Ruby |
| ext/classifier/extconf.rb | Build configuration for native extension |
| lib/classifier/lsi.rb | Backend detection logic replacing GSL checks |
| lib/classifier/lsi/content_node.rb | Updated to use backend-agnostic vector/matrix classes |
| lib/classifier/extensions/vector.rb | Added normalize method override for Ruby stdlib Vector |
| lib/classifier/extensions/vector_serialize.rb | Removed GSL serialization code (no longer needed) |
| test/linalg/native_ext_test.rb | Comprehensive unit tests for native extension |
| test/lsi/lsi_test.rb | Relaxed test assertions to handle minor numerical differences between backends |
| classifier.gemspec | Added extension configuration and rake-compiler dependency |
| Rakefile | Added compile task for building native extension |
| benchmark/lsi_benchmark.rb | Updated benchmark script to compare native C vs pure Ruby |
| README.md | Updated installation instructions and performance benchmarks |
| CLAUDE.md | Updated documentation to reflect new native extension architecture |
parts.each do |part|
  assert all_texts.any? { |t| t.include?(part.gsub('This text ', '').split.first) },
         "Summary part '#{part}' should be from test texts"
Copilot
AI
Dec 28, 2025
Corrected spelling of 'recieve' to 'receive'.
ext/classifier/svd.c
Outdated
double sum_diff = 0.0;
for (size_t i = 0; i < size; i++) {
    double diff = fabs(MAT_AT(q, i, i) - MAT_AT(prev_q, i, i));
    if (diff > 0.001) {
Copilot
AI
Dec 28, 2025
The convergence threshold value 0.001 is duplicated here and at line 12 as SVD_CONVERGENCE_THRESHOLD. Use the constant instead of the magic number for consistency.
- if (diff > 0.001) {
+ if (diff > SVD_CONVERGENCE_THRESHOLD) {
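As a rough illustration of the suggestion, the same diagonal comparison can be sketched in Ruby, the language the algorithm was ported from. The method name `diagonal_change` and the assumption that the elided branch body accumulates `diff` into `sum_diff` are mine; only the constant name and the 0.001 value come from the review comment.

```ruby
# Single named threshold instead of a repeated magic number.
SVD_CONVERGENCE_THRESHOLD = 0.001

# Hypothetical helper: sums the diagonal changes between two iterations,
# counting only entries that moved by more than the threshold.
# (Assumption: the original branch body accumulates into sum_diff.)
def diagonal_change(q, prev_q)
  sum_diff = 0.0
  q.each_index do |i|
    diff = (q[i][i] - prev_q[i][i]).abs
    sum_diff += diff if diff > SVD_CONVERGENCE_THRESHOLD
  end
  sum_diff
end

prev_q = [[1.0, 0.0], [0.0, 2.0]]
q      = [[1.0005, 0.0], [0.0, 2.5]]

# The first diagonal entry moved by less than the threshold and is ignored.
change = diagonal_change(q, prev_q)
puts "accumulated change: #{change}"
```

With one constant, tightening or loosening the tolerance later means editing a single line.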
# Cache magnitude since Vector is immutable after creation
# Note: We intentionally override the matrix gem's normalize method
# to provide a more robust implementation that handles zero vectors
undef_method :normalize if method_defined?(:normalize)
Copilot
AI
Dec 28, 2025
The comment states "We intentionally override the matrix gem's normalize method" but the code uses undef_method which removes the method entirely before redefining it. Consider clarifying that this undefines the existing method first to avoid conflicts, not overrides it directly.
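The pattern under discussion can be sketched on a stand-in class. `DemoVector` is hypothetical and is not the stdlib Vector or the PR's code; in particular, returning an unchanged copy for a zero vector is an assumption, since the source only says the new implementation "handles zero vectors".

```ruby
# Hypothetical stand-in for a class that already defines normalize.
class DemoVector
  attr_reader :values

  def initialize(values)
    @values = values
  end

  # Existing implementation: fails on zero vectors.
  def normalize
    mag = Math.sqrt(@values.sum { |v| v * v })
    raise ZeroDivisionError, 'zero vector cannot be normalized' if mag.zero?

    DemoVector.new(@values.map { |v| v / mag })
  end
end

# Reopen the class: first undefine the existing method, then define a new one.
class DemoVector
  undef_method :normalize if method_defined?(:normalize)

  # Replacement implementation.
  # Assumption: a zero vector is returned as an unchanged copy.
  def normalize
    mag = Math.sqrt(@values.sum { |v| v * v })
    return DemoVector.new(@values.dup) if mag.zero?

    DemoVector.new(@values.map { |v| v / mag })
  end
end

unit = DemoVector.new([3.0, 4.0]).normalize
zero = DemoVector.new([0.0, 0.0]).normalize
puts unit.values.inspect
puts zero.values.inspect
```

Calling undef_method before the new definition also avoids Ruby's "method redefined" warning when warnings are enabled, which may be part of why the author chose it over a plain redefinition.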
- Use SVD_CONVERGENCE_THRESHOLD constant instead of magic number
- Clarify comment about undef_method vs override behavior
Summary
Replace the rb-gsl dependency with a self-contained C extension that implements Vector, Matrix, and Jacobi SVD operations. This eliminates the need for users to install external libraries (like libgsl) while providing significant performance improvements.
Performance
Detailed benchmark (20 documents)
Test Plan
NATIVE_VECTOR=true)
Files Changed
New:
- ext/classifier/ - C extension source files
- test/linalg/native_ext_test.rb - Unit tests for native extension
Modified:
- classifier.gemspec - Add extension configuration
- lib/classifier/lsi.rb - Backend detection logic
- README.md / CLAUDE.md - Documentation updates
Removed:
- lib/classifier/extensions/vector_serialize.rb - GSL serialization (no longer needed)
Closes #87