-
Notifications
You must be signed in to change notification settings - Fork 122
feat: add save/load methods for classifier persistence #85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds JSON-based serialization and file persistence capabilities to the Bayes and LSI classifiers, enabling models to be saved, loaded, and continued training after persistence.
Key changes:
- Implemented
as_json,to_json,from_json,save, andloadmethods for both Bayes and LSI classifiers - Used versioned JSON format for forward compatibility
- LSI serializes only source data (word_hash, categories) and rebuilds vectors on load for portability
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| test/test_helper.rb | Added tmpdir and json requires for persistence tests |
| test/lsi/lsi_test.rb | Added comprehensive tests for LSI serialization, round-trip persistence, and classification preservation |
| test/bayes/bayesian_test.rb | Added comprehensive tests for Bayes serialization, round-trip persistence, and classification preservation |
| lib/classifier/lsi.rb | Implemented serialization methods for LSI with source data preservation and index rebuilding |
| lib/classifier/bayes.rb | Implemented serialization methods for Bayes with full state restoration |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| { | ||
| version: 1, | ||
| type: 'bayes', | ||
| categories: @categories.transform_keys(&:to_s).transform_values { |v| v.transform_keys(&:to_s) }, | ||
| total_words: @total_words, | ||
| category_counts: @category_counts.transform_keys(&:to_s), | ||
| category_word_count: @category_word_count.transform_keys(&:to_s) |
Copilot
AI
Dec 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nested hash transformations create intermediate objects. Consider using each_with_object to build the hash in a single pass for better performance with large category sets.
| { | |
| version: 1, | |
| type: 'bayes', | |
| categories: @categories.transform_keys(&:to_s).transform_values { |v| v.transform_keys(&:to_s) }, | |
| total_words: @total_words, | |
| category_counts: @category_counts.transform_keys(&:to_s), | |
| category_word_count: @category_word_count.transform_keys(&:to_s) | |
| categories = @categories.each_with_object({}) do |(key, value), result| | |
| result[key.to_s] = value.each_with_object({}) do |(inner_key, inner_value), inner_result| | |
| inner_result[inner_key.to_s] = inner_value | |
| end | |
| end | |
| category_counts = @category_counts.each_with_object({}) do |(key, value), result| | |
| result[key.to_s] = value | |
| end | |
| category_word_count = @category_word_count.each_with_object({}) do |(key, value), result| | |
| result[key.to_s] = value | |
| end | |
| { | |
| version: 1, | |
| type: 'bayes', | |
| categories: categories, | |
| total_words: @total_words, | |
| category_counts: category_counts, | |
| category_word_count: category_word_count |
| items_data = @items.transform_values do |node| | ||
| { | ||
| word_hash: node.word_hash.transform_keys(&:to_s), |
Copilot
AI
Dec 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nested transformations with transform_values and transform_keys create multiple intermediate hashes. For large item sets, consider using each_with_object to build the structure in a single pass.
| items_data = @items.transform_values do |node| | |
| { | |
| word_hash: node.word_hash.transform_keys(&:to_s), | |
| items_data = @items.each_with_object({}) do |(item_key, node), acc| | |
| word_hash = {} | |
| node.word_hash.each do |word_key, value| | |
| word_hash[word_key.to_s] = value | |
| end | |
| acc[item_key] = { | |
| word_hash: word_hash, |
Provides a cleaner API than raw Marshal for persisting trained
classifiers. Users can now save training state and resume later:
classifier.save('model.json')
loaded = Classifier::Bayes.load('model.json')
loaded.train_spam('more data') # continue training
Both Bayes and LSI classifiers support:
- to_json / from_json for string serialization
- save(path) / load(path) for file operations
LSI serializes only source data (word_hash, categories), not computed
vectors. The index rebuilds on load, making JSON files portable across
GSL/non-GSL environments.
Closes #17
- Add as_json method that returns a Hash representation - Modify to_json to use as_json internally - Modify from_json to accept both String and Hash arguments This provides more flexibility for serialization workflows.
- Extract restore_state private method to reduce from_json AbcSize - Change as_json return type to untyped for Steep compatibility - Use assert_path_exists in tests per Minitest/AssertPathExists - Add JSON RBS vendor file for type checking - Regenerate RBS files
935833e to
39edd45
Compare
|
Regarding the Copilot suggestions to use The current implementation using The branch has been rebased onto master and now includes the thread-safety changes. Fixed an issue where |
Greptile SummaryAdded JSON-based persistence to both Key implementation details:
Testing:
Confidence Score: 5/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant User
participant Bayes/LSI
participant FileSystem
participant JSON
Note over User,JSON: Save Flow
User->>Bayes/LSI: save(path)
Bayes/LSI->>Bayes/LSI: to_json()
Bayes/LSI->>Bayes/LSI: as_json()
Note right of Bayes/LSI: Transform symbols to strings<br/>Serialize state data
Bayes/LSI->>JSON: to_json
JSON-->>Bayes/LSI: JSON string
Bayes/LSI->>FileSystem: File.write(path, json)
FileSystem-->>User: Success
Note over User,JSON: Load Flow
User->>Bayes/LSI: load(path)
Bayes/LSI->>FileSystem: File.read(path)
FileSystem-->>Bayes/LSI: JSON string
Bayes/LSI->>Bayes/LSI: from_json(json)
Bayes/LSI->>JSON: JSON.parse(json)
JSON-->>Bayes/LSI: Hash
Bayes/LSI->>Bayes/LSI: allocate new instance
alt Bayes
Bayes/LSI->>Bayes/LSI: restore_state(data)
Note right of Bayes/LSI: Initialize mutex<br/>Transform strings to symbols<br/>Restore all state variables
else LSI
Bayes/LSI->>Bayes/LSI: new(auto_rebuild: false)
Note right of Bayes/LSI: Create ContentNodes from data<br/>Increment version counter
Bayes/LSI->>Bayes/LSI: build_index()
Note right of Bayes/LSI: Rebuild SVD vectors<br/>from source data
end
Bayes/LSI-->>User: Loaded classifier
|
Summary
as_jsonmethod that returns a Hash representationto_jsonmethod that returns a JSON stringfrom_jsonclass method that accepts either a JSON string or Hashsave(path)andload(path)for file operationsUsage
Design Decisions
{"version": 1, "type": "bayes|lsi", ...}allows future format changesfrom_jsonhandles both for flexibilityTest plan
as_jsonreturning Hashfrom_jsonwith both String and Hashbundle exec rake testNATIVE_VECTOR=true bundle exec rake testCloses #17