@cardmagic cardmagic commented Dec 28, 2025

Summary

  • Add as_json method that returns a Hash representation
  • Add to_json method that returns a JSON string
  • Add from_json class method that accepts either a JSON string or Hash
  • Add save(path) and load(path) for file operations
  • Use versioned JSON format for future compatibility

Usage

# Train and save to file
classifier = Classifier::Bayes.new('Spam', 'Ham')
classifier.train_spam('buy now cheap')
classifier.save('model.json')

# Load from file and continue training
loaded = Classifier::Bayes.load('model.json')
loaded.train_ham('meeting tomorrow')
loaded.classify('special offer')  # => "Spam"

# Get hash representation
hash = classifier.as_json
# => { version: 1, type: 'bayes', categories: {...}, ... }

# Get JSON string
json_string = classifier.to_json
# => '{"version":1,"type":"bayes",...}'

# Load from JSON string
loaded = Classifier::Bayes.from_json(json_string)

# Load from hash (useful when JSON is already parsed)
loaded = Classifier::Bayes.from_json(hash)

Design Decisions

  • JSON over Marshal: Human-readable, portable, version-safe
  • LSI rebuilds on load: Only source data serialized, not vectors. Makes JSON portable across GSL/non-GSL environments
  • Versioned format: {"version": 1, "type": "bayes|lsi", ...} allows future format changes
  • Accepts both String and Hash: from_json handles both for flexibility
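
The versioned format and the String-or-Hash handling described above can be sketched roughly as follows. This is an illustration only, not the gem's code: `TinyCounter`, its `counts` field, and the `'tiny'` type tag are hypothetical names, and raising on an unknown version is an assumption about how a loader might react.

```ruby
require 'json'

# Hypothetical stand-in for a classifier; not part of the classifier gem.
class TinyCounter
  FORMAT_VERSION = 1

  attr_reader :counts

  def initialize
    @counts = Hash.new(0)
  end

  def add(word)
    @counts[word.to_sym] += 1
  end

  # Hash representation with version metadata and string keys.
  def as_json(*)
    { version: FORMAT_VERSION, type: 'tiny', counts: @counts.transform_keys(&:to_s) }
  end

  def to_json(*args)
    as_json.to_json(*args)
  end

  # Accepts either a JSON string or an already-parsed Hash.
  def self.from_json(json)
    data = json.is_a?(String) ? JSON.parse(json) : json
    data = data.transform_keys(&:to_s) # tolerate symbol keys coming from as_json
    # Assumption: refuse versions this code does not know how to read.
    raise ArgumentError, "unsupported version: #{data['version'].inspect}" unless data['version'] == FORMAT_VERSION

    instance = new
    data['counts'].each { |word, n| instance.counts[word.to_sym] = n }
    instance
  end
end

original = TinyCounter.new
2.times { original.add('cheap') }

from_string = TinyCounter.from_json(original.to_json) # String input
from_hash   = TinyCounter.from_json(original.as_json) # Hash input
```

Keeping the version check in one place means a later format change only needs a new branch in `from_json`, while files written by older releases stay readable.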

Test plan

  • Round-trip tests for both Bayes and LSI
  • Tests for as_json returning Hash
  • Tests for from_json with both String and Hash
  • Verify classifications match after save/load
  • Test continued training on loaded classifiers
  • All tests pass with bundle exec rake test
  • All tests pass with NATIVE_VECTOR=true bundle exec rake test
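
A round-trip check of the kind listed above might look like the sketch below. It uses a hypothetical `WordTally` class rather than the real classifiers so that it runs with only the standard library. The `save`/`load` bodies are assumptions modelled on the description (write JSON to a path, read it back), and plain `raise` stands in for test-framework assertions.

```ruby
require 'json'
require 'tmpdir'

# Hypothetical model used only to illustrate the round-trip pattern.
class WordTally
  attr_reader :tally

  def initialize(tally = {})
    @tally = tally
  end

  def train(text)
    text.split.each { |word| @tally[word] = @tally.fetch(word, 0) + 1 }
  end

  def save(path)
    File.write(path, JSON.generate(version: 1, tally: @tally))
  end

  def self.load(path)
    data = JSON.parse(File.read(path))
    new(data.fetch('tally'))
  end
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'model.json')

  model = WordTally.new
  model.train('buy now cheap')
  model.save(path)
  raise 'file was not written' unless File.exist?(path)

  loaded = WordTally.load(path)
  raise 'state changed across save/load' unless loaded.tally == model.tally

  # Continued training on the loaded copy should work without the original.
  loaded.train('cheap meeting')
  raise 'continued training failed' unless loaded.tally['cheap'] == 2
end
```

Writing into a temporary directory keeps the test from leaving `model.json` files behind in the working tree.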

Closes #17

@cardmagic cardmagic requested a review from Copilot December 28, 2025 15:50
@cardmagic cardmagic self-assigned this Dec 28, 2025

Copilot AI left a comment

Pull request overview

This PR adds JSON-based serialization and file persistence to the Bayes and LSI classifiers, so that models can be saved, loaded, and trained further after loading.

Key changes:

  • Implemented as_json, to_json, from_json, save, and load methods for both Bayes and LSI classifiers
  • Used versioned JSON format for forward compatibility
  • LSI serializes only source data (word_hash, categories) and rebuilds vectors on load for portability
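
The "serialize the source data, rebuild the rest" idea can be illustrated with a small sketch. `TinyIndex` and its `norms` field are hypothetical and far simpler than LSI's SVD vectors; the point is only that derived values are recomputed on load instead of being written to the file.

```ruby
require 'json'

# Hypothetical index: `docs` is the source data, `norms` is derived from it.
class TinyIndex
  attr_reader :docs, :norms

  def initialize
    @docs = {}  # document name => { word => count }
    @norms = {} # derived: document name => Euclidean norm of its counts
  end

  def add(name, text)
    @docs[name] = text.split.tally
  end

  # Recompute every derived value from the source data.
  def build_index
    @norms = @docs.transform_values do |counts|
      Math.sqrt(counts.values.sum { |c| c * c })
    end
  end

  # Only the source data is written; `norms` is deliberately left out.
  def to_json(*args)
    { version: 1, type: 'tiny_index', docs: @docs }.to_json(*args)
  end

  def self.from_json(json)
    data = JSON.parse(json)
    index = new
    data['docs'].each { |name, counts| index.docs[name] = counts }
    index.build_index # rebuild derived state instead of trusting stored values
    index
  end
end

index = TinyIndex.new
index.add('offer', 'cheap cheap deal')
index.build_index
restored = TinyIndex.from_json(index.to_json)
```

Because the file never contains backend-specific numbers, the same JSON can be loaded regardless of which numeric backend performs the rebuild.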

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
|------|-------------|
| test/test_helper.rb | Added tmpdir and json requires for persistence tests |
| test/lsi/lsi_test.rb | Added comprehensive tests for LSI serialization, round-trip persistence, and classification preservation |
| test/bayes/bayesian_test.rb | Added comprehensive tests for Bayes serialization, round-trip persistence, and classification preservation |
| lib/classifier/lsi.rb | Implemented serialization methods for LSI with source data preservation and index rebuilding |
| lib/classifier/bayes.rb | Implemented serialization methods for Bayes with full state restoration |

Comment on lines +114 to +132
{
  version: 1,
  type: 'bayes',
  categories: @categories.transform_keys(&:to_s).transform_values { |v| v.transform_keys(&:to_s) },
  total_words: @total_words,
  category_counts: @category_counts.transform_keys(&:to_s),
  category_word_count: @category_word_count.transform_keys(&:to_s)

Copilot AI Dec 28, 2025

Nested hash transformations create intermediate objects. Consider using each_with_object to build the hash in a single pass for better performance with large category sets.

Suggested change
- {
-   version: 1,
-   type: 'bayes',
-   categories: @categories.transform_keys(&:to_s).transform_values { |v| v.transform_keys(&:to_s) },
-   total_words: @total_words,
-   category_counts: @category_counts.transform_keys(&:to_s),
-   category_word_count: @category_word_count.transform_keys(&:to_s)
+ categories = @categories.each_with_object({}) do |(key, value), result|
+   result[key.to_s] = value.each_with_object({}) do |(inner_key, inner_value), inner_result|
+     inner_result[inner_key.to_s] = inner_value
+   end
+ end
+ category_counts = @category_counts.each_with_object({}) do |(key, value), result|
+   result[key.to_s] = value
+ end
+ category_word_count = @category_word_count.each_with_object({}) do |(key, value), result|
+   result[key.to_s] = value
+ end
+ {
+   version: 1,
+   type: 'bayes',
+   categories: categories,
+   total_words: @total_words,
+   category_counts: category_counts,
+   category_word_count: category_word_count

Comment on lines +364 to +377
items_data = @items.transform_values do |node|
  {
    word_hash: node.word_hash.transform_keys(&:to_s),

Copilot AI Dec 28, 2025

Nested transformations with transform_values and transform_keys create multiple intermediate hashes. For large item sets, consider using each_with_object to build the structure in a single pass.

Suggested change
- items_data = @items.transform_values do |node|
-   {
-     word_hash: node.word_hash.transform_keys(&:to_s),
+ items_data = @items.each_with_object({}) do |(item_key, node), acc|
+   word_hash = {}
+   node.word_hash.each do |word_key, value|
+     word_hash[word_key.to_s] = value
+   end
+   acc[item_key] = {
+     word_hash: word_hash,

Provides a cleaner API than raw Marshal for persisting trained
classifiers. Users can now save training state and resume later:

  classifier.save('model.json')
  loaded = Classifier::Bayes.load('model.json')
  loaded.train_spam('more data')  # continue training

Both Bayes and LSI classifiers support:
- to_json / from_json for string serialization
- save(path) / load(path) for file operations

LSI serializes only source data (word_hash, categories), not computed
vectors. The index rebuilds on load, making JSON files portable across
GSL/non-GSL environments.

Closes #17

- Add as_json method that returns a Hash representation
- Modify to_json to use as_json internally
- Modify from_json to accept both String and Hash arguments

This provides more flexibility for serialization workflows.

- Extract restore_state private method to reduce from_json AbcSize
- Change as_json return type to untyped for Steep compatibility
- Use assert_path_exists in tests per Minitest/AssertPathExists
- Add JSON RBS vendor file for type checking
- Regenerate RBS files
@cardmagic cardmagic force-pushed the save-classifier-state branch from 935833e to 39edd45 Compare December 28, 2025 16:13
@cardmagic
Owner Author

Regarding the Copilot suggestions to use each_with_object instead of transform_keys/transform_values:

The current implementation using transform_keys and transform_values is more readable and idiomatic Ruby. The performance difference would be negligible for typical classifier use cases (serializing a classifier with thousands of words still takes milliseconds). This is a premature optimization.
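
Anyone who wants to check the performance question on their own data could use a sketch like the one below. The `StringifyKeys` module and the synthetic data sizes are hypothetical, and timings will vary by machine and Ruby version, so no expected numbers are given. The sketch only shows that both approaches produce the same result and how they could be timed with the standard `benchmark` library.

```ruby
require 'benchmark'

# Hypothetical helper module wrapping the two approaches under discussion.
module StringifyKeys
  def self.with_transform(categories)
    categories.transform_keys(&:to_s).transform_values { |v| v.transform_keys(&:to_s) }
  end

  def self.with_each_with_object(categories)
    categories.each_with_object({}) do |(key, value), result|
      result[key.to_s] = value.each_with_object({}) do |(inner_key, inner_value), inner|
        inner[inner_key.to_s] = inner_value
      end
    end
  end
end

# Synthetic data (assumption): 5 categories of 2,000 symbol-keyed words each.
data = (1..5).to_h do |c|
  [:"category#{c}", (1..2_000).to_h { |w| [:"word#{w}", w] }]
end

Benchmark.bm(18) do |x|
  x.report('transform_*')      { 100.times { StringifyKeys.with_transform(data) } }
  x.report('each_with_object') { 100.times { StringifyKeys.with_each_with_object(data) } }
end
```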

The branch has been rebased onto master and now includes the thread-safety changes. Fixed an issue where from_json wasn't initializing the mutex (added mu_initialize to restore_state).
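
The mutex issue is easy to reproduce in miniature. The sketch below is hypothetical: `SafeTally` is not the gem's class, and it uses a plain `Mutex` instead of `mu_initialize`. It assumes, as the fix implies, that `from_json` builds the object without running `initialize`, so the lock has to be created in `restore_state`.

```ruby
require 'json'

# Hypothetical class; a plain Mutex stands in for the gem's mutex setup.
class SafeTally
  def initialize
    @mutex = Mutex.new
    @counts = Hash.new(0)
  end

  def add(word)
    @mutex.synchronize { @counts[word] += 1 }
  end

  def count(word)
    @mutex.synchronize { @counts[word] }
  end

  def to_json(*args)
    { version: 1, counts: @counts }.to_json(*args)
  end

  def self.from_json(json)
    instance = allocate # skips initialize, so @mutex is not set yet
    instance.send(:restore_state, JSON.parse(json))
    instance
  end

  private

  def restore_state(data)
    @mutex = Mutex.new # without this line, add/count would call synchronize on nil
    @counts = Hash.new(0)
    @counts.update(data['counts'])
  end
end

tally = SafeTally.new
tally.add('spam')
restored = SafeTally.from_json(tally.to_json)
restored.add('spam') # works only because restore_state recreated the mutex
```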

@cardmagic
Owner Author

@greptile

@greptile-apps
Contributor

greptile-apps bot commented Dec 28, 2025

Greptile Summary

Added JSON-based persistence to both Classifier::Bayes and Classifier::LSI with save/load file operations and to_json/from_json serialization methods. Uses versioned format for future compatibility.

Key implementation details:

  • as_json returns Hash representation with version metadata
  • to_json returns JSON string via as_json
  • from_json accepts both JSON strings and pre-parsed hashes
  • save(path) writes JSON to file, load(path) reads and deserializes
  • Bayes: serializes all training data including categories, word counts, and totals
  • LSI: only serializes source data (word_hash, categories), not computed SVD vectors - index rebuilds automatically on load for portability across native/Ruby backends
  • Both properly reinitialize mutex state on deserialization

Testing:

  • Round-trip serialization verified for both classifiers
  • Classifications match after save/load
  • Continued training works on loaded classifiers
  • Tests cover both string and hash input to from_json

Confidence Score: 5/5

  • This PR is safe to merge with no issues found
  • Implementation is well-designed, with proper mutex handling, comprehensive test coverage, a versioned format for future compatibility, and a sound decision to rebuild the LSI index on load for portability. Code follows existing patterns and includes thorough documentation.
  • No files require special attention

Important Files Changed

| Filename | Overview |
|----------|----------|
| lib/classifier/bayes.rb | Added JSON serialization methods (as_json, to_json, from_json) and file persistence (save, load). Implementation properly handles mutex reinitialization and symbol/string conversions. |
| lib/classifier/lsi.rb | Added JSON serialization with automatic index rebuilding on load. Only serializes source data (word_hash, categories), not computed vectors, ensuring portability across native/Ruby backends. |
| test/bayes/bayesian_test.rb | Added comprehensive tests for save/load functionality including round-trip testing, continued training, and hash/string input validation. |
| test/lsi/lsi_test.rb | Added thorough LSI save/load tests covering classification preservation, search functionality, auto_rebuild setting, and continued item additions. |
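
The symbol/string conversions mentioned for bayes.rb are needed because JSON has no symbol type: symbol keys are written as strings and come back as strings. A minimal sketch of the two usual ways to restore them follows. The `restore_symbol_keys` helper is hypothetical, and the two-level hash shape is an assumption based on the `categories` structure quoted in the review.

```ruby
require 'json'

# Hypothetical helper: convert a two-level string-keyed hash back to symbols.
def restore_symbol_keys(parsed)
  parsed.transform_keys(&:to_sym)
        .transform_values { |words| words.transform_keys(&:to_sym) }
end

state = { Spam: { cheap: 2, buy: 1 } }

# Serializing: symbol keys become JSON strings.
json = JSON.generate(state)

# Parsing: keys come back as strings by default, so they are converted explicitly...
restored = restore_symbol_keys(JSON.parse(json))

# ...or the parser is asked to symbolize every key itself.
symbolized = JSON.parse(json, symbolize_names: true)
```

Explicit conversion gives control over which levels become symbols, which matters when some keys (for example, arbitrary item identifiers) should stay strings.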

Sequence Diagram

sequenceDiagram
    participant User
    participant Bayes/LSI
    participant FileSystem
    participant JSON

    Note over User,JSON: Save Flow
    User->>Bayes/LSI: save(path)
    Bayes/LSI->>Bayes/LSI: to_json()
    Bayes/LSI->>Bayes/LSI: as_json()
    Note right of Bayes/LSI: Transform symbols to strings<br/>Serialize state data
    Bayes/LSI->>JSON: to_json
    JSON-->>Bayes/LSI: JSON string
    Bayes/LSI->>FileSystem: File.write(path, json)
    FileSystem-->>User: Success

    Note over User,JSON: Load Flow
    User->>Bayes/LSI: load(path)
    Bayes/LSI->>FileSystem: File.read(path)
    FileSystem-->>Bayes/LSI: JSON string
    Bayes/LSI->>Bayes/LSI: from_json(json)
    Bayes/LSI->>JSON: JSON.parse(json)
    JSON-->>Bayes/LSI: Hash
    Bayes/LSI->>Bayes/LSI: allocate new instance
    alt Bayes
        Bayes/LSI->>Bayes/LSI: restore_state(data)
        Note right of Bayes/LSI: Initialize mutex<br/>Transform strings to symbols<br/>Restore all state variables
    else LSI
        Bayes/LSI->>Bayes/LSI: new(auto_rebuild: false)
        Note right of Bayes/LSI: Create ContentNodes from data<br/>Increment version counter
        Bayes/LSI->>Bayes/LSI: build_index()
        Note right of Bayes/LSI: Rebuild SVD vectors<br/>from source data
    end
    Bayes/LSI-->>User: Loaded classifier

@cardmagic cardmagic merged commit d3f4771 into master Dec 28, 2025
6 checks passed
@cardmagic cardmagic deleted the save-classifier-state branch December 28, 2025 16:18

Successfully merging this pull request may close these issues.

Save the current state of training