@cardmagic
Owner

Summary

  • Cache training_count and vocab_size in Bayes classifier with dirty-flag invalidation
  • Cache Vector#magnitude since Vector objects are immutable after creation
  • Update CLAUDE.md to use bundle exec rake for commands

Performance Impact

| Location | Before | After |
|---|---|---|
| classifications() | O(n) sum on every call | O(1) cached |
| classifications() | O(n) vocab calculation | O(1) cached |
| Vector#magnitude | Always recalculates | Cached per instance |

Caches are invalidated via a dirty-flag pattern whenever train, untrain, add_category, or remove_category is called.
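
As a rough illustration of the invalidation described above, here is a minimal sketch. `TinyBayes` and its `recomputations` counter are hypothetical stand-ins invented for this example, not the gem's actual code; only the instance variable names `@cached_training_count` and `@cached_vocab_size` and the two expressions being cached are taken from this PR.

```ruby
# Hypothetical stand-in for the classifier; not the gem's real implementation.
class TinyBayes
  attr_reader :recomputations # illustration only: counts training_count recomputes

  def initialize
    @categories = {}                 # category => { word => count }
    @category_counts = Hash.new(0)   # category => number of train calls
    @recomputations = 0
    invalidate_caches
  end

  def add_category(name)
    @categories[name] ||= Hash.new(0)
    invalidate_caches
  end

  def train(category, text)
    add_category(category)
    text.downcase.split.each { |word| @categories[category][word] += 1 }
    @category_counts[category] += 1
    invalidate_caches
  end

  # O(n) only on the first call after a mutation; the cached value is reused after that.
  def training_count
    @cached_training_count ||= begin
      @recomputations += 1
      @category_counts.values.sum.to_f
    end
  end

  def vocab_size
    @cached_vocab_size ||= @categories.values.flat_map(&:keys).uniq.size
  end

  private

  # Resetting to nil acts as the "dirty" marker: the next read recomputes.
  def invalidate_caches
    @cached_training_count = nil
    @cached_vocab_size = nil
  end
end

bayes = TinyBayes.new
bayes.train(:spam, "buy now")
bayes.train(:ham, "hello now")
bayes.training_count # computed once
bayes.training_count # served from the cache
```

Because `0` and `0.0` are truthy in Ruby, `||=` still caches an empty classifier's values; only `nil` triggers a recompute.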

Test plan

  • All tests pass with bundle exec rake test
  • All tests pass with NATIVE_VECTOR=true bundle exec rake test

Closes #65

@cardmagic cardmagic requested a review from Copilot December 28, 2025 15:50
@cardmagic cardmagic self-assigned this Dec 28, 2025

Copilot AI left a comment

Pull request overview

This PR optimizes performance in the Bayes classifier by implementing caching for expensive computations that were previously recalculated on every call.

  • Caches training_count and vocab_size calculations with dirty-flag invalidation
  • Caches Vector#magnitude leveraging Vector immutability
  • Updates documentation commands to use bundle exec rake

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
|---|---|
| lib/classifier/extensions/vector.rb | Adds memoization to magnitude method to avoid redundant calculations |
| lib/classifier/bayes.rb | Implements caching for training_count and vocab_size with invalidation on mutations |
| CLAUDE.md | Updates rake commands to use bundle exec prefix |

@cardmagic
Owner Author

@greptile

@greptile-apps
Contributor

greptile-apps bot commented Dec 28, 2025

Greptile Summary

This PR implements performance optimizations by caching expensive O(n) computations that were recalculated on every method call. The changes follow a clean dirty-flag invalidation pattern and are properly synchronized for thread safety.

Key Changes:

  • Caches training_count (sum of category counts) in Bayes classifier, reducing O(n) to O(1) on each classifications() call
  • Caches vocab_size (unique vocabulary calculation) in Bayes classifier, reducing O(n) to O(1)
  • Caches Vector#magnitude since Vector instances are immutable after creation
  • All caches are invalidated when data changes via train, untrain, add_category, or remove_category
  • Updated CLAUDE.md to use bundle exec rake for all commands (best practice for dependency isolation)
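
The Vector#magnitude item above relies on the elements never changing after construction. A minimal sketch of that idea, assuming a hypothetical `FrozenVector` class (the PR itself patches the `Vector` extension in lib/classifier/extensions/vector.rb, which is not reproduced here):

```ruby
# Hypothetical immutable vector used only to illustrate per-instance memoization.
class FrozenVector
  attr_reader :computations # illustration only: how often the norm was computed

  def initialize(elements)
    @elements = elements.dup.freeze # elements cannot change after creation
    @computations = 0
  end

  # Safe to memoize without invalidation because @elements is frozen.
  def magnitude
    @magnitude ||= begin
      @computations += 1
      Math.sqrt(@elements.sum { |x| x * x })
    end
  end
end

v = FrozenVector.new([3, 4])
v.magnitude # computes the Euclidean norm
v.magnitude # returns the stored value
```

No invalidation hook is needed here, which is the difference from the Bayes caches: an object that can be mutated would need one.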

Implementation Quality:

  • Thread-safe: cache access occurs within synchronize blocks
  • Correct naming: memoized variable names follow RuboCop convention (e.g., @cached_training_count matches method cached_training_count)
  • No stale cache risk: all mutation operations properly call invalidate_caches
  • Handles deserialization: marshal_load and restore_state properly initialize caches to nil
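
The thread-safety and deserialization points above can be sketched together. This is an assumption-laden illustration: `SyncedCounts` is a hypothetical class, and it uses a plain `Mutex` instance variable, whereas the actual synchronization mechanism in lib/classifier/bayes.rb is not shown in this conversation.

```ruby
# Hypothetical class; shows cache access under a lock and a nil cache after load.
class SyncedCounts
  def initialize
    @mutex = Mutex.new
    @category_counts = Hash.new(0)
    @cached_training_count = nil
  end

  def train(category)
    @mutex.synchronize do
      @category_counts[category] += 1
      @cached_training_count = nil # invalidate inside the same critical section
    end
  end

  def training_count
    @mutex.synchronize do
      @cached_training_count ||= @category_counts.values.sum.to_f
    end
  end

  # Only the data is serialized; a Mutex cannot be marshaled.
  def marshal_dump
    @category_counts
  end

  # Recreate the lock and start with an empty cache so nothing stale survives.
  def marshal_load(counts)
    @mutex = Mutex.new
    @category_counts = counts
    @cached_training_count = nil
  end
end

counts = SyncedCounts.new
4.times.map { Thread.new { 25.times { counts.train(:spam) } } }.each(&:join)
restored = Marshal.load(Marshal.dump(counts))
restored.training_count # recomputed from the restored data
```

Doing the mutation and the invalidation under the same lock means a reader can never observe new data paired with an old cached total.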

The caching strategy is sound and addresses the performance issues identified in #65 without introducing complexity or correctness issues.

Confidence Score: 5/5

  • This PR is safe to merge with no identified risks
  • The caching implementation is straightforward, thread-safe, and well-tested. Previous naming issues have been resolved in commit ccf4925. All mutation operations properly invalidate caches, preventing stale data. The changes are focused performance optimizations that don't alter behavior.
  • No files require special attention

Important Files Changed

| Filename | Overview |
|---|---|
| lib/classifier/bayes.rb | Added caching for training_count and vocab_size with dirty-flag invalidation pattern, properly synchronized within mutex blocks |
| lib/classifier/extensions/vector.rb | Added magnitude caching for immutable Vector instances using memoization |
| CLAUDE.md | Updated all rake commands to use bundle exec rake prefix for proper dependency isolation |

Sequence Diagram

sequenceDiagram
    participant User
    participant Bayes
    participant Cache
    participant Data

    Note over Bayes,Cache: Initial State: caches are nil

    User->>Bayes: train(category, text)
    Bayes->>Cache: invalidate_caches()
    Cache->>Cache: @cached_training_count = nil
    Cache->>Cache: @cached_vocab_size = nil
    Bayes->>Data: Update @category_counts, @categories

    User->>Bayes: classifications(text)
    Bayes->>Bayes: synchronize do
    Bayes->>Cache: cached_training_count()
    alt Cache is nil
        Cache->>Data: @category_counts.values.sum.to_f
        Data-->>Cache: computed value
        Cache->>Cache: @cached_training_count = value
    end
    Cache-->>Bayes: return cached value
    
    Bayes->>Cache: cached_vocab_size()
    alt Cache is nil
        Cache->>Data: @categories.values.flat_map(&:keys).uniq.size
        Data-->>Cache: computed value
        Cache->>Cache: @cached_vocab_size = value
    end
    Cache-->>Bayes: return cached value
    Bayes-->>User: classifications hash

    Note over User,Data: Subsequent calls skip computation

    User->>Bayes: classifications(text) [2nd call]
    Bayes->>Cache: cached_training_count()
    Cache-->>Bayes: return cached value (O(1))
    Bayes->>Cache: cached_vocab_size()
    Cache-->>Bayes: return cached value (O(1))
    Bayes-->>User: classifications hash

    User->>Bayes: untrain(category, text)
    Bayes->>Cache: invalidate_caches()
    Cache->>Cache: caches set to nil
    Bayes->>Data: Update data structures

@greptile-apps
Contributor

greptile-apps bot commented Dec 28, 2025

Greptile found no issues!

From now on, if a review finishes and we haven't found any issues, we will not post anything, but you can confirm that we reviewed your changes in the status check section.

This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".

Addresses issue #65 by memoizing O(n) operations that were recalculated
on every classification call:

- Cache training_count (sum of category counts) with dirty-flag pattern
- Cache vocab_size (unique words across categories) with invalidation
- Cache Vector#magnitude since Vector objects are immutable

The dirty-flag pattern ensures caches are invalidated when training data
changes via train, untrain, add_category, or remove_category calls.

Also updates CLAUDE.md to use bundle exec for rake commands.

Closes #65
@cardmagic cardmagic force-pushed the cache-expensive-computations branch from 2cb82bb to 3eae487 Compare December 28, 2025 16:27
@cardmagic
Owner Author

@greptileai review

Contributor

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 2 comments

RuboCop's Naming/MemoizedInstanceVariableName requires the memoized
instance variable to match the method name. Update:

- @training_count_cache -> @cached_training_count
- @vocab_size_cache -> @cached_vocab_size
- @magnitude_cache -> @magnitude
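
For readers unfamiliar with the cop, the rename can be pictured with the sketch below. `Counts` is a hypothetical class made up for this example; the method and variable names mirror the ones listed in the commit message.

```ruby
# Hypothetical class illustrating the naming rule from the commit message.
class Counts
  def initialize(category_counts)
    @category_counts = category_counts
  end

  # Form the cop reports, since the variable name differs from the method name:
  #   def cached_training_count
  #     @training_count_cache ||= @category_counts.values.sum.to_f
  #   end

  # Accepted form: the memoized variable is named after the method.
  def cached_training_count
    @cached_training_count ||= @category_counts.values.sum.to_f
  end
end

Counts.new({ spam: 2, ham: 3 }).cached_training_count
```

The rename does not change behavior; it only keeps the variable discoverable from the method that owns it.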
@cardmagic cardmagic force-pushed the cache-expensive-computations branch from 3eae487 to ccf4925 Compare December 28, 2025 16:36
@cardmagic
Owner Author

Fixed the variable naming inconsistency - now using @cached_training_count and @cached_vocab_size consistently in marshal_load and restore_state.

@cardmagic
Owner Author

@greptileai review

@cardmagic cardmagic merged commit 661411d into master Dec 28, 2025
6 checks passed
@cardmagic cardmagic deleted the cache-expensive-computations branch December 28, 2025 16:42

Development

Successfully merging this pull request may close these issues.

Cache expensive computations that are recalculated on every call

2 participants