-
Notifications
You must be signed in to change notification settings - Fork 122
perf: cache expensive computations in Bayes classifier #84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR optimizes performance in the Bayes classifier by implementing caching for expensive computations that were previously recalculated on every call.
- Caches
training_countandvocab_sizecalculations with dirty-flag invalidation - Caches
Vector#magnitudeleveraging Vector immutability - Updates documentation commands to use
bundle exec rake
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| lib/classifier/extensions/vector.rb | Adds memoization to magnitude method to avoid redundant calculations |
| lib/classifier/bayes.rb | Implements caching for training_count and vocab_size with invalidation on mutations |
| CLAUDE.md | Updates rake commands to use bundle exec prefix |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Greptile SummaryThis PR implements performance optimizations by caching expensive O(n) computations that were recalculated on every method call. The changes follow a clean dirty-flag invalidation pattern and are properly synchronized for thread safety. Key Changes:
Implementation Quality:
The caching strategy is sound and addresses the performance issues identified in #65 without introducing complexity or correctness issues. Confidence Score: 5/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant User
participant Bayes
participant Cache
participant Data
Note over Bayes,Cache: Initial State: caches are nil
User->>Bayes: train(category, text)
Bayes->>Cache: invalidate_caches()
Cache->>Cache: @cached_training_count = nil
Cache->>Cache: @cached_vocab_size = nil
Bayes->>Data: Update @category_counts, @categories
User->>Bayes: classifications(text)
Bayes->>Bayes: synchronize do
Bayes->>Cache: cached_training_count()
alt Cache is nil
Cache->>Data: @category_counts.values.sum.to_f
Data-->>Cache: computed value
Cache->>Cache: @cached_training_count = value
end
Cache-->>Bayes: return cached value
Bayes->>Cache: cached_vocab_size()
alt Cache is nil
Cache->>Data: @categories.values.flat_map(&:keys).uniq.size
Data-->>Cache: computed value
Cache->>Cache: @cached_vocab_size = value
end
Cache-->>Bayes: return cached value
Bayes-->>User: classifications hash
Note over User,Data: Subsequent calls skip computation
User->>Bayes: classifications(text) [2nd call]
Bayes->>Cache: cached_training_count()
Cache-->>Bayes: return cached value (O(1))
Bayes->>Cache: cached_vocab_size()
Cache-->>Bayes: return cached value (O(1))
Bayes-->>User: classifications hash
User->>Bayes: untrain(category, text)
Bayes->>Cache: invalidate_caches()
Cache->>Cache: caches set to nil
Bayes->>Data: Update data structures
|
Greptile found no issues!From now on, if a review finishes and we haven't found any issues, we will not post anything, but you can confirm that we reviewed your changes in the status check section. This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR". |
Addresses issue #65 by memoizing O(n) operations that were recalculated on every classification call: - Cache training_count (sum of category counts) with dirty-flag pattern - Cache vocab_size (unique words across categories) with invalidation - Cache Vector#magnitude since Vector objects are immutable The dirty-flag pattern ensures caches are invalidated when training data changes via train, untrain, add_category, or remove_category calls. Also updates CLAUDE.md to use bundle exec for rake commands. Closes #65
2cb82bb to
3eae487
Compare
|
@greptileai review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3 files reviewed, 2 comments
RuboCop's Naming/MemoizedInstanceVariableName requires the memoized instance variable to match the method name. Update: - @training_count_cache -> @cached_training_count - @vocab_size_cache -> @cached_vocab_size - @magnitude_cache -> @magnitude
3eae487 to
ccf4925
Compare
|
Fixed the variable naming inconsistency - now using |
|
@greptileai review |
Summary
training_countandvocab_sizein Bayes classifier with dirty-flag invalidationVector#magnitudesince Vector objects are immutable after creationbundle exec rakefor commandsPerformance Impact
classifications()classifications()Vector#magnitudeCaches are invalidated via dirty-flag pattern when
train,untrain,add_category, orremove_categoryare called.Test plan
bundle exec rake testNATIVE_VECTOR=true bundle exec rake testCloses #65