feat(medcat): CU-869bhknfm Refactor setting of filters for embedding linker #268
Problem
The filter-setting mechanism in the embedding linker was inefficient, causing significant performance bottlenecks when processing datasets with many small documents, each with its own filters. This was particularly problematic during metric calculation for the COMETA dataset, where filter setup dominated the overall runtime.
Performance Impact
- Before: Running with the `spacy` tokenizer took 14,671 seconds for COMETA
- After: Took 205.7 seconds with `spacy` for COMETA
- Speedup: ~70x improvement
Changes Made
- `_initialize_filter_structures`:
  - `_cui_idx_to_name_idxs`: maps CUI indices to lists of name indices containing them
  - `_has_cuis_all_cached` to avoid recomputation
- `_get_include_filters_1cui`: Single CUI include filter using inverted index
- `_get_include_filters_multi_cui`: Multi-CUI include filter with NumPy concatenation
- `_get_exclude_filters_1cui` / `_get_exclude_filters_multi_cui`: Corresponding exclude filters
- `_get_include_filters` and `_get_exclude_filters` choose appropriate implementation

Checks to make sure this doesn't change behaviour
I ran the metrics on both the COMETA and the Linking Challenge datasets before and after. The precision/recall/F1 are identical, so I'm fairly confident the change hasn't messed anything up.
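
For reviewers who want the gist of the inverted-index idea listed under Changes Made, here is a minimal standalone sketch. It is not the code from this PR: the function names (`build_inverted_index`, `include_mask`, `exclude_mask`) and the input layout (one list of CUI indices per name) are assumptions made for illustration only. It shows a CUI-to-name-indices map built once, a single-CUI fast path, and a multi-CUI path that concatenates index arrays with NumPy.

```python
import numpy as np


def build_inverted_index(name_to_cui_idxs):
    """Map each CUI index to an array of the name indices that contain it.

    `name_to_cui_idxs` is a hypothetical input: one list of CUI indices
    per name, positioned by name index.
    """
    buckets = {}
    for name_idx, cui_idxs in enumerate(name_to_cui_idxs):
        for cui_idx in cui_idxs:
            buckets.setdefault(cui_idx, []).append(name_idx)
    return {c: np.asarray(n, dtype=np.int64) for c, n in buckets.items()}


def include_mask(index, n_names, cui_idxs):
    """Boolean mask over names that carry at least one of the given CUIs."""
    mask = np.zeros(n_names, dtype=bool)
    hits = [index[c] for c in cui_idxs if c in index]
    if len(hits) == 1:
        # Single-CUI case: one direct lookup, no concatenation needed.
        mask[hits[0]] = True
    elif hits:
        # Multi-CUI case: merge the per-CUI index arrays in one NumPy call.
        mask[np.concatenate(hits)] = True
    return mask


def exclude_mask(index, n_names, cui_idxs):
    """Boolean mask over names that carry none of the given CUIs."""
    return ~include_mask(index, n_names, cui_idxs)


# Four names; name 1 maps to CUIs 0 and 1, name 3 to CUIs 1 and 2.
name_to_cui_idxs = [[0], [0, 1], [2], [1, 2]]
index = build_inverted_index(name_to_cui_idxs)
single = include_mask(index, 4, [0])
multi = include_mask(index, 4, [0, 2])
excluded = exclude_mask(index, 4, [1])
```

The index is built once, so setting a filter for each document costs time proportional to the number of matching names rather than to the whole vocabulary. That is the property that matters when there are many small documents, each with its own filter.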