feat: add message chunking to sdk #80
base: develop
Conversation
Coverage Report
Pull request overview
This pull request centralizes message chunking logic into the SDK to eliminate duplication across worker-cli and orb-discovery. The implementation provides a configurable chunk size parameter with a default of 3.0 MB and uses greedy bin-packing to split entities into appropriately-sized chunks for gRPC ingestion.
Changes:
- Added chunking module with `create_message_chunks()` and `estimate_message_size()` functions
- Exported chunking functions from the SDK's public API
- Added comprehensive test suite covering edge cases including empty lists, single/multiple chunks, custom chunk sizes, order preservation, and large entities
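For illustration, a minimal usage sketch of the new function (hedged: the `DiodeClient`/`ingest()` pattern and the `Device`/`Entity` constructors are assumed from the SDK's existing usage, not defined by this PR):

```python
from netboxlabs.diode.sdk import DiodeClient, create_message_chunks
from netboxlabs.diode.sdk.ingester import Device, Entity

# Build a large batch of entities (placeholder data).
entities = [Entity(device=Device(name=f"device-{i}")) for i in range(10_000)]

with DiodeClient(
    target="grpc://localhost:8080/diode",  # placeholder target
    app_name="example-app",
    app_version="0.0.1",
) as client:
    # Split the batch into chunks under the default 3.0 MB limit and
    # ingest each chunk as a separate gRPC request.
    for chunk in create_message_chunks(entities):
        client.ingest(entities=chunk)
```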
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| netboxlabs/diode/sdk/chunking.py | New module implementing greedy bin-packing chunking algorithm with size estimation |
| netboxlabs/diode/sdk/__init__.py | Exports chunking functions to the SDK's public API |
| tests/test_chunking.py | Comprehensive test suite with 10 test cases covering various chunking scenarios |
mfiedorowicz left a comment:
Thanks @ldrozdz93, generally it's advisable to create a GitHub issue first so we can track, triage and prioritise issues properly. Nevertheless, I see this as valuable; the main ask here is to document chunking in the README. Additionally, we would like functional parity in the Diode Go SDK too, so could you please add a feature request issue there?
From the `create_message_chunks()` docstring in `netboxlabs/diode/sdk/chunking.py`:

    This function chunks entities to ensure each chunk stays under the specified
    size limit. It uses a greedy bin-packing algorithm that accumulates entities
    until adding the next entity would exceed the limit, then starts a new chunk.

    The default chunk size of 3.0 MB provides a safe margin below the gRPC 4 MB
    message size limit, accounting for protobuf serialization overhead.

    Args:
        entities: Iterable of Entity protobuf messages to chunk
        max_chunk_size_mb: Maximum chunk size in MB (default 3.0)

    Returns:
        List of entity chunks, each under max_chunk_size_mb. Returns at least
        one chunk even if the input is empty.

    Examples:
        >>> entities = [entity1, entity2, entity3, ...]
        >>> chunks = create_message_chunks(entities)
        >>> for chunk in chunks:
        ...     client.ingest(chunk)
        >>> # Use a custom chunk size
        >>> chunks = create_message_chunks(entities, max_chunk_size_mb=3.5)
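For reference, the greedy accumulation the docstring describes could look roughly like this (an illustrative sketch, not the PR's actual code; it assumes per-entity sizes come from the protobuf `ByteSize()` method):

```python
def create_message_chunks_sketch(entities, max_chunk_size_mb=3.0):
    """Greedy bin-packing: accumulate entities until the next one would push
    the current chunk over the limit, then start a new chunk."""
    max_bytes = int(max_chunk_size_mb * 1024 * 1024)
    chunks, current, current_size = [], [], 0
    for entity in entities:
        entity_size = entity.ByteSize()  # serialized protobuf size in bytes
        if current and current_size + entity_size > max_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(entity)
        current_size += entity_size
    chunks.append(current)  # always return at least one chunk, even when empty
    return chunks
```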
I'd like to see some basic documentation of chunking and how to use it in the README as well
Please check
Mind the failing linting issues:
Thanks for the feedback @mfiedorowicz. I'm aware this is kind of a shortcut. A customer is actively blocked by this and I wanted to make it quick. I'll create a FR for the Go SDK and add docs.
README.md (Outdated)
    # Decide whether chunking is needed
    if size_mb > 3.0:
        chunks = create_message_chunks(entities)
Then, how to ingest these chunks?
A typo. Check now.
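For readers following the thread, the corrected README flow presumably ends by ingesting each chunk. A hedged sketch (the exact signature and units of `estimate_message_size()` are assumptions inferred from the snippet above):

```python
from netboxlabs.diode.sdk import (
    DiodeClient,
    create_message_chunks,
    estimate_message_size,
)

def ingest_with_chunking(client: DiodeClient, entities: list) -> None:
    """Ingest a batch, splitting it into chunks only when it is too large."""
    size_mb = estimate_message_size(entities)  # assumed to report size in MB
    # Decide whether chunking is needed
    if size_mb > 3.0:
        for chunk in create_message_chunks(entities):
            client.ingest(entities=chunk)
    else:
        client.ingest(entities=entities)
```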
@mfiedorowicz ready for review
@mfiedorowicz I've created a feature request for diode-sdk-go. Let me know if that's not what you meant.
Problem
Message chunking logic was duplicated across worker-cli and orb-discovery, using a heuristic algorithm. It turned out not to be sufficient for a customer's dataset.
Solution
Centralized chunking logic in the SDK with a configurable chunk size parameter (default: 3.0 MB). The implementation uses greedy bin-packing that accumulates entities until adding the next entity would exceed the size limit, then starts a new chunk.
Changes
- Added `netboxlabs/diode/sdk/chunking.py` with `create_message_chunks()` and `estimate_message_size()`
- Added a test suite (`tests/test_chunking.py`) with 10 test cases covering edge cases
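To illustrate the kind of behaviour the test suite exercises, two sketch tests (illustrative only, not the actual cases in `tests/test_chunking.py`; the `Device`/`Entity` constructors are assumed from the SDK's ingester module):

```python
from netboxlabs.diode.sdk import create_message_chunks
from netboxlabs.diode.sdk.ingester import Device, Entity


def test_empty_input_returns_a_single_empty_chunk():
    # The function always returns at least one chunk, even for empty input.
    assert create_message_chunks([]) == [[]]


def test_chunks_preserve_order_and_cover_all_entities():
    entities = [Entity(device=Device(name=f"device-{i}")) for i in range(100)]
    # A tiny limit forces the input to be split into multiple chunks.
    chunks = create_message_chunks(entities, max_chunk_size_mb=0.001)

    assert len(chunks) > 1
    # Every entity lands in exactly one chunk, in the original order.
    assert [e for chunk in chunks for e in chunk] == entities
```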