Skip to content

Conversation

@LucasEby
Copy link

Fixes #17376

Motivation

Multiple tests in JsonIngestionFromAvroQueriesTest exhibit non-deterministic failures due to a combination of issues related to HashMap iteration order and incorrect dictionary lookup behavior:

  1. Dictionary lookup silently returns incorrect values from row 0 on unsuccessful key match instead of failing. SegmentDictionaryCreator uses FastUtil's Double2IntOpenHashMap, Float2IntOpenHashMap, Int2IntOpenHashMap, Long2IntOpenHashMap, and Object2IntOpenHashMap for dictionary lookups. According to FastUtil documentation, these maps return a default value of 0 when a key is not found via get() or getInt(). This causes data corruption: when a lookup fails, the code silently uses dictionary ID 0, causing row 0's value to incorrectly appear in other rows.

  2. Non-deterministic HashMap iteration in Avro deserialization.
    Apache Avro's GenericDatumReader uses HashMap internally (via GenericData.newMap()), which provides no iteration order guarantees per the HashMap specification. Serializing and deserializing the Avro records containing Maps may result in different key orders across ingestion passes (on storage and indexing).

  3. JSON key-value pair ordering is not semantically significant.
    Per the JSON specification, "an object is an unordered set of name/value pairs". String comparison of JSON objects with different key orderings incorrectly treats {"a":"1","b":"2"} and {"b":"2","a":"1"} as distinct values, when they are semantically equivalent.

The ordering for each of these order-related problems can change due to different environments producing the contents in different orders despite the logical contents being the same. This harmless re-ordering can surface as test failures and allow incorrect dictionary values to be silently used.

Modifications

SegmentDictionaryCreator.java

The FastUtil dictionary maps have been explicitly configured to return -1 on missing keys to prevent silent corruption. Dictionary lookups have been wrapped in checkIdx which returns the dictionary ID when found, attempts JSON-aware normalization for string values, and throws a descriptive IllegalStateException if no match exists. JSON-aware comparison utilities using Jackson have also been added which perform structural comparison via JsonNode.equals() to ignore object key order and to preserve array ordering semantics.

JsonIngestionFromAvroQueriesTest

Added assertJsonEquals() to compare JSON values structurally rather than lexically. testSimpleSelectOnJsonColumn() and testComplexSelectOnJsoncolumn() were refactored to apply column structural comparison only to JSON columns and to use column-level assertions (if needed).

In essence, these changes keep the spirit of the original code while eliminating silent corruptions of data and failures caused by allowed (but previously unexpected) reordering.

Verifying this change

  • Make sure that the change passes all checks by running mvn clean install -Pbin-dist locally.

Fixed tests

  • JsonIngestionFromAvroQueriesTest.testJsonPathSelectOnJsonColumn
  • JsonIngestionFromAvroQueriesTest.testComplexSelectOnJsonColumn
  • JsonIngestionFromAvroQueriesTest.testSimpleSelectOnJsonColumn

@Jackie-Jiang
Copy link
Contributor

Thanks for the contribution!
Currently SegmentDictionaryCreator has assumptions that all values must have already been added before the lookup. There is no check for performance reason. Do you see it being violated anywhere? If it is violated just in the test, let's fix the test instead.
We can also add some javadoc in SegmentDictionaryCreator to explain the assumptions

@codecov-commenter
Copy link

codecov-commenter commented Dec 19, 2025

Codecov Report

❌ Patch coverage is 54.09836% with 28 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.22%. Comparing base (42715fc) to head (5d5a546).
⚠️ Report is 23 commits behind head on master.

Files with missing lines Patch % Lines
...segment/creator/impl/SegmentDictionaryCreator.java 54.09% 27 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17377      +/-   ##
============================================
+ Coverage     63.21%   63.22%   +0.01%     
  Complexity     1474     1474              
============================================
  Files          3147     3147              
  Lines        187562   187624      +62     
  Branches      28712    28718       +6     
============================================
+ Hits         118560   118629      +69     
+ Misses        59808    59803       -5     
+ Partials       9194     9192       -2     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.17% <54.09%> (-0.03%) ⬇️
java-21 63.20% <54.09%> (+0.01%) ⬆️
temurin 63.22% <54.09%> (+0.01%) ⬆️
unittests 63.22% <54.09%> (+0.01%) ⬆️
unittests1 55.62% <54.09%> (+<0.01%) ⬆️
unittests2 33.90% <54.09%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@LucasEby
Copy link
Author

@Jackie-Jiang I see where you are coming from. Thank you for your suggestions. I am currently away on vacation for about a week. When I get back I will address your feedback and see if some sort of compromise can be reached which does not compromise performance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Incorrect Assumptions on Ordering and Row Pollution Causing Multiple JsonIngestionFromAvroQueriesTest Test Failures

3 participants