Fixed multiple tests in JsonIngestionFromAvroQueriesTest #17377

LucasEby · 2025-12-16T03:06:08Z

Motivation

Multiple tests in JsonIngestionFromAvroQueriesTest exhibit non-deterministic failures due to a combination of issues related to HashMap iteration order and incorrect dictionary lookup behavior:

Dictionary lookup silently returns incorrect values from row 0 on unsuccessful key match instead of failing. SegmentDictionaryCreator uses FastUtil's Double2IntOpenHashMap, Float2IntOpenHashMap, Int2IntOpenHashMap, Long2IntOpenHashMap, and Object2IntOpenHashMap for dictionary lookups. According to FastUtil documentation, these maps return a default value of 0 when a key is not found via get() or getInt(). This causes data corruption: when a lookup fails, the code silently uses dictionary ID 0, causing row 0's value to incorrectly appear in other rows.
Non-deterministic HashMap iteration in Avro deserialization.
Apache Avro's GenericDatumReader uses HashMap internally (via GenericData.newMap()), which provides no iteration order guarantees per the HashMap specification. Serializing and deserializing the Avro records containing Maps may result in different key orders across ingestion passes (on storage and indexing).
JSON key-value pair ordering is not semantically significant.
Per the JSON specification, "an object is an unordered set of name/value pairs". String comparison of JSON objects with different key orderings incorrectly treats {"a":"1","b":"2"} and {"b":"2","a":"1"} as distinct values, when they are semantically equivalent.

The ordering for each of these order-related problems can change due to different environments producing the contents in different orders despite the logical contents being the same. This harmless re-ordering can surface as test failures and allow incorrect dictionary values to be silently used.

Modifications

SegmentDictionaryCreator.java

The FastUtil dictionary maps have been explicitly configured to return -1 on missing keys to prevent silent corruption. Dictionary lookups have been wrapped in checkIdx which returns the dictionary ID when found, attempts JSON-aware normalization for string values, and throws a descriptive IllegalStateException if no match exists. JSON-aware comparison utilities using Jackson have also been added which perform structural comparison via JsonNode.equals() to ignore object key order and to preserve array ordering semantics.

JsonIngestionFromAvroQueriesTest

Added assertJsonEquals() to compare JSON values structurally rather than lexically. testSimpleSelectOnJsonColumn() and testComplexSelectOnJsoncolumn() were refactored to apply column structural comparison only to JSON columns and to use column-level assertions (if needed).

In essence, these changes keep the spirit of the original code while eliminating silent corruptions of data and failures caused by allowed (but previously unexpected) reordering.

Verifying this change

Make sure that the change passes all checks by running mvn clean install -Pbin-dist locally.

Fixed tests

JsonIngestionFromAvroQueriesTest.testJsonPathSelectOnJsonColumn
JsonIngestionFromAvroQueriesTest.testComplexSelectOnJsonColumn
JsonIngestionFromAvroQueriesTest.testSimpleSelectOnJsonColumn

Jackie-Jiang · 2025-12-19T23:11:29Z

Thanks for the contribution!
Currently SegmentDictionaryCreator has assumptions that all values must have already been added before the lookup. There is no check for performance reason. Do you see it being violated anywhere? If it is violated just in the test, let's fix the test instead.
We can also add some javadoc in SegmentDictionaryCreator to explain the assumptions

codecov-commenter · 2025-12-19T23:51:48Z

Codecov Report

❌ Patch coverage is 54.09836% with 28 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.22%. Comparing base (42715fc) to head (5d5a546).
⚠️ Report is 23 commits behind head on master.

Files with missing lines	Patch %	Lines
...segment/creator/impl/SegmentDictionaryCreator.java	54.09%	27 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #17377      +/-   ##
============================================
+ Coverage     63.21%   63.22%   +0.01%     
  Complexity     1474     1474              
============================================
  Files          3147     3147              
  Lines        187562   187624      +62     
  Branches      28712    28718       +6     
============================================
+ Hits         118560   118629      +69     
+ Misses        59808    59803       -5     
+ Partials       9194     9192       -2

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (ø)`
integration	`100.00% <ø> (ø)`
integration1	`100.00% <ø> (ø)`
integration2	`0.00% <ø> (ø)`
java-11	`63.17% <54.09%> (-0.03%)`	⬇️
java-21	`63.20% <54.09%> (+0.01%)`	⬆️
temurin	`63.22% <54.09%> (+0.01%)`	⬆️
unittests	`63.22% <54.09%> (+0.01%)`	⬆️
unittests1	`55.62% <54.09%> (+<0.01%)`	⬆️
unittests2	`33.90% <54.09%> (+0.02%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

LucasEby · 2025-12-22T17:40:03Z

@Jackie-Jiang I see where you are coming from. Thank you for your suggestions. I am currently away on vacation for about a week. When I get back I will address your feedback and see if some sort of compromise can be reached which does not compromise performance.

Fixed multiple tests in JsonIngestionFromAvroQueriesTest

5d5a546

Jackie-Jiang added flaky-test bugfix labels Dec 19, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fixed multiple tests in JsonIngestionFromAvroQueriesTest #17377

Fixed multiple tests in JsonIngestionFromAvroQueriesTest #17377

LucasEby commented Dec 16, 2025

Uh oh!

Jackie-Jiang commented Dec 19, 2025

Uh oh!

codecov-commenter commented Dec 19, 2025 •

edited

Loading

Uh oh!

LucasEby commented Dec 22, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Fixed multiple tests in JsonIngestionFromAvroQueriesTest #17377

Are you sure you want to change the base?

Fixed multiple tests in JsonIngestionFromAvroQueriesTest #17377

Conversation

LucasEby commented Dec 16, 2025

Motivation

Modifications

SegmentDictionaryCreator.java

JsonIngestionFromAvroQueriesTest

Verifying this change

Uh oh!

Jackie-Jiang commented Dec 19, 2025

Uh oh!

codecov-commenter commented Dec 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

LucasEby commented Dec 22, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov-commenter commented Dec 19, 2025 •

edited

Loading