Fixed multiple tests in JsonIngestionFromAvroQueriesTest #17377
+190
−22
Fixes #17376
Motivation
Multiple tests in JsonIngestionFromAvroQueriesTest exhibit non-deterministic failures due to a combination of issues related to HashMap iteration order and incorrect dictionary lookup behavior:
1. Dictionary lookup silently returns incorrect values from row 0 on an unsuccessful key match instead of failing.
SegmentDictionaryCreator uses FastUtil's Double2IntOpenHashMap, Float2IntOpenHashMap, Int2IntOpenHashMap, Long2IntOpenHashMap, and Object2IntOpenHashMap for dictionary lookups. According to the FastUtil documentation, these maps return a default value of 0 when a key is not found via get() or getInt(). This causes data corruption: when a lookup fails, the code silently uses dictionary ID 0, causing row 0's value to incorrectly appear in other rows.
2. Non-deterministic HashMap iteration in Avro deserialization.
Apache Avro's GenericDatumReader uses HashMap internally (via GenericData.newMap()), which provides no iteration order guarantees per the HashMap specification. Serializing and deserializing Avro records that contain Maps may result in different key orders across ingestion passes (on storage and indexing).
3. JSON key-value pair ordering is not semantically significant.
Per the JSON specification, "an object is an unordered set of name/value pairs". String comparison of JSON objects with different key orderings incorrectly treats {"a":"1","b":"2"} and {"b":"2","a":"1"} as distinct values, when they are semantically equivalent.
In each of these cases, the order can differ between environments even though the logical contents are the same. This otherwise harmless reordering can surface as test failures and allow incorrect dictionary values to be used silently.
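The difference between lexical and structural comparison can be illustrated with plain Java collections. This is only a minimal sketch, not the code from this PR: KeyOrderDemo and its methods are hypothetical names, and LinkedHashMap is used here purely to force two different key orders deterministically, standing in for whatever order a HashMap happens to produce in a given environment.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class KeyOrderDemo {
    // Builds a two-entry map whose iteration order follows insertion order,
    // mimicking two environments that emit the same entries in different orders.
    static Map<String, Object> ordered(String k1, Object v1, String k2, Object v2) {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put(k1, v1);
        m.put(k2, v2);
        return m;
    }

    // Lexical comparison: the result depends on the order in which keys are emitted.
    static boolean lexicallyEqual(Map<String, Object> a, Map<String, Object> b) {
        return a.toString().equals(b.toString());
    }

    // Structural comparison: Map.equals ignores entry order, List.equals does not,
    // which matches "objects are unordered, arrays are ordered".
    static boolean structurallyEqual(Object a, Object b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        Map<String, Object> first = ordered("a", "1", "b", "2");
        Map<String, Object> second = ordered("b", "2", "a", "1");
        System.out.println(lexicallyEqual(first, second));                    // false
        System.out.println(structurallyEqual(first, second));                 // true
        System.out.println(structurallyEqual(List.of(1, 2), List.of(2, 1)));  // false
    }
}
```

The same contents compare as different strings but as equal maps, while reordered lists remain unequal.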
Modifications
SegmentDictionaryCreator.java
The FastUtil dictionary maps have been explicitly configured to return -1 on missing keys to prevent silent corruption. Dictionary lookups have been wrapped in checkIdx, which returns the dictionary ID when found, attempts JSON-aware normalization for string values, and throws a descriptive IllegalStateException if no match exists. JSON-aware comparison utilities using Jackson have also been added; they perform structural comparison via JsonNode.equals(), which ignores object key order while preserving array ordering semantics.
JsonIngestionFromAvroQueriesTest
Added assertJsonEquals() to compare JSON values structurally rather than lexically. testSimpleSelectOnJsonColumn() and testComplexSelectOnJsonColumn() were refactored to apply structural comparison only to JSON columns and to use column-level assertions where needed.
In essence, these changes keep the spirit of the original code while eliminating silent data corruption and test failures caused by allowed (but previously unexpected) reordering.
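The sentinel-plus-guard pattern described for SegmentDictionaryCreator can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: DictionaryLookupSketch, unsafeIndexOf, and indexOf are hypothetical names, a plain java.util.HashMap with getOrDefault stands in for the FastUtil maps (where the sentinel would presumably be set through defaultReturnValue(-1)), and the JSON-aware normalization step that the real checkIdx attempts is omitted.

```java
import java.util.HashMap;
import java.util.Map;

final class DictionaryLookupSketch {
    // Sentinel for "key not present"; a valid dictionary ID is never negative.
    static final int MISSING = -1;

    private final Map<String, Integer> valueToDictId = new HashMap<>();

    // Assigns dictionary IDs 0..n-1 in the order the values are given.
    DictionaryLookupSketch(String... values) {
        for (int i = 0; i < values.length; i++) {
            valueToDictId.put(values[i], i);
        }
    }

    // Unsafe variant: a default of 0 makes a miss indistinguishable
    // from a genuine hit on dictionary ID 0.
    int unsafeIndexOf(String value) {
        return valueToDictId.getOrDefault(value, 0);
    }

    // Guarded variant: a miss yields the sentinel, which checkIdx rejects.
    int indexOf(String value) {
        return checkIdx(valueToDictId.getOrDefault(value, MISSING), value);
    }

    // Returns the ID when found; otherwise fails loudly instead of
    // letting a wrong ID flow into the segment.
    private static int checkIdx(int idx, String value) {
        if (idx == MISSING) {
            throw new IllegalStateException("Value not found in dictionary: " + value);
        }
        return idx;
    }
}
```

With a dictionary built from "apple" and "banana", unsafeIndexOf("cherry") quietly returns the ID of "apple", whereas indexOf("cherry") throws.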
Verifying this change
mvn clean install -Pbin-dist locally.
Fixed tests
JsonIngestionFromAvroQueriesTest.testJsonPathSelectOnJsonColumn
JsonIngestionFromAvroQueriesTest.testComplexSelectOnJsonColumn
JsonIngestionFromAvroQueriesTest.testSimpleSelectOnJsonColumn