Skip to content

Conversation

@samyron
Copy link
Contributor

@samyron samyron commented Dec 31, 2025

This PR contains 4 commits which optimize json_string_unescape. Admittedly, I almost didn't create this PR as I don't love the Apple Aarch64-specific code... however I decided to leave it up to the maintainers of this library to decide if this code is worth accepting.

I omitted the SWAR / ARM Neon unless the code is being compiled on Apple platforms as a quick benchmark on my x86 laptop with the SWAR find_backslash and custom json_memcpy code showed it was no faster master. Looking at the memcpy and memchr on Ubuntu showed a AVX optimized libc. At the time the code was structured a bit differently and it was only copying sizeof(uint64_t) bytes at a time on x86. I could try implementing find_backslash and json_memcpy using SSE2 if desired. I should note that on memchr on macOS does seem to be optimized using ARM Neon instructions. Looking at the assembly, the implementation is similar though not quite the same. I think the performance improvement comes from reduced function call overhead / branching as find_backslashes ends up inlined.

With respect to keeping track of the additional_backslashes instead of has_more, if the string to unescape ends with a long run of characters that do not have any more backslashes, we can save a memchr and directly copy to the end of the string.

I'm cool if you decide not to accept any of these and simply close this PR. These are just some random things I've been testing out.

Benchmarks

These were run on an Macbook Air M1.

With only the addtional_backslashes commit:

The benchmark:

benchmark_parsing 'a bunch of escapes followed by a long string without escapes', JSON.generate("some\nescapes\twith\"a long\"string after\fso\twe\ncan test\nno\ffinal\tmemchr" * 4 + "this is some longish string! " * 512)
== Parsing a bunch of escapes followed by a long string without escapes (15162 bytes)
ruby 3.4.8 (2025-12-17 revision 995b59f666) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after    33.105k i/100ms
Calculating -------------------------------------
               after    345.312k (± 2.3%) i/s    (2.90 μs/i) -      1.755M in   5.083936s

Comparison:
              before:   307715.4 i/s
               after:   345311.9 i/s - 1.12x  faster

additional_backslashes and find_backslashes:

== Parsing a bunch of escapes followed by a long string without escapes (15162 bytes)
ruby 3.4.8 (2025-12-17 revision 995b59f666) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after    32.551k i/100ms
Calculating -------------------------------------
               after    351.976k (± 2.6%) i/s    (2.84 μs/i) -      1.790M in   5.089949s

Comparison:
              before:   307223.8 i/s
               after:   351976.5 i/s - 1.15x  faster


== Parsing activitypub.json (58160 bytes)
ruby 3.4.8 (2025-12-17 revision 995b59f666) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     1.094k i/100ms
Calculating -------------------------------------
               after     10.824k (± 0.8%) i/s   (92.39 μs/i) -     54.700k in   5.054133s

Comparison:
              before:    10415.9 i/s
               after:    10823.5 i/s - 1.04x  faster

additional_backslashes, find_backslashes, json_memcpy:

== Parsing a bunch of escapes followed by a long string without escapes (15162 bytes)
ruby 3.4.8 (2025-12-17 revision 995b59f666) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after    32.384k i/100ms
Calculating -------------------------------------
               after    342.436k (± 3.2%) i/s    (2.92 μs/i) -      1.716M in   5.017785s

Comparison:
              before:   288084.1 i/s
               after:   342436.2 i/s - 1.19x  faster


== Parsing activitypub.json (58160 bytes)
ruby 3.4.8 (2025-12-17 revision 995b59f666) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     1.128k i/100ms
Calculating -------------------------------------
               after     11.272k (± 0.7%) i/s   (88.71 μs/i) -     56.400k in   5.003607s

Comparison:
              before:    10586.3 i/s
               after:    11272.4 i/s - 1.06x  faster

additional_backslashes, find_backslashes, json_memcpy, unescape_unicode:

== Parsing activitypub.json (58160 bytes)
ruby 3.4.8 (2025-12-17 revision 995b59f666) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     1.037k i/100ms
Calculating -------------------------------------
               after     11.368k (± 0.6%) i/s   (87.96 μs/i) -     57.035k in   5.017163s

Comparison:
              before:    10600.4 i/s
               after:    11368.4 i/s - 1.07x  faster

Comment on lines +482 to +485
signed char b0 = digit_values[p[0]];
signed char b1 = digit_values[p[1]];
signed char b2 = digit_values[p[2]];
signed char b3 = digit_values[p[3]];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a risk of reading past the end of the buffer here.

Prior digit_values[p[0]] < 0 would return if p[0] is a NULL byte.

I think this optimization does make sense, but we need to ensure we can actually read 4 bytes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I checked all the callsites, they all already ensure that.

So I think I'm good with that optimization, except I think I'd like to refactor the bound check inside the function.

@byroot
Copy link
Member

byroot commented Dec 31, 2025

as I don't love the Apple Aarch64-specific code... however I decided to leave it up to the maintainers of this library to decide if this code is worth accepting.

To be frank with you, my opinion on the matter is that Apple hardware is overwhelmingly used in development, almost never in production, so it's not really worth taking on extra complexity just for that environment, unless of course the gain is massive.

I do however like your two optimizations that aren't specific to macOS+aarch64.

@byroot
Copy link
Member

byroot commented Dec 31, 2025

Alright, I cherry picked d21d936 and 976ba36 on master.

Thank you for the PR!

@byroot byroot closed this Dec 31, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants