Skip to content

Conversation

@janibakin
Copy link

Fixes #755

  • Drain pipeline queue before destruction (removes SIGABRT on aarch64).
  • Lower ctl contention iterations to 200 k on aarch64.

All benchmarks pass on aarch64.
Tested on Raspberry Pi 4 with 2GB as well as VM with 3GB – no swap growth.

@rocstreaming-bot
Copy link

🤖 Welcome! Thanks for your interest in contributing to the project!

Here is a short check-list to help you get started:

Creating pull request

  • Target PR to develop branch.
  • Include link to related issue in PR description.
  • Ensure all CI checks pass.

Code review

  • Mark PR as draft until it's ready. When ready, undraft and request review.
  • Don't resolve discussions by yourself, instead leave a comment or thumbs up.
  • Re-request review after addressing all discussions.

Refer to contribution guidelines for futher details.

@rocstreaming-bot rocstreaming-bot added contrib PR not by a maintainer S-work-in-progress status: PR is still in progress and changing labels May 30, 2025
@gavv gavv marked this pull request as ready for review June 13, 2025 02:19
@rocstreaming-bot rocstreaming-bot added S-ready-for-review status: PR can be reviewed and removed S-work-in-progress status: PR is still in progress and changing labels Jun 13, 2025
@gavv gavv added S-review-in-progress status: PR is being reviewed and removed S-ready-for-review status: PR can be reviewed labels Jun 13, 2025
@gavv
Copy link
Member

gavv commented Jun 15, 2025

Hi!

Thanks for looking into this, the approach with reducing benchmark parameters looks fine. So far I've tested your branch on 2 boxes, 32-bit RPi3B and 64-bit RPI4B, and I needed a few extra modifications:

diff --git a/src/tests/roc_core/bench_mpsc_queue.cpp b/src/tests/roc_core/bench_mpsc_queue.cpp
index 0e8859f5..70b07f72 100644
--- a/src/tests/roc_core/bench_mpsc_queue.cpp
+++ b/src/tests/roc_core/bench_mpsc_queue.cpp
@@ -16,7 +16,7 @@ namespace roc {
 namespace core {
 namespace {
 
-enum { BatchSize = 10000, NumIterations = 5000000, NumThreads = 16 };
+enum { BatchSize = 10000, NumIterations = 500000, NumThreads = 4 };
 
 #if defined(ROC_BENCHMARK_USE_ACCESSORS)
 inline int get_thread_index(const benchmark::State& state) {
@@ -150,8 +150,6 @@ BENCHMARK_REGISTER_F(BM_MpscQueue, TryPopFront)
     ->Arg(1)
     ->Arg(2)
     ->Arg(4)
-    ->Arg(8)
-    ->Arg(16)
     ->Iterations(NumIterations)
     ->Unit(benchmark::kMicrosecond);
 
@@ -184,8 +182,6 @@ BENCHMARK_REGISTER_F(BM_MpscQueue, PopFront)
     ->Arg(1)
     ->Arg(2)
     ->Arg(4)
-    ->Arg(8)
-    ->Arg(16)
     ->Iterations(NumIterations)
     ->Unit(benchmark::kMicrosecond);
 
diff --git a/src/tests/roc_ctl/bench_task_queue_contention.cpp b/src/tests/roc_ctl/bench_task_queue_contention.cpp
index f1f8452d..5438654f 100644
--- a/src/tests/roc_ctl/bench_task_queue_contention.cpp
+++ b/src/tests/roc_ctl/bench_task_queue_contention.cpp
@@ -17,9 +17,9 @@ namespace ctl {
 namespace {
 
 enum {
-    NumScheduleIterations = 2000000,
-    NumScheduleAfterIterations = 20000,
-    NumThreads = 8,
+    NumScheduleIterations = 200000,
+    NumScheduleAfterIterations = 5000,
+    NumThreads = 4,
     BatchSize = 1000
 };
 
diff --git a/src/tests/roc_pipeline/bench_pipeline_loop_contention.cpp b/src/tests/roc_pipeline/bench_pipeline_loop_contention.cpp
index 57b7be0b..00b32c9f 100644
--- a/src/tests/roc_pipeline/bench_pipeline_loop_contention.cpp
+++ b/src/tests/roc_pipeline/bench_pipeline_loop_contention.cpp
@@ -31,7 +31,7 @@ namespace {
 enum {
     SampleRate = 1000000, // 1 sample = 1 us (for convenience)
     Chans = 0x1,
-    NumThreads = 16,
+    NumThreads = 4,
     NumIterations = 1000000,
     BatchSize = 10000,
     FrameBufSize = 100

Can you please include those in your PR too?


Also, I suggest to make this solution a bit more generic.

First, arch-specific ifdefs (like __aarch64__) may lead to a mess in future if every benchmark would have ifdefs for different CPUs.

I suggest introducing 3 benchmark profiles: LARGE (for workstation-class computers and larger), MEDIUM (for powerfull SBCs like raspberry pi), and SMALL (for weak CPUs like e.g. MIPS routers).

Then in specific benchmarks we can do something like:

#ifdef ROC_BENCHMARK_PROFILE_LARGE
NumScheduleIterations = 2000000,
#else
NumScheduleIterations = 200000,
#endif

I've pushed a commit that defines ROC_CPU_FAMILY macro to cpu_traits.h (f5b9cc4). We can add a header, say, src/tests/bench_profile.h, that sets ROC_BENCHMARK_PROFILE_xxx based on ROC_CPU_FAMILY.

Maybe something like this:

  • X86_64, PPC64, S390X - LARGE
  • GENERIC, X86, PPC, S390, LOONGARCH64, AARCH64, MIPS64, RISCV64 - MEDIUM
  • anything else - SMALL

Also I think we should allow overwriting selected profile via a build option, like:

scons ... --bench-profile=small

You can find examples in SConstruct, e.g. --sanitizers option. If option is given, we can add define like this: env.AppendUnique(CPPDEFINES='ROC_BENCHMARK_PROFILE_LARGE'). If option is not given, then bench_profile.h will deduce default profile from CPU.


In addition, currently number of threads in benchmarks is hard-coded, but obviously there is no good universal value. In practice we want ThreadRange(1, CpuCount()) instead of ThreadRange(1, 16).

We can implement Thread::cpu_count() in src/internal_modules/roc_core/target_posix/roc_core/thread.h and use it in benchmarks. I think it can use sched_getaffinity(), and if it fails or returns zero CPUs, it can fallback to sysconf(_SC_NPROCESSORS_ONLN) and then to _SC_NPROCESSORS_CONF if they're defined.

Also, some benchmarks use:

->Arg(1)
->Arg(2)
->Arg(4)
->Arg(8)
->Arg(16)

insteaf of ThreadsRange(1, 16), I guess we also need to rework them to use range.


Let me know what do you think about this ideas and how much time & interest do you have for this.

If you wish to implement the above, you're very welcome, otherwise let me know what parts are you interested in and I'll work on the others when I have time.

@gavv
Copy link
Member

gavv commented Jun 15, 2025

PS. cross-compiling benchmarks for aarch64 was failing for me, pushed a fix: c4ad8b8

@gavv gavv added S-needs-revision status: Author should revise PR and address feedback and removed S-review-in-progress status: PR is being reviewed labels Jun 15, 2025
@janibakin
Copy link
Author

Thanks! I'll update the PR as soon as I can. Don't want to give exact dates, but I'll aim within a week or two. I'll update with your changes and also add profiles. If time allows I can also add CpuCount(), will update you.

* Drain pipeline queue before destruction → removes SIGABRT on aarch64.
* Reduce ctl contention iterations to 200k on aarch64 on 2 GB Pi.
@janibakin janibakin force-pushed the fix-755-arm-benchmarks branch 2 times, most recently from de4672b to b3a3d67 Compare July 12, 2025 20:02
@janibakin
Copy link
Author

Hi there,

I implemented all changes except Thread::cpu_count(). I can do it for sure, and would love to, but does this have to be under same PR? Let me know.

However, I got some, hopefully, minor issues with passing checks.

This is what I did.

  1. Pushed new changes.
  2. github CI ran and reported an issue with formatting of new header file. All other checks passed.
  3. Fixed this with scons -Q fmt. Updated files were: bench_profile.h and bench_main.cpp(include header was rearranged).
  4. Did commit amend, no changes done to other files. git push --force-with-lease.
  5. This time, check have not passed.

Why could reformatting changes this? Or perhaps something else.

PS. I will also have a look later, but if you know the answer quicker, that would be great.


This far, in actual benchmarks, I split benchmark NumIterations, NumThreads and other parameters based on either LARGE profile, or anything else.

I only tested this on raspberry pi and my VM, which both should be considered MEDIUM profile. Do you think I should add different numbers for SMALL profile? If so, could you help me out with numbers there. I don't have any other devices to test this on.

- Introduce a new SCons option `--bench-profile` to select benchmark profile: small, medium, or large.
- SCons now sets the appropriate preprocessor define (`ROC_BENCHMARK_PROFILE_SMALL`, `_MEDIUM`, or `_LARGE`) based on the selected profile.
- Update benchmark sources to use these defines to control dataset sizes, thread counts, and iteration counts for various benchmarks.
- Default values are chosen if no profile is specified.
- Refactor hardcoded benchmark parameters in `bench_mpsc_queue.cpp`, `bench_task_queue_contention.cpp`, and `bench_pipeline_loop_contention.cpp` to use the new profile-based configuration.
- Included `bench_profile.h` to select default ROC_BENCHMARK_PROFILE_* for specific arch.
@janibakin janibakin force-pushed the fix-755-arm-benchmarks branch from b3a3d67 to 8c18374 Compare July 24, 2025 04:16
@janibakin
Copy link
Author

janibakin commented Jul 24, 2025

Due to previously failed ci check(failed loopback_sink_2_source test), I ran it(ci_checks/linux-x86_64/ubuntu-24.04.sh) locally, and it passed. Then I rerun whole ci checks and now they are passing. So I would assume "not stable" test.

Otherwise, this PR is ready for review. Could someone take a look at it @gavv ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contrib PR not by a maintainer S-needs-revision status: Author should revise PR and address feedback

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants