Evaluation toolkit from SoundReactor
Contact: Koichi Saito (koichi.saito@sony.com)
This repository supports the evaluation of:
- Fréchet Distances (FD)
- Fréchet Stereo Audio Distances (FSAD)
- Inception Scores (IS)
- Mean KL Distances (MKL)
- CLAP Scores
  - LAION_CLAP: cosine similarity between text and audio embeddings computed by LAION-CLAP.
- ImageBind Score
  - Cosine similarity between video and audio embeddings computed by ImageBind, sometimes scaled by 100.
- DeSync Score
  - Average misalignment (in seconds) predicted by Synchformer with the `24-01-04T16-39-21` model trained on AudioSet. We average the results from the first 4.8 seconds and the last 4.8 seconds of each video-audio pair.
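For reference, the similarity-based scores above amount to a cosine similarity between embeddings. Using our own notation purely for illustration ($e_t$, $e_v$, $e_a$ are the text, video, and audio embeddings):

$$\mathrm{CLAP} = \frac{e_t \cdot e_a}{\lVert e_t \rVert \, \lVert e_a \rVert}, \qquad \mathrm{ImageBind} = \frac{e_v \cdot e_a}{\lVert e_v \rVert \, \lVert e_a \rVert} \;\; (\times 100 \text{ when scaled})$$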
Install Docker and build the Docker container. The dockerfile is located at `container/dockerfile`.

```bash
docker build -t tag .
```

Or you can install via miniforge. The yaml file is located at `container/environment.yml`.

```bash
mamba env create -f environment.yml
```

Then

```bash
mamba activate v2a_eval
```

Then install PyTorch and flash-attn (we only tested the versions below, but it might work with different ones):

```bash
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121
pip install flash-attn==2.4.1 --no-build-isolation
```

If you plan to evaluate on videos, you will also need ffmpeg. Note that torchaudio imposes a maximum version limit (ffmpeg<7).
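As an optional sanity check (not a script shipped with this repository), you can verify the environment like this:

```bash
# Optional environment check; not part of the repository's scripts.
python -c "import torch, torchaudio, flash_attn; print(torch.__version__, torch.version.cuda)"
ffmpeg -version | head -n 1   # torchaudio requires ffmpeg < 7
```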
- Download the LAION-CLAP checkpoint. Specify the path to the checkpoint at `--clap_ckpt_path` in the shell script.
- Download the Synchformer checkpoint. Specify the path to the checkpoint at `--syncformer_ckpt_path` in the shell script.
- Download the StereoCRW checkpoint. Put the checkpoint under `stereoFAD/checkpoints/pretrained-models`.
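For example, a possible setup (all file names and paths below are placeholders, not the official checkpoint names):

```bash
# Placeholder paths; point the flags in the shell scripts at your actual downloads:
#   --clap_ckpt_path       /path/to/laion_clap_checkpoint.pt
#   --syncformer_ckpt_path /path/to/synchformer_checkpoint.pth
# The StereoCRW checkpoint goes to a fixed location:
mkdir -p stereoFAD/checkpoints/pretrained-models
mv stereocrw_checkpoint.pth stereoFAD/checkpoints/pretrained-models/   # placeholder file name
```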
Evaluation is a two-stage process (see the sketch below):
- Extraction: extract video/text/audio features for the ground-truth data and audio features for the predicted audio.
- Evaluation: compute the desired metrics using the extracted features.
*For FSAD, please check here.
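Putting the two stages together, a typical run might look like this (only the scripts named in this README; which extraction scripts you need depends on your application):

```bash
# Stage 1: extract ground-truth features
bash extract_video.sh   # visual features, for video-to-audio applications
bash extract_text.sh    # text features, for text-conditioned applications

# Stage 2: compute metrics (audio features are extracted and cached here)
bash evaluate.sh
```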
For video-to-audio applications, visual features are extracted from input videos. Input requirements:
- Videos in .mp4 format (any FPS or resolution).
- Video names must match the corresponding audio file names (excluding extensions).
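For illustration, a matching layout could look like this (the directory and file names are hypothetical):

```bash
# Hypothetical layout: each video shares its base name with the corresponding audio file.
ls data/videos   # dog_bark_001.mp4  rain_on_roof_002.mp4
ls data/audios   # dog_bark_001.wav  rain_on_roof_002.wav
```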
Run the following to extract visual features using Synchformer and ImageBind:
```bash
bash extract_video.sh
```

For applications using text, text features are extracted from input text data.
Input requirements:
- A CSV file with a header row and at least the following two columns:
  - `name`: matches the corresponding audio file name (excluding extensions).
  - `caption`: the text associated with the audio.
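A minimal example of such a file (the file name, names, and captions are placeholders):

```bash
# Hypothetical metadata CSV; all values are placeholders.
cat > metadata.csv <<'EOF'
name,caption
dog_bark_001,"A dog barks twice in a quiet backyard"
rain_on_roof_002,"Heavy rain falling on a tin roof"
EOF
```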
Run the following to extract text features using LAION-CLAP:
```bash
bash extract_text.sh
```

You can run evaluation via

```bash
bash evaluate.sh
```

Since the pretrained audio feature extractors are trained on mono audio (except for StereoCRW, for which FSAD can be computed separately via `stereoFAD/run_eval.sh`), audio samples are loaded as mono.
You can specify the mono type via `--mono_type`, e.g. 'mean', 'left', 'right', or 'side' (the difference between left and right).
If you don't want to recompute audio feature extraction, you can skip it by excluding:
- `--recompute_gt_cache` for ground-truth audio features.
- `--recompute_pred_cache` for predicted audio features.
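As a sketch, these options might be combined inside `evaluate.sh` like this (the Python entry point and any other arguments are hypothetical; only the flags themselves appear in this README):

```bash
# Hypothetical invocation; the entry-point name is a placeholder.
python eval.py \
    --mono_type mean \
    --recompute_gt_cache \
    --recompute_pred_cache
```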
If there are no extracted features for text or videos, the metrics related to those features are automatically skipped.
To cite this repository, please use the following BibTeX entry:
```bibtex
@article{saito2025soundreactor,
  title={SoundReactor: Frame-level Online Video-to-Audio Generation},
  author={Koichi Saito and Julian Tanke and Christian Simon and Masato Ishii and Kazuki Shimada and Zachary Novack and Zhi Zhong and Akio Hayakawa and Takashi Shibuya and Yuki Mitsufuji},
  year={2025},
  eprint={2510.02110},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2510.02110},
  journal={arXiv preprint arXiv:2510.02110},
}
```

- https://github.com/hkchengrex/av-benchmark
- https://github.com/PeiwenSun2000/Both-Ears-Wide-Open/