
Using Information Theory to Characterize Prosodic Typology

This project explores fastText, mGPT, and mBERT models, as well as an LLM-free conditional kernel density estimation (C-KDE) approach, for predicting the lexical identity of a word from its prosody. We then use the information-theoretic measure of conditional entropy to quantify the amount of information that prosody conveys about lexical identity.
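In information-theoretic terms, the question is how much a word's prosody P reduces uncertainty about its identity W. A minimal sketch of the usual formulation (the exact estimators used in this project may differ):

I(W; P) = H(W) - H(W | P)

where H(W) is the entropy of the lexical distribution and H(W | P) is the conditional entropy of the word given its prosodic features; the lower H(W | P), the more information prosody carries about lexical identity.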


🔧 Installation & Setup

conda create --name prosody python=3.10
conda activate prosody
pip install -r requirements.txt

Alternatively, you can create the environment from the environment.yaml file:

conda env create -f environment.yaml

Data

Download the already prepared prominence dataset:

cd data
git clone https://github.com/Helsinki-NLP/prosody.git

Download the Common Voice dataset used in this project:

You can download the Common Voice dataset from the Common Voice website.

Prepare the Common Voice dataset:

Preprocess the Common Voice dataset to extract the prosodic features in two steps:

  1. Use MFA (Montreal Forced Aligner) to align the audio files with the text files. Tutorial by Eleanor Chodroff: link; the MFA tool: link.
  2. Extract the f0 features with the extract.py script and the prominence features with the extract_prominence.py script; for other prosodic features, change the arguments of extract.py. Alternatively, run the bash script f0_extraction.sh on your server to extract the f0 features. A hedged illustration of per-word f0 extraction follows this list.
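For illustration only, a per-word f0 extraction from an MFA alignment could look like the sketch below. This is not the repository's extract.py; the file paths, the "words" tier name, and the pitch range are assumptions, and it relies on the librosa and textgrid packages.

# Sketch: mean f0 per word from an MFA TextGrid (hypothetical paths and tier name).
import librosa
import numpy as np
import textgrid

audio, sr = librosa.load("clip.wav", sr=None)        # keep the native sampling rate
f0, voiced, _ = librosa.pyin(
    audio,
    fmin=librosa.note_to_hz("C2"),
    fmax=librosa.note_to_hz("C6"),
    sr=sr,
)
times = librosa.times_like(f0, sr=sr)                # frame centres in seconds

tg = textgrid.TextGrid.fromFile("clip.TextGrid")     # MFA output
for interval in tg.getFirst("words"):                # tier name is an assumption
    if not interval.mark:                            # skip silences
        continue
    in_word = (times >= interval.minTime) & (times < interval.maxTime)
    word_f0 = f0[in_word & voiced]                   # voiced frames inside the word
    mean_f0 = float(np.nanmean(word_f0)) if word_f0.size else float("nan")
    print(interval.mark, mean_f0)

The repository's scripts presumably compute richer features (e.g., prominence) and batch over the whole corpus; this only shows the alignment-to-feature step.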

Running the experiments

We followed the models from the project Quantifying the redundancy between prosody and text and improved their performance by adapting the Conditional Density Estimation (CDE) method from this repository: CDE. See the CDE Documentation for more information.

1. Model training and conditional entropy estimation

All bash scripts for training the models are in the scripts folder: the baselines subfolder contains the scripts for the fastText models with CDE, the mGPT subfolder the scripts for the mGPT models with CDE, and the mBERT subfolder the scripts for the mBERT models with CDE. The C-KDE-all and C-KDE-split models are run from the notebooks folder.

2. KDE and differential entropy estimation

In the notebooks folder, you can find the scripts that use KDE and Monte Carlo sampling for differential entropy estimation.
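As a rough illustration of that estimator (not the notebooks' exact code; the bandwidth, feature dimensionality, and sample sizes below are arbitrary), differential entropy can be approximated by fitting a KDE and averaging the negative log density over samples drawn from it:

# Sketch: Monte Carlo estimate of differential entropy from a Gaussian KDE.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 3))           # stand-in for prosodic feature vectors

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(features)
samples = kde.sample(10_000, random_state=0)    # draw from the fitted density
log_density = kde.score_samples(samples)        # log p_hat(x) at each sample

entropy_nats = -log_density.mean()              # H(X) ≈ -E[log p(X)]
print(f"Estimated differential entropy: {entropy_nats:.3f} nats")

A conditional version (as in the C-KDE models) could fit one such density per word and average the per-word entropies weighted by word frequency, which is the standard definition of conditional differential entropy.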


Folder Structure

The directory structure of the project looks like this:

├── .github                   <- Github Actions workflows
│
├── configs                   <- Hydra configs
│   ├── callbacks                <- Callbacks configs
│   ├── data                     <- Data configs
│   ├── debug                    <- Debugging configs
│   ├── experiment               <- Experiment configs
│   ├── extras                   <- Extra utilities configs
│   ├── hparams_search           <- Hyperparameter search configs
│   ├── hydra                    <- Hydra configs
│   ├── local                    <- Local configs
│   ├── logger                   <- Logger configs
│   ├── model                    <- Model configs
│   ├── paths                    <- Project paths configs
│   ├── trainer                  <- Trainer configs
│   │
│   ├── eval.yaml             <- Main config for evaluation
│   └── train.yaml            <- Main config for training
│
├── data                   <- Project data
│
├── logs                   <- Logs generated by hydra and lightning loggers
│
├── notebooks              <- Jupyter notebooks
│
├── r_scripts              <- R scripts for data analysis and visualization
│
├── results                <- Results of the experiments for data analysis and visualization in R
│
├── scripts                <- Shell scripts
│
├── src                    <- Source code
│   ├── data                     <- Data scripts
│   ├── models                   <- Model scripts
│   ├── utils                    <- Utility scripts
│   │
│   ├── eval.py                  <- Run evaluation
│   ├── train_cv.py              <- Run training with cross-validation
│   ├── train.py                 <- Run training
│   ├── extraction.py            <- Extract f0 features
│   └── extraction_prominence.py <- Extract prominence features
│  
│
├── tests                  <- Tests of any kind
│  
├── visualization          <- Visualization figures
│
├── .env.example              <- Example of file for storing private environment variables
├── .gitignore                <- List of files ignored by git
├── .pre-commit-config.yaml   <- Configuration of pre-commit hooks for code formatting
├── .project-root             <- File for inferring the position of project root directory
├── environment.yaml          <- File for installing conda environment
├── Makefile                  <- Makefile with commands like `make train` or `make test`
├── pyproject.toml            <- Configuration options for testing and linting
├── requirements.txt          <- File for installing python dependencies
├── setup.py                  <- File for installing project as a package
└── README.md
