
Using Information Theory to Characterize Prosodic Typology

This project explores fastText, mGPT, and mBERT models, as well as an LLM-free conditional kernel density estimation (C-KDE) approach, for predicting the lexical identity of a word from its prosody. We then use the information-theoretic measure of conditional entropy to quantify the amount of information that prosody conveys about lexical identity.
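In information-theoretic terms, the question is how much a word's prosody P reduces uncertainty about its identity W. A minimal sketch of the usual formulation (the exact estimators used in this project may differ):

I(W; P) = H(W) - H(W | P)

where H(W) is the entropy of the lexical distribution and H(W | P) is the conditional entropy of the word given its prosodic features; the lower H(W | P), the more information prosody carries about lexical identity.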


🔧 Installation & Setup

conda create --name prosody python=3.10
conda activate prosody
pip install -r requirements.txt

Alternatively, you can create the environment from the environment.yaml file:

conda env create -f environment.yaml

Data

Download the already prepared prominence dataset:

cd data
git clone https://github.com/Helsinki-NLP/prosody.git

Download the Common Voice dataset used in this project:

You can download the Common Voice dataset from the Common Voice website.

Prepare the Common Voice dataset:

Preprocess the Common Voice dataset to extract the prosodic features in two steps:

  1. Use MFA (Montreal Forced Aligner) to align the audio files with the text files. Tutorial by Eleanor Chodroff: link; the MFA tool: link.
  2. Extract the f0 features with the extract.py script and the prominence features with the extract_prominence.py script; for other prosodic features, change the arguments of extract.py. Alternatively, run the bash script f0_extraction.sh on your server to extract the f0 features. A hedged illustration of per-word f0 extraction follows this list.
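For illustration only, a per-word f0 extraction from an MFA alignment could look like the sketch below. This is not the repository's extract.py; the file paths, the "words" tier name, and the pitch range are assumptions, and it relies on the librosa and textgrid packages.

# Sketch: mean f0 per word from an MFA TextGrid (hypothetical paths and tier name).
import librosa
import numpy as np
import textgrid

audio, sr = librosa.load("clip.wav", sr=None)        # keep the native sampling rate
f0, voiced, _ = librosa.pyin(
    audio,
    fmin=librosa.note_to_hz("C2"),
    fmax=librosa.note_to_hz("C6"),
    sr=sr,
)
times = librosa.times_like(f0, sr=sr)                # frame centres in seconds

tg = textgrid.TextGrid.fromFile("clip.TextGrid")     # MFA output
for interval in tg.getFirst("words"):                # tier name is an assumption
    if not interval.mark:                            # skip silences
        continue
    in_word = (times >= interval.minTime) & (times < interval.maxTime)
    word_f0 = f0[in_word & voiced]                   # voiced frames inside the word
    mean_f0 = float(np.nanmean(word_f0)) if word_f0.size else float("nan")
    print(interval.mark, mean_f0)

The repository's scripts presumably compute richer features (e.g., prominence) and batch over the whole corpus; this only shows the alignment-to-feature step.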

Running the experiments

We followed the models from the project Quantifying the redundancy between prosody and text and improved their performance by adapting the Conditional Density Estimation (CDE) method from this repository: CDE. See the CDE Documentation for more information.

1. Model training and conditional entropy estimation

All bash scripts for training the models are in the scripts folder: the baselines subfolder contains the scripts for the fastText models with CDE, the mGPT subfolder the scripts for the mGPT models with CDE, and the mBERT subfolder the scripts for the mBERT models with CDE. The C-KDE-all and C-KDE-split models are run from the notebooks folder.

2. KDE and differential entropy estimation

In the notebooks folder, you can find the scripts that use KDE and Monte Carlo sampling for differential entropy estimation.
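As a rough illustration of that estimator (not the notebooks' exact code; the bandwidth, feature dimensionality, and sample sizes below are arbitrary), differential entropy can be approximated by fitting a KDE and averaging the negative log density over samples drawn from it:

# Sketch: Monte Carlo estimate of differential entropy from a Gaussian KDE.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 3))           # stand-in for prosodic feature vectors

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(features)
samples = kde.sample(10_000, random_state=0)    # draw from the fitted density
log_density = kde.score_samples(samples)        # log p_hat(x) at each sample

entropy_nats = -log_density.mean()              # H(X) ≈ -E[log p(X)]
print(f"Estimated differential entropy: {entropy_nats:.3f} nats")

A conditional version (as in the C-KDE models) could fit one such density per word and average the per-word entropies weighted by word frequency, which is the standard definition of conditional differential entropy.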


Folder Structure

The directory structure of the project looks like this:

├── .github                   <- Github Actions workflows
│
├── configs                   <- Hydra configs
│   ├── callbacks                <- Callbacks configs
│   ├── data                     <- Data configs
│   ├── debug                    <- Debugging configs
│   ├── experiment               <- Experiment configs
│   ├── extras                   <- Extra utilities configs
│   ├── hparams_search           <- Hyperparameter search configs
│   ├── hydra                    <- Hydra configs
│   ├── local                    <- Local configs
│   ├── logger                   <- Logger configs
│   ├── model                    <- Model configs
│   ├── paths                    <- Project paths configs
│   ├── trainer                  <- Trainer configs
│   │
│   ├── eval.yaml             <- Main config for evaluation
│   └── train.yaml            <- Main config for training
│
├── data                   <- Project data
│
├── logs                   <- Logs generated by hydra and lightning loggers
│
├── notebooks              <- Jupyter notebooks
│
├── r_scripts              <- R scripts for data analysis and visualization
│
├── results                <- Results of the experiments for data analysis and visualization in R
│
├── scripts                <- Shell scripts
│
├── src                    <- Source code
│   ├── data                     <- Data scripts
│   ├── models                   <- Model scripts
│   ├── utils                    <- Utility scripts
│   │
│   ├── eval.py                  <- Run evaluation
│   ├── train_cv.py              <- Run training with cross-validation
│   ├── train.py                 <- Run training
│   ├── extraction.py            <- Extract f0 features
│   └── extraction_prominence.py <- Extract prominence features
│  
│
├── tests                  <- Tests of any kind
│  
├── visualization          <- Visualization figures
│
├── .env.example              <- Example of file for storing private environment variables
├── .gitignore                <- List of files ignored by git
├── .pre-commit-config.yaml   <- Configuration of pre-commit hooks for code formatting
├── .project-root             <- File for inferring the position of project root directory
├── environment.yaml          <- File for installing conda environment
├── Makefile                  <- Makefile with commands like `make train` or `make test`
├── pyproject.toml            <- Configuration options for testing and linting
├── requirements.txt          <- File for installing python dependencies
├── setup.py                  <- File for installing project as a package
└── README.md
