OpDetect: A convolutional and recurrent neural network classifier for precise and sensitive operon detection from RNA-seq data
An operon refers to a group of neighbouring genes belonging to one or more overlapping transcription units that are transcribed in the same direction and have at least one gene in common. Operons are a characteristic of prokaryotic genomes. Identifying which genes belong to the same operon facilitates understanding gene function and regulation. There are several computational approaches for operon detection; however, many of these computational approaches have been developed for a specific target bacterium or require information only available for a restricted number of bacterial species. Here, we develop a novel general method, OpDetect, that directly utilizes RNA-sequencing (RNA-seq) reads as a signal over nucleotide bases in the genome. This representation enabled us to employ a convolutional and recurrent deep neural network architecture which demonstrated superior performance in terms of recall, f1-score and AUROC compared to previous approaches. Additionally, OpDetect showcases species-agnostic capabilities, successfully detecting operons in a wide range of bacterial species and even in Caenorhabditis elegans, one of few eukaryotic organisms known to have operons.
- Species-Agnostic: Works across a wide range of bacterial species and can even detect operons in Caenorhabditis elegans, one of few eukaryotes known to have operons.
- Deep Learning Model: Utilizes a CNN-LSTM architecture to capture spatial and sequential features from RNA-seq data.
- Direct RNA-Seq Integration: Does not rely on external databases, enabling application to a broader range of organisms.
- Superior Performance: Outperforms state-of-the-art tools like OperonMapper, Rockhopper, OperonSEQer, and OperonFinder.
- Python 3.10+
- Required packages:
numpy(v1.24.4)pandas(v2.1.0)matplotlib(v3.8.4)scipy(v1.10.1)tensorflow(v2.16.1)scikit-learn(v1.2.1)
- Clone the repository:
git clone https://github.com/BioinformaticsLabAtMUN/OpDetect.git cd OpDetect - Install dependencies:
pip install -r requirements.txt
-
Prepare Read Counts and Labels:
-
The initial steps in processing the data involve trimming and filtering the raw sequencing data in FASTQ format. We performed this process using Fastp (version 0.23.1). The trimmed and filtered FastQ files are then aligned to the reference genomes using HISAT2 (version 2.2.1), and the read coverage for each genome base is extracted using SAMtools (version 1.17) and BEDtools (version 2.30.0). The specific commands for data preparation (excluding mapping) are:
A) Fastp command:
fastp -i in.R1.fq [-I in.R2.fq] -o out.R1.fq [-O out.R2.fq] --cut_front_window_size 1 --cut_front_mean_quality 3 -r --cut_right_window_size 4 --cut_right_mean_quality 15B) Commands to convert SAM to BAM and sort BAM:
samtools view -b -o BAM_file.bam SAM_file.sam samtools sort BAM_file.bam -o sorted_BAM_file.bamC) Bedtools command to obtain read counts:
bedtools genomecov -d -ibam sorted_BAM_file.bam > read_counts.bed -
Use the scripts in
2_process_readsand3_labelsdirectories to prepare the read count files and labels.
-
-
Prepare Data Directory:
- Create a subdirectory in
0_datanamed with thetxidof the organism (e.g.,txid12345). - Place the following files in the subdirectory:
- Read count files with names like
base_cov_SAMPLEID. - Gene annotation file (e.g.,
gene_annotation.bed). - [Optional] Labels file listing genes in the same operon on the same line.
- Read count files with names like
- Create a subdirectory in
-
Process Data and Run Test: From the main directory, execute the following commands:
python 4_data_process/integrate.py txid12345 0_data gene_annotation.bed base_cov labels 0_data/data_integrated.pkl
- Replace
txid12345with your organism-specific directory name. - Adjust paths to match your directory structure.
python 4_data_process/process.py 0_data data_integrated.pkl data_processed.npz TEST 0_data/txid12345/gene_pairs.csv
- Outputs
gene_pairs.csvcontaining gene-pair labels.
- Replace
-
Run Test: Go to the directory
6_test, then run:python test.py ../0_data models/versions OpDetect data_processed.npz ../0_data/txid12345/gene_pairs.csv
Below is an example workflow with txid272942:
python 4_data_process/integrate.py txid272942 0_data gene_annotation.bed base_cov labels 0_data/data_integrated.pkl
python 4_data_process/process.py 0_data data_integrated.pkl data_processed.npz TEST 0_data/txid272942/gene_pairs.csv
cd 6_test/
python test.py ../0_data models/versions OpDetect data_processed.npz ../0_data/txid272942/gene_pairs.csvThe resulting predicted labels will be saved in 7_compare/outputs/OpDetect_txid272942.csv.
Note: If the label file is unavailable and you only want predictions without true labels, enter NA in place of the label file name when running integrate.py. This will assign a value of -1 to every gene pair instead of the true labels. The code will then be modified as follows:
python 4_data_process/integrate.py txid272942 0_data gene_annotation.bed base_cov NA 0_data/data_integrated.pkl
python 4_data_process/process.py 0_data data_integrated.pkl data_processed.npz TEST 0_data/txid272942/gene_pairs.csv
cd 6_test/
python test.py ../0_data models/versions OpDetect data_processed.npz ../0_data/txid272942/gene_pairs.csvDone with txid272942
Size: 3652
Labels:
label
0 2061
1 1426
2 165
Name: count, dtype: int64
---------------------------------------
resample done
scale done
smooth done
same_length done
combine done
Model: "functional"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type) ┃ Output Shape ┃ Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ input_layer (InputLayer) │ (None, 150, 6, 3) │ 0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ conv2d (Conv2D) │ (None, 146, 1, 64) │ 5,824 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ lambda (Lambda) │ (None, 146, 64) │ 0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ lstm (LSTM) │ [(None, 146, 64), (None, │ 33,024 │
│ │ 64), (None, 64)] │ │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ self_attention (SelfAttention) │ [(None, 1024), (None, 16, │ 2,560 │
│ │ 146)] │ │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense (Dense) │ (None, 2) │ 2,050 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
Total params: 43,458 (169.76 KB)
Trainable params: 43,458 (169.76 KB)
Non-trainable params: 0 (0.00 B)
********************txid272942********************
0 non-operons were not labeled and 0 operons were not labeled
Classification report
precision recall f1-score support
0 0.77 0.63 0.70 2061
1 0.58 0.73 0.65 1426
accuracy 0.67 3487
macro avg 0.68 0.68 0.67 3487
weighted avg 0.69 0.67 0.68 3487
Predicted 0 1 All
True
0 1305 756 2061
1 381 1045 1426
All 1686 1801 3487
Total F1 score, recall and AUROC
F1 score: 0.65
Recall: 0.73
AUROC: 0.79
**************************************************
- Recall
- F1-score
- AUROC
OpDetect consistently outperforms existing methods on validation organisms, achieving in 10-fold cross-validation:
- Recall: 89.17%
- F1-score: 89.71%
- AUROC: 0.89
If you use OpDetect in your research, please cite:
Rezvan Karaji and Lourdes Peña-Castillo, "OpDetect: A Convolutional and Recurrent Neural Network Classifier for Precise and Sensitive Operon Detection from RNA-seq Data," PLoS ONE, 2025.
This tool was developed at the Department of Computer Science and Department of Biology, Memorial University of Newfoundland. Special thanks to the contributors and collaborators who made this work possible.
For questions or support, contact:
- Lourdes Peña-Castillo: lourdes@mun.ca