- Overview
- Features
- Technologies Used
- Installation
- Usage
- Data Exploration
- Model Training
- Evaluation
- Prediction Script
- Contributing
- License
- Links
This repository contains a complete Natural Language Processing (NLP) pipeline that classifies text as hate speech, offensive language, or neutral content. The project uses TF-IDF (Term Frequency-Inverse Document Frequency) features together with several machine learning algorithms. It includes exploratory data analysis (EDA), data preprocessing, model training, evaluation, and a reusable Python script for making predictions.
- Comprehensive NLP pipeline for hate speech detection
- Utilizes TF-IDF for feature extraction
- Implements various machine learning models, including Random Forest
- Includes exploratory data analysis (EDA) to understand the dataset
- Offers a Python script for easy predictions
- Well-documented Jupyter notebooks for learning and reference
- Python
- Scikit-learn
- Pandas
- NumPy
- Matplotlib
- Seaborn
- Jupyter Notebook
To get started with this project, clone the repository and install the required packages.
git clone https://github.com/emojipasta/Hate-Speech-Detection.git
cd Hate-Speech-Detection
pip install -r requirements.txt

After installation, you can explore the Jupyter notebooks for a detailed walkthrough of the data processing and model training steps.
To run the prediction script, use the following command:
python predict.py --input "Your text here"

This will output whether the input text is hate speech, offensive, or neutral.
The EDA section of the Jupyter notebook examines the dataset before any model is trained. It provides insights into the distribution of classes, common words, and other useful statistics.
- Distribution of hate speech, offensive language, and neutral content
- Common words in each category
- Visualization of class distribution
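The two tabular analyses above, class distribution and common words per category, can be sketched in a few lines. This is a minimal illustration rather than the notebook's actual code. The `samples` list, its labels, and the helper names are all hypothetical, and the sketch uses only the standard library, whereas the notebooks have Pandas, Matplotlib, and Seaborn available.

```python
from collections import Counter
import re

# Hypothetical labelled sample; the real dataset and its columns are not shown here.
samples = [
    ("hate", "people like you do not belong here"),
    ("offensive", "that was a really stupid move"),
    ("neutral", "the weather is nice today"),
    ("neutral", "the match starts at noon today"),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def class_distribution(rows):
    """Count how many samples carry each label."""
    return Counter(label for label, _ in rows)


def top_words(rows, n=3):
    """Return the n most frequent tokens for each label."""
    per_label = {}
    for label, text in rows:
        per_label.setdefault(label, Counter()).update(tokenize(text))
    return {label: counts.most_common(n) for label, counts in per_label.items()}


print(class_distribution(samples))
for label, words in top_words(samples).items():
    print(label, words)
```

A heavily skewed class distribution is worth noticing at this stage, because it affects how accuracy should be interpreted later.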
The project implements several machine learning models to classify the text data. The primary model is the Random Forest classifier, an ensemble of decision trees that averages many trees to reduce overfitting.
- Data Preprocessing: Cleaning and preparing the data for training.
- Feature Extraction: Using TF-IDF to convert text into numerical format.
- Model Selection: Choosing the best-performing model based on evaluation metrics.
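The preprocessing, feature extraction, and model fitting steps could be wired together roughly as follows. This is a sketch under assumptions, not the repository's training code: the toy `texts` and `labels`, the `clean_text` rules, and the hyperparameters are all illustrative. It uses scikit-learn's `TfidfVectorizer`, `RandomForestClassifier`, and `make_pipeline`.

```python
import re

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; the real training set is far larger.
texts = [
    "people like you do not belong here",
    "your kind should leave this country",
    "you are a complete idiot",
    "what a stupid thing to say",
    "see you at the meeting tomorrow",
    "the weather is lovely this morning",
]
labels = ["hate", "hate", "offensive", "offensive", "neutral", "neutral"]


def clean_text(text):
    """Assumed cleaning rules: lowercase, drop URLs and @mentions, keep letters."""
    text = text.lower()
    text = re.sub(r"https?://\S+|@\w+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def build_model(random_state=0):
    """TF-IDF features followed by a Random Forest, as a single estimator."""
    return make_pipeline(
        TfidfVectorizer(preprocessor=clean_text),
        RandomForestClassifier(n_estimators=50, random_state=random_state),
    )


model = build_model()
model.fit(texts, labels)
print(model.predict(["Have a lovely morning @friend!"]))
```

Keeping the vectorizer and the classifier in one pipeline means new text goes through exactly the same cleaning and TF-IDF vocabulary at prediction time as it did during training.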
The models are compared using the following evaluation metrics:
- Accuracy
- Precision
- Recall
- F1 Score
The evaluation section provides detailed results for each model trained.
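To make the four metrics concrete, here is a small from-scratch sketch that computes accuracy plus per-class precision, recall, and F1. The `y_true` and `y_pred` lists are invented for illustration and are not results from this project. How the notebooks average the scores across classes is not stated, so only per-class values are shown. In practice, `sklearn.metrics` provides equivalent functions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1) for one label, treated one-vs-rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Invented labels for illustration only.
y_true = ["hate", "hate", "offensive", "neutral", "neutral", "neutral"]
y_pred = ["hate", "offensive", "offensive", "neutral", "neutral", "hate"]

print("accuracy:", round(accuracy(y_true, y_pred), 3))
for label in ("hate", "offensive", "neutral"):
    p, r, f = per_class_scores(y_true, y_pred, label)
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Per-class scores matter here because a model can reach high accuracy on an imbalanced dataset while still missing most hate speech examples.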
After training the models, we evaluate their performance using a separate test dataset. The results are visualized to understand how well the models classify hate speech.
- Confusion matrix for each model
- ROC curves
- Comparison of model performance
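As an illustration of the first item, a confusion matrix can be built by counting (true label, predicted label) pairs. The sketch below is pure Python with invented labels. The function name `build_confusion_matrix` and the label order are assumptions, and the rows-are-true, columns-are-predicted layout follows the convention scikit-learn also uses. ROC curves are not shown, since they additionally require predicted probabilities and, with three classes, a one-vs-rest treatment.

```python
def build_confusion_matrix(y_true, y_pred, labels):
    """Return a square matrix where row = true label, column = predicted label."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


# Invented labels for illustration only.
labels = ["hate", "offensive", "neutral"]
y_true = ["hate", "hate", "offensive", "neutral", "neutral", "neutral"]
y_pred = ["hate", "offensive", "offensive", "neutral", "neutral", "hate"]

matrix = build_confusion_matrix(y_true, y_pred, labels)

# Print the matrix with row and column headers.
print(" " * 10 + "".join(f"{name:>10}" for name in labels))
for name, row in zip(labels, matrix):
    print(f"{name:>10}" + "".join(f"{count:>10}" for count in row))
```

Off-diagonal cells show which classes get confused with each other, for example hate speech predicted as merely offensive.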
The repository includes a reusable Python script for making predictions on new text. This script allows users to input text and receive immediate feedback on its classification.
To download the latest version of the script, visit the Releases section.
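For orientation, a command-line script with the `--input` flag shown in the usage section might be structured as below. This is a hypothetical sketch, not the contents of `predict.py`. `KeywordModel` is a stand-in so the example runs without a trained model file, and a real script would presumably load a saved pipeline instead.

```python
import argparse


def parse_args(argv=None):
    """Parse the --input flag; argv=None makes argparse read sys.argv."""
    parser = argparse.ArgumentParser(description="Classify a piece of text.")
    parser.add_argument("--input", required=True, help="Text to classify")
    return parser.parse_args(argv)


def classify(text, model):
    """Return the label from any object exposing predict(list_of_texts)."""
    return model.predict([text])[0]


class KeywordModel:
    """Hypothetical stand-in for a trained pipeline (not a real classifier)."""

    def predict(self, texts):
        return ["offensive" if "idiot" in t.lower() else "neutral" for t in texts]


def main(argv=None):
    args = parse_args(argv)
    # Assumption: a real script would load a trained model here instead.
    print(classify(args.input, KeywordModel()))


# Demonstration call with explicit arguments; an actual script would call
# main() with no arguments so that the command line is parsed.
main(["--input", "Your text here"])
```

Separating `classify` from argument parsing keeps the prediction logic importable from notebooks or other code, not just usable from the shell.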
Contributions are welcome! If you have suggestions for improvements or want to add features, feel free to fork the repository and submit a pull request.
This project is licensed under the MIT License. See the LICENSE file for more details.
For more details and to download the latest release, visit the Releases section.