Hate Speech Detection: An NLP Pipeline for Text Classification

Overview

This repository contains a Natural Language Processing (NLP) pipeline that classifies text as hate speech, offensive language, or neutral content. It extracts features with TF-IDF (Term Frequency-Inverse Document Frequency) and trains several machine learning classifiers on them. The repository covers exploratory data analysis (EDA), data preprocessing, model training, evaluation, and a reusable Python script for making predictions.

Features

NLP pipeline for hate speech detection
TF-IDF feature extraction
Several machine learning models, including Random Forest
Exploratory data analysis (EDA) of the dataset
Python script for predictions on new text
Documented Jupyter notebooks for learning and reference

Technologies Used

Python
Scikit-learn
Pandas
NumPy
Matplotlib
Seaborn
Jupyter Notebook

Installation

To get started with this project, clone the repository and install the required packages.

git clone https://github.com/emojipasta/Hate-Speech-Detection.git
cd Hate-Speech-Detection
pip install -r requirements.txt

Usage

After installation, open the Jupyter notebooks to follow the data processing and model training steps in detail.

To run the prediction script, use the following command:

python predict.py --input "Your text here"

The script prints the predicted class for the input text: hate speech, offensive, or neutral.
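For readers curious how a command-line entry point of this shape might be wired up, here is a minimal sketch using argparse. Only the `--input` flag comes from the command above. The names `placeholder_classify`, `build_parser`, and `main` are hypothetical, and the placeholder always returns "neutral"; this is not the repository's predict.py, which would presumably load the trained model instead.

```python
import argparse

# Class names as described in this README.
LABELS = ("hate speech", "offensive", "neutral")


def placeholder_classify(text: str) -> str:
    """Hypothetical stand-in for the trained model; always returns 'neutral'."""
    return "neutral"


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the --input flag shown above."""
    parser = argparse.ArgumentParser(description="Classify a piece of text.")
    parser.add_argument("--input", required=True, help="Text to classify")
    return parser


def main(argv=None) -> str:
    """Parse arguments, classify the text, print and return the label."""
    args = build_parser().parse_args(argv)
    label = placeholder_classify(args.input)
    print(label)
    return label
```

Calling `main(["--input", "Your text here"])` exercises the sketch from Python. In a standalone script, `main()` would normally be called under an `if __name__ == "__main__":` guard.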

Data Exploration

The EDA section of the Jupyter notebook examines the dataset used to train the models. It covers the distribution of classes, the most common words, and other summary statistics.

Key Insights

Distribution of hate speech vs. neutral content
Common words in each category
Visualization of class distribution
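The insights listed above can be computed in a few lines of pandas. The sketch below is illustrative only: the toy rows, the column names `text` and `label`, and the helper `top_words` are assumptions, not taken from the project's notebook or dataset.

```python
from collections import Counter

import pandas as pd

# Toy data invented for illustration; column names are assumptions.
df = pd.DataFrame({
    "text": [
        "have a nice day",
        "what a nice view",
        "you are an idiot",
        "shut up you idiot",
    ],
    "label": ["neutral", "neutral", "offensive", "offensive"],
})

# Share of each class in the dataset.
class_share = df["label"].value_counts(normalize=True)


def top_words(texts, n=3):
    """Return the n most frequent whitespace-separated tokens."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return counts.most_common(n)


# Most common words within each class.
common_by_class = {
    label: top_words(group["text"]) for label, group in df.groupby("label")
}
```

With Matplotlib installed, `class_share.plot(kind="bar")` would turn the class shares into a bar chart.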

Model Training

The project implements several machine learning models to classify the text data. The primary model is the Random Forest classifier, an ensemble of decision trees whose combined votes are generally less prone to overfitting than a single tree.

Training Process

Data Preprocessing: Cleaning and preparing the data for training.
Feature Extraction: Using TF-IDF to convert text into numerical format.
Model Selection: Choosing the best-performing model based on evaluation metrics.
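The feature extraction and classification steps above can be sketched as a single scikit-learn pipeline. This is a minimal illustration under stated assumptions, not the notebook's actual code: the six toy sentences, the two-class label set, and the hyperparameters (`stop_words="english"`, `n_estimators=100`, `random_state=0`) are all invented for the example.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy examples invented for illustration; a real dataset is far larger
# and would include all three classes.
texts = [
    "have a lovely day",
    "thanks for your help",
    "what a great game",
    "you are a total idiot",
    "shut up you moron",
    "nobody likes you loser",
]
labels = ["neutral", "neutral", "neutral",
          "offensive", "offensive", "offensive"]

# TF-IDF turns each text into a sparse numeric vector;
# the Random Forest is then trained on those vectors.
model = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
model.fit(texts, labels)

# Classify a new piece of text with the fitted pipeline.
prediction = model.predict(["have a great day"])[0]
```

Bundling the vectorizer and the classifier in one `Pipeline` means new text passes through the same fitted TF-IDF vocabulary at prediction time, which avoids a mismatch between training and inference features.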

Evaluation Metrics

Accuracy
Precision
Recall
F1 Score

The evaluation section provides detailed results for each model trained.
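As a small worked example of the four metrics, the sketch below scores six hypothetical predictions with scikit-learn. The label lists are invented, and macro averaging (an unweighted mean over the classes) is an assumption; the README does not say which averaging the project uses.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical true and predicted labels for six test examples.
y_true = ["hate", "hate", "offensive", "offensive", "neutral", "neutral"]
y_pred = ["hate", "offensive", "offensive", "offensive", "neutral", "hate"]

# Fraction of examples labeled correctly (4 of 6 here).
accuracy = accuracy_score(y_true, y_pred)

# Precision, recall, and F1 computed per class, then averaged with
# equal weight per class ("macro").
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
```

Macro averaging treats every class equally, which matters when one class is much rarer than the others; `average="weighted"` would instead weight each class by its number of true examples.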

Evaluation

After training the models, we evaluate their performance using a separate test dataset. The results are visualized to understand how well the models classify hate speech.

Results

Confusion matrix for each model
ROC curves
Comparison of model performance
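The first two result types can be reproduced on toy data as follows. All labels and scores here are hypothetical. Because ROC analysis is defined for binary problems, the sketch treats "hate" as the positive class against the rest (one-vs-rest), which is an assumption about how the project handles its three classes.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

class_names = ["hate", "offensive", "neutral"]

# Hypothetical true and predicted labels for six test examples.
y_true = ["hate", "hate", "offensive", "offensive", "neutral", "neutral"]
y_pred = ["hate", "offensive", "offensive", "offensive", "neutral", "hate"]

# Rows are true classes, columns are predicted classes,
# both in the order given by class_names.
cm = confusion_matrix(y_true, y_pred, labels=class_names)

# One-vs-rest ROC for the "hate" class: 1 where the true label is "hate".
y_is_hate = [1 if label == "hate" else 0 for label in y_true]
# Hypothetical model scores for the "hate" class, one per example.
hate_scores = [0.9, 0.4, 0.3, 0.2, 0.1, 0.6]

# Points of the ROC curve and the area under it.
fpr, tpr, thresholds = roc_curve(y_is_hate, hate_scores)
auc = roc_auc_score(y_is_hate, hate_scores)
```

Plotting `fpr` against `tpr` with Matplotlib gives the ROC curve for that class; repeating the procedure per class yields one curve for each.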

Prediction Script

The repository includes a reusable Python script for making predictions on new text. This script allows users to input text and receive immediate feedback on its classification.

To download the latest version of the script, visit the Releases section.

Contributing

Contributions are welcome! If you have suggestions for improvements or want to add features, feel free to fork the repository and submit a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for more details.

Links

For more details and to download the latest release, visit the Releases section.
