🚀 Alignment Under Use - Research Pipeline for ChatGPT Conversation Analysis

Version 1.0.0 - Initial Release

Overview

An end-to-end pipeline that collects publicly shared ChatGPT conversations from Reddit and analyzes them to study real-world usage patterns, interaction styles, and human-AI alignment.

✨ Key Features

Data Collection (3-Stage Pipeline)

Arctic Shift API integration for Reddit post & comment discovery
ChatGPT backend API conversation fetching with comprehensive metadata
Cloudflare bypass support with curl-cffi for robust data retrieval
Resume capability for interrupted collections
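
As a rough illustration of how resume capability can work with JSONL output, the sketch below re-reads the output file, collects the IDs already saved, and fetches only the missing ones. The names `load_done_ids`, `collect`, the `fetch` callback, and the `id` field are hypothetical and are not taken from this repository; the actual collection modules may track progress differently.

```python
import json
import os


def load_done_ids(path, key="id"):
    """Return the set of IDs already present in a JSONL file (empty if the file is absent)."""
    done = set()
    if not os.path.exists(path):
        return done
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)[key])
            except (json.JSONDecodeError, KeyError):
                # Skip malformed lines rather than abort the whole run.
                continue
    return done


def collect(ids, fetch, path):
    """Fetch every ID not yet stored in `path`, appending one JSON record per line.

    `fetch` is a hypothetical callable that returns a dict containing an "id" field.
    Returns the number of records fetched in this run.
    """
    done = load_done_ids(path)
    fetched = 0
    with open(path, "a", encoding="utf-8") as fh:
        for item_id in ids:
            if item_id in done:
                continue  # already collected in an earlier run
            record = fetch(item_id)
            fh.write(json.dumps(record) + "\n")
            fh.flush()  # push each record to the OS promptly
            done.add(item_id)
            fetched += 1
    return fetched
```

Calling `collect` a second time with the same output path skips everything written by the first call, so an interrupted run can simply be restarted.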

Alignment Analysis

Semantic Alignment - Cosine similarity using sentence embeddings (all-mpnet-base-v2)
Sentiment Alignment - Emotional tone matching via polarity analysis
Linguistic Style Matching (LSM) - Functional word category alignment
Lexical + Syntactic Alignment - Word overlap and POS tag similarity
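
To make two of these measures concrete, here is a minimal, dependency-free sketch of cosine similarity between two embedding vectors and of a Linguistic Style Matching score. It assumes the commonly used LSM formulation, 1 - |a - b| / (a + b + 0.0001) averaged over function-word categories; the tiny word lists and the function names are illustrative assumptions, not the pipeline's own lexicon or API. In the pipeline the vectors would come from all-mpnet-base-v2 rather than being written by hand.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy function-word lists (assumption): real LSM uses much larger category lexicons.
CATEGORIES = {
    "articles": {"a", "an", "the"},
    "pronouns": {"i", "you", "we", "it", "they"},
    "prepositions": {"in", "on", "of", "to", "with"},
}


def category_rates(text):
    """Share of whitespace-separated tokens that fall in each function-word category."""
    tokens = text.lower().split()
    total = len(tokens) or 1  # avoid division by zero on empty text
    return {
        name: sum(tok in words for tok in tokens) / total
        for name, words in CATEGORIES.items()
    }


def lsm_score(text_a, text_b, eps=1e-4):
    """Mean per-category style match between two texts, in the range [0, 1]."""
    rates_a = category_rates(text_a)
    rates_b = category_rates(text_b)
    per_category = [
        1 - abs(rates_a[c] - rates_b[c]) / (rates_a[c] + rates_b[c] + eps)
        for c in CATEGORIES
    ]
    return sum(per_category) / len(per_category)
```

Both functions return higher values for closer matches: identical vectors give a cosine of 1, and two texts with identical function-word rates give an LSM score of 1.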

Data Processing

Text normalization with ftfy for encoding fixes
Markdown cleaning while preserving code blocks
Language detection and English-only filtering
Optional PII anonymization utility
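
Markdown cleaning that preserves code blocks can be sketched with the standard library alone: split the text on fenced blocks, strip markup only from the prose segments, and pass the code segments through untouched. The function name `clean_markdown` and the specific rules (links, headings, bold) are assumptions chosen for illustration; the repository's cleaning utility may cover more syntax or use a different approach.

```python
import re

FENCE_MARK = "`" * 3  # a triple-backtick fence, built here to keep this example readable
FENCED_BLOCK = re.compile("(" + FENCE_MARK + ".*?" + FENCE_MARK + ")", re.DOTALL)


def clean_markdown(text):
    """Strip common Markdown markup from prose while leaving fenced code blocks intact."""
    # Splitting on a capturing group keeps the fenced blocks at the odd indices.
    parts = FENCED_BLOCK.split(text)
    cleaned = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            cleaned.append(part)  # fenced code block: keep verbatim
            continue
        part = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", part)  # [text](url) -> text
        part = re.sub(r"^#{1,6}[ \t]+", "", part, flags=re.MULTILINE)  # heading markers
        part = re.sub(r"(\*\*|__)(.+?)\1", r"\2", part)  # bold markers
        cleaned.append(part)
    return "".join(cleaned)
```

Anything inside a fence, including characters that look like Markdown, passes through unchanged, which matters when the conversations contain code.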

📦 What's Included

Complete collection pipeline with CLI options
Data cleaning and preprocessing utilities
4 alignment measurement modules (semantic, sentiment, LSM, lexsyn)
Topic modeling (3-model KeyNMF pipeline)
Bayesian & GAMM statistical analysis templates (R/RMarkdown)
Comprehensive output merging and feature engineering
Full reproducibility documentation

🛠️ Quick Start

# Install dependencies
pip install -r requirements.txt

# Run the full collection pipeline
python -m src.collection.main

# Compute alignment scores
python -m src.measures.semantic_alignment
python -m src.measures.sentiment_alignment

📊 Output

Structured JSONL and CSV outputs with:

Conversation metadata and full message trees
Embeddings and sentiment scores
Alignment metrics per turn
Topic assignments
Merged feature datasets for downstream analysis
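
One way to picture how per-measure JSONL outputs become a merged feature dataset is sketched below: records from several files are joined on a shared key and written out as CSV. The `turn_id` key, the metric field names, and the helper functions are hypothetical; see the output schema documentation for the real field names.

```python
import csv
import io
import json


def read_jsonl(lines):
    """Parse an iterable of JSONL lines into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def merge_on(key, *tables):
    """Combine rows from several tables that share the same value of `key`."""
    merged = {}
    for table in tables:
        for row in table:
            merged.setdefault(row[key], {}).update(row)
    return list(merged.values())


def to_csv(rows):
    """Serialize rows to CSV text; fields missing from a row are left empty."""
    fields = sorted({name for row in rows for name in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Example with two hypothetical per-measure outputs.
semantic = read_jsonl(['{"turn_id": "c1-0", "semantic": 0.8}',
                       '{"turn_id": "c1-1", "semantic": 0.5}'])
sentiment = read_jsonl(['{"turn_id": "c1-0", "sentiment": 0.1}'])
merged_rows = merge_on("turn_id", semantic, sentiment)
```

Rows that appear in only one input are kept, with the missing metrics left blank in the CSV, so a turn is not dropped just because one measure could not be computed for it.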

📚 Documentation

Detailed README with usage examples
Command-line option reference for all modules
Output schema documentation
Data folder structure guide
Reproducibility and ethical guidelines

⚖️ License

GNU General Public License v3.0 - See LICENSE for details

Releases: sabszh/AlignmentUnderUse