Releases: sabszh/AlignmentUnderUse
Alignment Under Use - Research Pipeline for ChatGPT Conversation Analysis
Version 1.0.0 - Initial Release
Overview
A complete end-to-end pipeline for collecting and analyzing publicly shared ChatGPT conversations from Reddit to understand real-world usage patterns, interaction styles, and human-AI alignment.
✨ Key Features
Data Collection (3-Stage Pipeline)
- Arctic Shift API integration for Reddit post & comment discovery
- ChatGPT backend API conversation fetching with comprehensive metadata
- Cloudflare bypass support with curl-cffi for robust data retrieval
- Resume capability for interrupted collections
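The resume capability can be pictured with a small sketch. This is not the project's actual implementation: the `share_id` key, the `fetch` callback, and both function names are hypothetical, and only the standard library is used. The idea is to append one JSON record per line and, on restart, skip every ID already present in the output file.

```python
import json
from pathlib import Path


def _ends_with_newline(path: Path) -> bool:
    """True if the file is missing, empty, or already newline-terminated."""
    if not path.exists() or path.stat().st_size == 0:
        return True
    with path.open("rb") as fh:
        fh.seek(-1, 2)  # jump to the last byte
        return fh.read(1) == b"\n"


def load_done_ids(path: Path, key: str = "share_id") -> set:
    """Collect IDs already written to a JSONL file, skipping corrupt lines."""
    done = set()
    if not path.exists():
        return done
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # A crash mid-write can leave a truncated line; ignore it.
                continue
            if isinstance(record, dict) and key in record:
                done.add(record[key])
    return done


def collect(ids, fetch, out_path: Path, key: str = "share_id") -> int:
    """Fetch only IDs not yet in out_path; return how many were written."""
    done = load_done_ids(out_path, key)
    repair = not _ends_with_newline(out_path)
    written = 0
    with out_path.open("a", encoding="utf-8") as fh:
        if repair:
            fh.write("\n")  # keep a truncated last line from merging with new data
        for item_id in ids:
            if item_id in done:
                continue
            record = fetch(item_id)  # hypothetical fetcher returning a dict
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            fh.flush()  # make each record durable before moving on
            done.add(item_id)
            written += 1
    return written
```

Calling `collect` twice with overlapping ID lists fetches each ID only once, which is the behavior an interrupted run needs when it is restarted.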
Alignment Analysis
- Semantic Alignment - Cosine similarity using sentence embeddings (all-mpnet-base-v2)
- Sentiment Alignment - Emotional tone matching via polarity analysis
- Linguistic Style Matching (LSM) - Functional word category alignment
- Lexical + Syntactic Alignment - Word overlap and POS tag similarity
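To make the measures above concrete, here is a minimal sketch of three of them in plain Python. It is an illustration under stated assumptions, not the code in `src.measures`: plain lists stand in for all-mpnet-base-v2 embeddings, the function-word lists are tiny toy examples rather than a real lexicon, Jaccard overlap is assumed as one possible form of "word overlap", and every function name is hypothetical. The LSM score uses the commonly cited per-category form 1 - |p1 - p2| / (p1 + p2 + 0.0001), averaged over categories.

```python
import math
import re

# Toy function-word categories, assumed for illustration only.
CATEGORIES = {
    "pronouns": {"i", "you", "we", "it", "they", "he", "she"},
    "articles": {"a", "an", "the"},
    "prepositions": {"in", "on", "of", "to", "with", "for", "at"},
    "negations": {"no", "not", "never"},
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def category_rates(tokens):
    """Share of tokens that fall in each function-word category."""
    total = len(tokens) or 1
    return {
        name: sum(tok in words for tok in tokens) / total
        for name, words in CATEGORIES.items()
    }


def lsm(text_a, text_b):
    """Average per-category style-matching score, between 0 and 1."""
    ra = category_rates(tokenize(text_a))
    rb = category_rates(tokenize(text_b))
    scores = [
        1 - abs(ra[c] - rb[c]) / (ra[c] + rb[c] + 1e-4) for c in CATEGORIES
    ]
    return sum(scores) / len(scores)


def jaccard(text_a, text_b):
    """Lexical overlap: shared vocabulary divided by combined vocabulary."""
    a, b = set(tokenize(text_a)), set(tokenize(text_b))
    return len(a & b) / len(a | b) if a | b else 0.0
```

In use, each function would be applied to a user message and the assistant reply that follows it, giving one score per turn pair.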
Data Processing
- Text normalization with ftfy for encoding fixes
- Markdown cleaning while preserving code blocks
- Language detection and English-only filtering
- Optional PII anonymization utility
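As an illustration of Markdown cleaning that leaves code blocks alone, the sketch below splits a message on fenced blocks and strips a few common Markdown markers from the prose segments only. The function name and the particular set of markers handled (headings, bold, links) are assumptions; the project's cleaner may cover more syntax or work differently.

```python
import re

# Capture fenced blocks so re.split keeps them in the result list.
FENCE = re.compile(r"(```.*?```)", re.DOTALL)


def clean_markdown(text):
    """Strip simple Markdown markup outside fenced code blocks."""
    parts = FENCE.split(text)
    cleaned = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            # Odd positions hold the captured fenced blocks: keep verbatim.
            cleaned.append(part)
            continue
        part = re.sub(r"^#{1,6}[ \t]+", "", part, flags=re.MULTILINE)  # headings
        part = re.sub(r"\*\*(.+?)\*\*", r"\1", part)  # bold
        part = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", part)  # links
        cleaned.append(part)
    return "".join(cleaned)
```

Splitting first and cleaning second means a `#` comment or `**` operator inside a code block is never mistaken for markup.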
📦 What's Included
- Complete collection pipeline with CLI options
- Data cleaning and preprocessing utilities
- 4 alignment measurement modules (semantic, sentiment, LSM, lexsyn)
- Topic modeling (3-model KeyNMF pipeline)
- Bayesian & GAMM statistical analysis templates (R/RMarkdown)
- Comprehensive output merging and feature engineering
- Full reproducibility documentation
🛠️ Quick Start
# Install dependencies
pip install -r requirements.txt
# Run full pipeline
python -m src.collection.main
# Compute alignment scores
python -m src.measures.semantic_alignment
python -m src.measures.sentiment_alignment
📊 Output
Structured JSONL and CSV outputs with:
- Conversation metadata and full message trees
- Embeddings and sentiment scores
- Alignment metrics per turn
- Topic assignments
- Merged feature datasets for downstream analysis
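A merged feature dataset can be thought of as a join of the per-turn metric tables on a shared key. The sketch below shows that idea with plain dictionaries. The field names `conversation_id` and `turn`, the metric column names, and the function itself are hypothetical; consult the output schema documentation for the real column names.

```python
from collections import defaultdict


def merge_by_turn(*tables, keys=("conversation_id", "turn")):
    """Outer-merge rows from several metric tables on the given key fields."""
    merged = defaultdict(dict)
    for table in tables:
        for row in table:
            key = tuple(row[name] for name in keys)
            merged[key].update(row)  # later tables add their own columns
    # Sort by key so the output order is deterministic.
    return [merged[key] for key in sorted(merged)]


# Hypothetical per-turn records, as they might be read from JSONL outputs.
semantic = [{"conversation_id": "c1", "turn": 1, "semantic": 0.8}]
sentiment = [
    {"conversation_id": "c1", "turn": 1, "sentiment": 0.1},
    {"conversation_id": "c1", "turn": 2, "sentiment": -0.2},
]
rows = merge_by_turn(semantic, sentiment)
```

Because this is an outer merge, a turn that is missing from one table still appears in the result, just without that table's columns.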
📚 Documentation
- Detailed README with usage examples
- Command-line option reference for all modules
- Output schema documentation
- Data folder structure guide
- Reproducibility and ethical guidelines
⚖️ License
GNU General Public License v3.0 - See LICENSE for details