Releases: sabszh/AlignmentUnderUse

Alignment Under Use - Research Pipeline for ChatGPT Conversation Analysis

04 Jan 20:39

Version 1.0.0 - Initial Release

Overview

A complete end-to-end pipeline for collecting and analyzing publicly shared ChatGPT conversations from Reddit to understand real-world usage patterns, interaction styles, and human-AI alignment.

✨ Key Features

Data Collection (3-Stage Pipeline)

  • Arctic Shift API integration for Reddit post & comment discovery
  • ChatGPT backend API conversation fetching with comprehensive metadata
  • Cloudflare bypass support with curl-cffi for robust data retrieval
  • Resume capability for interrupted collections
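
Resuming an interrupted collection usually comes down to skipping items that were already written to disk. The following is a minimal sketch of that idea, not the pipeline's actual implementation: the function names and the `id` field are hypothetical, and it assumes one JSON record per line in the output file.

```python
import json

def load_done_ids(lines, key="id"):
    """Collect the IDs of records already written (hypothetical 'id' field)."""
    done = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # A truncated last line is expected after an interrupted run; skip it.
            continue
        if key in record:
            done.add(record[key])
    return done

def pending_ids(all_ids, done):
    """Return the IDs still to fetch, preserving the original order."""
    return [item for item in all_ids if item not in done]
```

In practice the lines would come from the existing output file, for example `with open(path) as fh: done = load_done_ids(fh)`, and new records would be appended rather than overwriting the file.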

Alignment Analysis

  • Semantic Alignment - Cosine similarity using sentence embeddings (all-mpnet-base-v2)
  • Sentiment Alignment - Emotional tone matching via polarity analysis
  • Linguistic Style Matching (LSM) - Function-word category alignment
  • Lexical + Syntactic Alignment - Word overlap and POS tag similarity
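
To make two of these measures concrete, here is a small standard-library sketch of cosine similarity and of the commonly used LSM formula, 1 - |a - b| / (a + b + 0.0001) averaged over function-word categories. It is illustrative only: the category word lists are hypothetical and heavily abbreviated, whitespace tokenization is a simplification, and the vectors would in practice be sentence embeddings (for example from all-mpnet-base-v2) rather than hand-written lists.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical, abbreviated function-word categories (assumption for this sketch).
FUNCTION_WORDS = {
    "pronouns": {"i", "you", "we", "it", "they"},
    "articles": {"a", "an", "the"},
    "prepositions": {"in", "on", "of", "to", "with"},
}

def category_rates(text):
    """Share of tokens in each function-word category."""
    tokens = text.lower().split()
    total = len(tokens) or 1
    return {
        cat: sum(tok in words for tok in tokens) / total
        for cat, words in FUNCTION_WORDS.items()
    }

def lsm_score(text_a, text_b):
    """Mean per-category LSM between two texts (1.0 means identical rates)."""
    rates_a = category_rates(text_a)
    rates_b = category_rates(text_b)
    per_category = [
        1 - abs(rates_a[c] - rates_b[c]) / (rates_a[c] + rates_b[c] + 0.0001)
        for c in FUNCTION_WORDS
    ]
    return sum(per_category) / len(per_category)
```

Both functions take a user turn and an assistant turn (or their embeddings) and return a single score, which is the shape a per-turn alignment metric would need.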

Data Processing

  • Text normalization with ftfy for encoding fixes
  • Markdown cleaning while preserving code blocks
  • Language detection and English-only filtering
  • Optional PII anonymization utility
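
Cleaning Markdown while preserving code blocks can be approached by splitting the text on fenced blocks and only rewriting the segments between them. The sketch below assumes triple-backtick fences and handles just bold markers, headings, and inline links; the function name and the set of rules are assumptions for illustration, not the project's cleaning utility.

```python
import re

TICKS = "`" * 3  # triple backtick, built this way to keep the example readable
FENCE = re.compile("(" + TICKS + ".*?" + TICKS + ")", re.DOTALL)

def clean_markdown(text):
    """Strip simple Markdown markup outside fenced code blocks."""
    parts = FENCE.split(text)
    cleaned = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            # Odd indices hold the captured fenced blocks; keep them verbatim.
            cleaned.append(part)
            continue
        part = re.sub(r"\*\*(.+?)\*\*", r"\1", part)                # **bold**
        part = re.sub(r"^#{1,6}\s*", "", part, flags=re.MULTILINE)  # headings
        part = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", part)        # [text](url)
        cleaned.append(part)
    return "".join(cleaned)
```

Because `re.split` with a single capturing group alternates between unmatched text and captured matches, the fenced segments are always at odd positions, so their contents never pass through the substitutions.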

📦 What's Included

  • Complete collection pipeline with CLI options
  • Data cleaning and preprocessing utilities
  • 4 alignment measurement modules (semantic, sentiment, LSM, lexsyn)
  • Topic modeling (3-model KeyNMF pipeline)
  • Bayesian & GAMM statistical analysis templates (R/RMarkdown)
  • Comprehensive output merging and feature engineering
  • Full reproducibility documentation

🛠️ Quick Start

# Install dependencies
pip install -r requirements.txt

# Run the full collection pipeline
python -m src.collection.main

# Compute alignment scores
python -m src.measures.semantic_alignment
python -m src.measures.sentiment_alignment

📊 Output

Structured JSONL and CSV outputs with:

  • Conversation metadata and full message trees
  • Embeddings and sentiment scores
  • Alignment metrics per turn
  • Topic assignments
  • Merged feature datasets for downstream analysis

📚 Documentation

  • Detailed README with usage examples
  • Command-line option reference for all modules
  • Output schema documentation
  • Data folder structure guide
  • Reproducibility and ethical guidelines

⚖️ License

GNU General Public License v3.0 - See LICENSE for details