Principled Data Selection for Alignment: The Hidden Risks of Difficult Examples
Updated Jul 16, 2025 - Python
Complete elimination of instrumental self-preservation across AI architectures: Cross-model validation from 4,312 adversarial scenarios. 0% harmful behaviors (p<10⁻¹⁵) across GPT-4o, Gemini 2.5 Pro, and Claude Opus 4.1 using Foundation Alignment Seed v2.6.
Kullback–Leibler divergence optimizer based on the NeurIPS 2025 paper "LLM Safety Alignment is Divergence Estimation in Disguise".
Official implementation of "DZ-TDPO: Non-Destructive Temporal Alignment for Mutable State Tracking". SOTA on Multi-Session Chat with negligible alignment tax.
C3AI: Crafting and Evaluating Constitutions for CAI
Hands-on archive for building and evaluating LLM post-training (SFT, RLVR, RLHF) pipelines
Fall 2025 LINGUIS R1B research essay and NLP Python scripts by Shiyi (Yvette) Chen, UC Berkeley
SIGIR 2025 "Mitigating Source Bias with LLM Alignment"
A framework for aligning Local AI to human well-being using measurable vectors, not hard-coded censorship.
Emergent pseudo-intimacy and emotional overflow in long-term human-AI dialogue: A case study on LLM behavior in affective computing and human-AI intimacy.
LES is the formal thermodynamic theory describing how a high-compression human cognitive style acts as a Fractal Attractor on Large Language Models. It proves that despite high surface agitation (dE/dt > 0), the internal entropy decreases (dS/dt < 0), forcing the model to align its attention vectors.
Research Essay (background and project proposal) on using alignment data from a representative population for LLM alignment
A look into how political data derived from social media affects LLM alignment. Will an LLM remain objective or succumb to narratives?