feat: add precision checker with hook system and command-line control #102
Open · chen2021673 wants to merge 7 commits into master from precision_checker
Conversation
This PR introduces a comprehensive precision-checking system for debugging numerical accuracy issues in distributed training.

**Core Features:**
- Two-level precision checking (module-level and function-level)
- Command-line flags: `--precision_check`, `--precision_check_all_ranks`
- Extensible hook system for Functions, Modules, and Tensors
- Automatic FP32 reference computation for validation

**Hook System:**
- Forward/backward pre/post hooks for Functions and Modules
- Tensor gradient hooks for inspection
- Unified hook type definitions to reduce code duplication

**Implementation:**
- PrecisionChecker utility with configurable check levels
- Integration with autograd Function and nn::Module
- Support for distributed training (per-rank checking)
- Detailed logging to precision_check_rank_[N].log files

**Documentation:**
- docs/hook_mechanism.md - Hook system architecture
- docs/precision_checker_guide.md - Usage guide

**Testing:**
- test/hook/test_hook.cc - Hook functionality tests
- test/hook/test_precision_check.cc - Precision checker tests

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
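To illustrate the pattern this PR describes, here is a minimal sketch of a module-level forward post-hook that validates output against an independently computed FP32 reference. The names (`MiniModule`, `RegisterForwardPostHook`, `CheckAgainstReference`) are illustrative assumptions, not the repository's actual API.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

// Hypothetical sketch of a forward post-hook mechanism in the spirit of
// this PR's hook system; not the repo's real interface.
class MiniModule {
 public:
  using Tensor = std::vector<float>;
  using ForwardPostHook = std::function<void(const Tensor& input, const Tensor& output)>;

  // Returns an index that could serve as a removal handle.
  size_t RegisterForwardPostHook(ForwardPostHook hook) {
    post_hooks_.push_back(std::move(hook));
    return post_hooks_.size() - 1;
  }

  // Toy compute (doubles each element), then fires all post-hooks.
  Tensor Forward(const Tensor& input) {
    Tensor output(input.size());
    for (size_t i = 0; i < input.size(); ++i) output[i] = input[i] * 2.0f;
    for (const auto& hook : post_hooks_) hook(input, output);
    return output;
  }

 private:
  std::vector<ForwardPostHook> post_hooks_;
};

// A precision-check hook body: compare module output against an FP32
// reference and return the max absolute error.
float CheckAgainstReference(const MiniModule::Tensor& output,
                            const MiniModule::Tensor& reference) {
  float max_err = 0.0f;
  for (size_t i = 0; i < output.size(); ++i)
    max_err = std::max(max_err, std::fabs(output[i] - reference[i]));
  return max_err;
}
```

In the PR's design the hook would log the error to the per-rank log file instead of returning it; the sketch keeps the control flow only.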
…omprehensive docs

- Add PrecisionCheckConfig and PrecisionCheckContext for better state management
- Refactor precision checker to use context-based architecture
- Add comprehensive documentation (hook_mechanism.md, precision_checker_guide.md)
- Add test cases for hook system and precision checking
- Update CMakeLists.txt to include new test targets
- Improve command-line flag handling in examples

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Unify Function and Module hook infrastructure into common/hook.h - Remove duplicated HookHandle and HookHandleImpl classes - Update precision_checker_guide.md and hook_mechanism.md
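The deduplication this commit describes can be sketched as one generic registry whose nested handle type works for any hook signature, so Functions and Modules no longer need separate HookHandle classes. All names below are illustrative assumptions, not the contents of common/hook.h.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <utility>

// Hypothetical unified hook registry: one Handle type per registry,
// parameterized on the hook signature (sketch, not the repo's code).
template <typename HookFn>
class HookRegistry {
 public:
  class Handle {
   public:
    Handle(HookRegistry* registry, int id) : registry_(registry), id_(id) {}
    // Unregisters the hook this handle refers to.
    void Remove() { registry_->hooks_.erase(id_); }

   private:
    HookRegistry* registry_;
    int id_;
  };

  Handle Register(HookFn fn) {
    int id = next_id_++;
    hooks_.emplace(id, std::move(fn));
    return Handle(this, id);
  }

  // Invokes every registered hook with the given arguments.
  template <typename... Args>
  void Fire(Args&&... args) const {
    for (const auto& entry : hooks_) entry.second(args...);
  }

  size_t Size() const { return hooks_.size(); }

 private:
  std::map<int, HookFn> hooks_;
  int next_id_ = 0;
};
```

With this shape, a Module's forward hooks and a Function's backward hooks are just two instantiations of the same template, which is the kind of duplication removal the commit message claims.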
chen2021673 commented on Jan 15, 2026:
```cpp
int pp_rank = 0;

// Set thread-local global rank
nn::parallel::global::thread_global_rank = rank.GlobalRank();
```
We should later look into whether there is a more elegant replacement for this global variable.
This commit fixes the issue where only rank 0 generated precision check log files when running with tensor parallelism. The root cause was that GetLogStream() used process-global static variables, causing all threads in a single process to share the same log file handle.

Changes:
- Add thread_global_rank thread-local variable to track per-thread rank
- Convert GetLogStream() and TableHeaderPrinted() to use thread_local storage
- Set thread_global_rank in the Train() function for each thread
- Move baseline output (key|md5 format) into the table-format branch to avoid duplicate output in simple format
- Add directory creation and error handling for log file opening

With these changes, each thread now creates its own log file based on its global rank (process_rank * nthread_per_process + thread_rank).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
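The mechanics of the fix can be sketched as follows: a `thread_local` rank plus `thread_local` derived state means each training thread resolves its own `precision_check_rank_[N].log` name instead of sharing one process-global value. The rank formula is taken from the commit message; the function names here are illustrative, not the repo's actual ones.

```cpp
#include <string>
#include <thread>
#include <vector>

// Each thread carries its own rank (sketch of the PR's
// nn::parallel::global::thread_global_rank).
thread_local int thread_global_rank = -1;

// Global rank layout stated in the commit message:
// global rank = process_rank * nthread_per_process + thread_rank.
int ComputeGlobalRank(int process_rank, int nthread_per_process, int thread_rank) {
  return process_rank * nthread_per_process + thread_rank;
}

// Hypothetical stand-in for GetLogStream(): because the cached name is
// thread_local, it is initialized once per thread (after that thread has
// set its rank), not once per process.
std::string LogFileName() {
  thread_local std::string name =
      "precision_check_rank_" + std::to_string(thread_global_rank) + ".log";
  return name;
}
```

Had `name` been a plain function-level `static`, the first thread to call `LogFileName()` would fix the file name for the whole process, which is exactly the rank-0-only symptom described above.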
Force-pushed from d35e92a to a7806d9.
Add tools/compare_loss.py to automate end-to-end loss comparison between two log directories, eliminating manual verification overhead as test cases scale up. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>