feat: add precision checker with hook system and command-line control #102
Open · chen2021673 wants to merge 7 commits into master from precision_checker
Conversation
This PR introduces a comprehensive precision-checking system for debugging numerical accuracy issues in distributed training.

**Core Features:**
- Two-level precision checking (module-level and function-level)
- Command-line flags: `--precision_check`, `--precision_check_all_ranks`
- Extensible hook system for Functions, Modules, and Tensors
- Automatic FP32 reference computation for validation

**Hook System:**
- Forward/backward pre/post hooks for Functions and Modules
- Tensor gradient hooks for inspection
- Unified hook type definitions to reduce code duplication

**Implementation:**
- PrecisionChecker utility with configurable check levels
- Integration with autograd Function and nn::Module
- Support for distributed training (per-rank checking)
- Detailed logging to precision_check_rank_[N].log files

**Documentation:**
- docs/hook_mechanism.md - Hook system architecture
- docs/precision_checker_guide.md - Usage guide

**Testing:**
- test/hook/test_hook.cc - Hook functionality tests
- test/hook/test_precision_check.cc - Precision checker tests

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
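To illustrate the pattern this PR describes, here is a minimal sketch of a module-level forward post-hook that validates output against an independently computed FP32 reference. The names (`MiniModule`, `RegisterForwardPostHook`, `CheckAgainstReference`) are illustrative assumptions, not the repository's actual API.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

// Hypothetical sketch of a forward post-hook mechanism in the spirit of
// this PR's hook system; not the repo's real interface.
class MiniModule {
 public:
  using Tensor = std::vector<float>;
  using ForwardPostHook = std::function<void(const Tensor& input, const Tensor& output)>;

  // Returns an index that could serve as a removal handle.
  size_t RegisterForwardPostHook(ForwardPostHook hook) {
    post_hooks_.push_back(std::move(hook));
    return post_hooks_.size() - 1;
  }

  // Toy compute (doubles each element), then fires all post-hooks.
  Tensor Forward(const Tensor& input) {
    Tensor output(input.size());
    for (size_t i = 0; i < input.size(); ++i) output[i] = input[i] * 2.0f;
    for (const auto& hook : post_hooks_) hook(input, output);
    return output;
  }

 private:
  std::vector<ForwardPostHook> post_hooks_;
};

// A precision-check hook body: compare module output against an FP32
// reference and return the max absolute error.
float CheckAgainstReference(const MiniModule::Tensor& output,
                            const MiniModule::Tensor& reference) {
  float max_err = 0.0f;
  for (size_t i = 0; i < output.size(); ++i)
    max_err = std::max(max_err, std::fabs(output[i] - reference[i]));
  return max_err;
}
```

In the PR's design the hook would log the error to the per-rank log file instead of returning it; the sketch keeps the control flow only.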
…omprehensive docs

- Add PrecisionCheckConfig and PrecisionCheckContext for better state management
- Refactor precision checker to use context-based architecture
- Add comprehensive documentation (hook_mechanism.md, precision_checker_guide.md)
- Add test cases for hook system and precision checking
- Update CMakeLists.txt to include new test targets
- Improve command-line flag handling in examples

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Unify Function and Module hook infrastructure into common/hook.h - Remove duplicated HookHandle and HookHandleImpl classes - Update precision_checker_guide.md and hook_mechanism.md
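The deduplication this commit describes can be sketched as one generic registry whose nested handle type works for any hook signature, so Functions and Modules no longer need separate HookHandle classes. All names below are illustrative assumptions, not the contents of common/hook.h.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <utility>

// Hypothetical unified hook registry: one Handle type per registry,
// parameterized on the hook signature (sketch, not the repo's code).
template <typename HookFn>
class HookRegistry {
 public:
  class Handle {
   public:
    Handle(HookRegistry* registry, int id) : registry_(registry), id_(id) {}
    // Unregisters the hook this handle refers to.
    void Remove() { registry_->hooks_.erase(id_); }

   private:
    HookRegistry* registry_;
    int id_;
  };

  Handle Register(HookFn fn) {
    int id = next_id_++;
    hooks_.emplace(id, std::move(fn));
    return Handle(this, id);
  }

  // Invokes every registered hook with the given arguments.
  template <typename... Args>
  void Fire(Args&&... args) const {
    for (const auto& entry : hooks_) entry.second(args...);
  }

  size_t Size() const { return hooks_.size(); }

 private:
  std::map<int, HookFn> hooks_;
  int next_id_ = 0;
};
```

With this shape, a Module's forward hooks and a Function's backward hooks are just two instantiations of the same template, which is the kind of duplication removal the commit message claims.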
chen2021673 commented on Jan 15, 2026:
```cpp
int pp_rank = 0;

// Set thread-local global rank
nn::parallel::global::thread_global_rank = rank.GlobalRank();
```
We should later look into whether there is a more elegant replacement for this global variable.
This commit fixes the issue where only rank 0 generated precision check log files when running with tensor parallelism. The root cause was that GetLogStream() used process-global static variables, causing all threads in a single process to share the same log file handle.

Changes:
- Add thread_global_rank thread-local variable to track per-thread rank
- Convert GetLogStream() and TableHeaderPrinted() to use thread_local storage
- Set thread_global_rank in the Train() function for each thread
- Move baseline output (key|md5 format) into the table-format branch to avoid duplicate output in simple format
- Add directory creation and error handling for log file opening

With these changes, each thread now creates its own log file based on its global rank (process_rank * nthread_per_process + thread_rank).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
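The mechanics of the fix can be sketched as follows: a `thread_local` rank plus `thread_local` derived state means each training thread resolves its own `precision_check_rank_[N].log` name instead of sharing one process-global value. The rank formula is taken from the commit message; the function names here are illustrative, not the repo's actual ones.

```cpp
#include <string>
#include <thread>
#include <vector>

// Each thread carries its own rank (sketch of the PR's
// nn::parallel::global::thread_global_rank).
thread_local int thread_global_rank = -1;

// Global rank layout stated in the commit message:
// global rank = process_rank * nthread_per_process + thread_rank.
int ComputeGlobalRank(int process_rank, int nthread_per_process, int thread_rank) {
  return process_rank * nthread_per_process + thread_rank;
}

// Hypothetical stand-in for GetLogStream(): because the cached name is
// thread_local, it is initialized once per thread (after that thread has
// set its rank), not once per process.
std::string LogFileName() {
  thread_local std::string name =
      "precision_check_rank_" + std::to_string(thread_global_rank) + ".log";
  return name;
}
```

Had `name` been a plain function-level `static`, the first thread to call `LogFileName()` would fix the file name for the whole process, which is exactly the rank-0-only symptom described above.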
Force-pushed from d35e92a to a7806d9.
Add tools/compare_loss.py to automate end-to-end loss comparison between two log directories, eliminating manual verification overhead as test cases scale up. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>