Description
Currently, data conversion steps (e.g., converting data to PyArrow arrays) run only inside trainer.fit, not inside trainer.dev. This causes two issues:
- Data format errors are only caught during training, not during development.
- When debugging in VSCode, breakpoints set in Python libraries (e.g., arrow_writer.py) are never hit during trainer.dev, which makes debugging harder.
Example
In trainer.fit, one conversion step is:
# From: /root/miniconda3/envs/agl/lib/python3.12/site-packages/datasets/arrow_writer.py
out = pa.array(cast_to_python_objects(data, only_1d_for_numpy=True))
If the data format is incorrect, this line may raise an error during trainer.fit. A breakpoint set on it is never hit during trainer.dev, because trainer.dev does not run this step.
Proposed Solution
Run the same data validation/conversion steps in trainer.dev that trainer.fit already runs. This will:
- Catch data format errors earlier.
- Allow breakpoints in data processing code to be triggered during development.
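To illustrate the kind of check this proposal has in mind, here is a minimal sketch of a pre-flight validation pass that could run before any training starts. Everything in it is hypothetical: `preflight_validate` and `strict_column_check` are illustrative names, not part of the trainer or of `datasets`, and the default converter is only a stand-in that mimics a mixed-type failure rather than the real Arrow conversion.

```python
from typing import Any, Callable, Iterable, Mapping


def strict_column_check(values: list[Any]) -> None:
    """Stand-in converter (assumption): require one Python type per column.

    This is NOT what datasets/pyarrow does internally. It only mimics the
    kind of failure a mixed-type column can cause during conversion.
    """
    types = {type(v).__name__ for v in values if v is not None}
    if len(types) > 1:
        raise TypeError(f"mixed types in column: {sorted(types)}")


def preflight_validate(
    rows: Iterable[Mapping[str, Any]],
    convert: Callable[[list[Any]], Any] = strict_column_check,
) -> dict[str, str]:
    """Run `convert` on every column and return {column: error message}."""
    # Pivot row-oriented samples into column-oriented lists.
    columns: dict[str, list[Any]] = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)

    errors: dict[str, str] = {}
    for name, values in columns.items():
        try:
            convert(values)
        except Exception as exc:  # report every bad column, not just the first
            errors[name] = f"{type(exc).__name__}: {exc}"
    return errors


rows = [
    {"prompt": "hello", "reward": 1.0},
    {"prompt": "world", "reward": "high"},  # wrong type on purpose
]
errors = preflight_validate(rows)
print(errors)
```

If PyArrow is installed, passing `convert=pa.array` (after `import pyarrow as pa`) should bring the check closer to the real step, although the actual `datasets` code path also wraps the data in `cast_to_python_objects`, so the two are not guaranteed to fail on exactly the same inputs. Because the converter is invoked from ordinary Python code, a breakpoint placed inside it would be reachable from whichever entry point calls the check.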