[Feature] Add data validation steps to trainer.dev for early error detection #385

@dalek-who

Description

Currently, data conversion steps (e.g., converting data to PyArrow arrays) only run inside trainer.fit. This causes two issues:

  1. Data format errors are only caught during training, not during development.
  2. When debugging in VSCode, breakpoints set in Python libraries (e.g., arrow_writer.py) are not hit during trainer.dev, which makes debugging harder.

Example

In trainer.fit, one conversion step is:

# From: /root/miniconda3/envs/agl/lib/python3.12/site-packages/datasets/arrow_writer.py
out = pa.array(cast_to_python_objects(data, only_1d_for_numpy=True))

If the data format is incorrect, this line may raise an error in trainer.fit, but a breakpoint set on it is not hit during trainer.dev.

Proposed Solution

Run the same data validation/conversion steps in trainer.dev that trainer.fit already runs. This will:

  1. Catch data format errors earlier.
  2. Allow breakpoints in data processing code to be triggered during development.
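
A minimal sketch of what such an early check could look like, assuming the data arrives as an iterable of dict-like rows. The helper name validate_columns and its convert parameter are hypothetical and not part of the trainer API. The default converter only calls pa.array, which is a simplification of the arrow_writer.py line quoted above. Where trainer.dev would call this is also an assumption.

```python
from typing import Any, Callable, Iterable, Mapping


def _to_arrow(values: list) -> Any:
    # Imported lazily so the helper has no hard dependency on pyarrow.
    # Simplified stand-in for the conversion performed in arrow_writer.py.
    import pyarrow as pa

    return pa.array(values)


def validate_columns(
    rows: Iterable[Mapping[str, Any]],
    convert: Callable[[list], Any] = _to_arrow,
) -> dict[str, str]:
    """Hypothetical helper: eagerly convert each column, collect failures.

    Returns a mapping of column name to error message; empty if all
    columns convert cleanly.
    """
    # Pivot row-oriented records into column-oriented lists.
    columns: dict[str, list] = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)

    errors: dict[str, str] = {}
    for name, values in columns.items():
        try:
            convert(values)
        except Exception as exc:  # conversion errors vary by converter
            errors[name] = f"{type(exc).__name__}: {exc}"
    return errors
```

Running a check like this in the development path would execute the conversion code in the same process as the debugger, so a breakpoint in the conversion code has a chance to be hit before training starts. Passing the converter as a parameter keeps the sketch testable without pyarrow installed.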
