Currently, we keep every full-state checkpoint and every hf_format checkpoint. This uses a lot of storage purely for the sake of resumability, even though resuming only ever needs the latest full-state checkpoint. We could get the same resume behavior by keeping only the most recent full-state checkpoint. Keeping all checkpoints could remain available as an additional configuration option.
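
To make the proposal concrete, here is a minimal sketch of the pruning step. It is not the project's existing code: the function name `prune_full_state_checkpoints`, the `keep_all` flag, and the assumption that each full-state checkpoint lives in its own subdirectory of one parent directory (with recency judged by modification time) are all hypothetical and only there for illustration.

```python
import shutil
from pathlib import Path


def prune_full_state_checkpoints(full_state_dir, keep_all=False):
    """Delete all but the most recent full-state checkpoint.

    Hypothetical helper: assumes every immediate subdirectory of
    `full_state_dir` is one full-state checkpoint, and that the newest
    checkpoint is the one with the latest modification time.
    Returns the list of checkpoint paths that were removed.
    """
    # Opt-out: keeping every checkpoint stays possible via configuration.
    if keep_all:
        return []

    root = Path(full_state_dir)
    if not root.is_dir():
        return []

    # Oldest first, newest last.
    checkpoints = sorted(
        (p for p in root.iterdir() if p.is_dir()),
        key=lambda p: p.stat().st_mtime,
    )

    # Everything except the newest checkpoint is no longer needed to resume.
    stale = checkpoints[:-1]
    for path in stale:
        shutil.rmtree(path)
    return stale
```

If something like this were adopted, it would probably be safest to call it only after a new full-state checkpoint has been written successfully, so that a failed save never leaves the run without a checkpoint to resume from. The hf_format checkpoints are not touched by this sketch.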