A single CLI surface for planning sweeps, launching jobs (directly or via containers), and monitoring SLURM runs through declarative configs. See SPEC.md for platform-wide goals; this README focuses on the workflows you touch every day.
For your own experiments, first create your own branch `exp_YOURNAME` and add a folder `config/experiments/YOURNAME`. Within that folder you can then add your own experiment composition files (see the existing ones for reference), with `# @package _global_` as the header so the contents end up at the top level of the config. You can then run your experiment with:
```bash
PYTHONPATH=. python scripts/run_autoexp_container.py --config-ref experiments/YOURNAME/myexperiment
```
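As a concrete end-to-end sketch, with `alice` and `my_experiment` as placeholder names:

```bash
# Placeholder names: replace "alice" and "my_experiment" with your own
git checkout -b exp_alice
mkdir -p config/experiments/alice
# create config/experiments/alice/my_experiment.yaml with "# @package _global_" as its first line,
# then launch it:
PYTHONPATH=. python scripts/run_autoexp_container.py --config-ref experiments/alice/my_experiment
```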
First, clone this repository including its submodules:

```bash
git clone https://github.com/OpenEuroLLM/oellm-autoexp.git --recurse-submodules
```

Then install it so the basic requirements are available:
```bash
pip install -e .
```

If you have numpy>=2.0 on your system, apply the patch for the annoying `numpy.product` error:
```bash
bash ./apply_megatron_numpy_product_patch.sh
```

On LUMI:

- Install prerequisites outside the container (rccl-tuner, cray-python, etc.) following the LUMI docs (see https://github.com/sfantao/rccl-tuner.git).
- Build the Megatron container from the provided defs (see `container/megatron/MegatronTrainingLumi.def.in`) so the correct ROCm + network tuning ends up inside the image.
- Export the usual SLURM/path variables (at a minimum `SLURM_ACCOUNT`, `SLURM_PARTITION[_DEBUG]`, `CONTAINER_CACHE_DIR`, `OUTPUT_DIR`) in your profile; the scripts read them automatically (see the sketch after this list).
- Apply the numpy patch, as the container's numpy version is too new to still support `np.product`.
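The exports in your profile might look like the sketch below; the account, partition names, and paths are placeholders to replace with your own:

```bash
# Placeholder values -- substitute your own project account, partitions, and paths
export SLURM_ACCOUNT=project_465000000
export SLURM_PARTITION=standard-g
export SLURM_PARTITION_DEBUG=dev-g
export CONTAINER_CACHE_DIR=/scratch/$SLURM_ACCOUNT/containers
export OUTPUT_DIR=/scratch/$SLURM_ACCOUNT/oellm-autoexp-outputs
```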
To run on MARENOSTRUM, you need to install oellm-autoexp (or at least its requirements) in a conda environment. To do this:
- Install conda-pack on your local machine: `conda install conda-pack`
- Create a new conda environment: `conda env create -f base_environment.yaml -n oellm_autoexp`
- Pack/export the conda env: `conda-pack --name oellm_autoexp --output oellm_autoexp.tar.gz`
- Send it to MARENOSTRUM: `rsync oellm_autoexp.tar.gz marenostrum:~/`
- Unpack it there: `mkdir oellm_autoexp_env ; cd oellm_autoexp_env ; tar -xzvf ../oellm_autoexp.tar.gz`
- Add the environment to your bashrc (`echo "source ~/oellm_autoexp_env/bin/activate" >> ~/.bashrc`) or load it only when you need it.
- Use a container built on LEONARDO or JUWELS (MARENOSTRUM has no internet access, so you can't build anything there).
- Copy datasets, tokenizers, etc. over manually, since there is no web connection on the compute and login nodes (see the sketch below).
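Copying data over could look like the following; the local and remote paths are purely illustrative:

```bash
# Illustrative paths only -- point these at your actual dataset and tokenizer locations
rsync -av /local/data/my_dataset/   marenostrum:/path/to/project/datasets/my_dataset/
rsync -av /local/data/my_tokenizer/ marenostrum:/path/to/project/tokenizers/my_tokenizer/
```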
For LEONARDO, everything should work with a pre-built container image. To build a container image on LEONARDO yourself, run these commands:
- Download the PyTorch base image from nvcr.io:

```bash
singularity build --sandbox --fix-perms --force $CONTAINER_CACHE_DIR/pytorch__25.08-py3_sandbox docker://nvcr.io/nvidia/pytorch:25.08-py3
```

- Build the user-base container (in `container`), but from a compute node; on the login node the build runs out of resources and gets killed:

```bash
python build_container_user.py --backend megatron --definition MegatronTrainingNoRoot --append-date --container-cmd singularity --base-image $CONTAINER_CACHE_DIR/pytorch__25.08-py3_sandbox
```

Make sure datasets and tokenizers are downloaded before starting a job, as there is no web connection on the compute nodes.
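To get a shell on a compute node for the build, a standard SLURM interactive allocation works; the partition, account, and time limit below are placeholders:

```bash
# Placeholder partition/account/time limit -- adjust to your LEONARDO project
srun --partition=<partition> --account=<account> --time=01:00:00 --cpus-per-task=8 --pty bash
```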
```bash
# Plan + submit + monitor in one go (manifest written to outputs/manifests/)
python scripts/run_autoexp.py project=default
# Prefer explicit plan/submit?
python scripts/plan_autoexp.py --manifest outputs/manifests/demo.json project=default
python scripts/submit_autoexp.py --manifest outputs/manifests/demo.json
# Probe a specific sweep entry
python scripts/run_sweep_entry.py --sweep outputs/sweep.json --index 0 --dry-run
```

- Every submission drops `<monitoring_state_dir>/<plan_id>.json`. Resuming is symmetric:

```bash
python scripts/monitor_autoexp.py --session <plan_id>
# legacy path works too
python scripts/monitor_autoexp.py --manifest outputs/manifests/<plan>.json
```

- The session file stores `resolved_log_path`, the last SLURM state, the per-event history, and the manifest path, so you can crash and restart without guessing log names. `scripts/tests/test_monitor_resume.py` is the on-cluster smoke test that exercises plan → submit → interrupt → resume.
- The action queue lives next to the session file (`<plan_id>.actions/`). Use the CLI to keep it short and readable:

```bash
# Summaries (optionally filter by event)
python scripts/monitor_autoexp.py --session <plan_id> --cmd queue
python scripts/monitor_autoexp.py --session <plan_id> --cmd queue --queue-event checkpoint_saved
# Inspect or retry a specific entry
python scripts/monitor_autoexp.py --session <plan_id> --cmd queue \
--queue-id <queue_uuid> --queue-show
python scripts/monitor_autoexp.py --session <plan_id> --cmd queue \
--queue-id <queue_uuid> --queue-retry
# Run the worker that executes queued actions (for example, RunAutoexpAction)
python scripts/monitor_autoexp.py --session <plan_id> --cmd actions
```

Each entry is a single JSON file (`<event_id>/<queue_id>.json`), so you can always inspect it manually, but the CLI saves you from editing the files by hand.
```bash
# Build once (see container/MegatronTraining*.def for examples)
./container/build_container.sh --backend megatron --definition MegatronTraining --output ./artifacts
# On LUMI or other clusters, where --fake-root for container builds is not available
python ./container/build_container_user.py --backend megatron --definition MegatronTrainingLumi \
--requirements container/megatron/requirements_latest.txt \
--output ./artifacts
# Plan inside the container, submit/monitor on host
python scripts/run_autoexp_container.py \
--config-ref experiments/megatron_jupiter_speed_test \
container=juwels
# Inspect or reuse the manifest later
python scripts/monitor_autoexp.py --manifest outputs/manifests/<plan>.json
```

`run_autoexp_container.py` accepts `--no-submit`, `--no-monitor`, and `--monitor-only`, so the same manifest feeds both automation and manual debugging.
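One possible flow, assuming the manifest written during planning can then be handed to the explicit submit/monitor scripts shown earlier (the manifest name is a placeholder):

```bash
# Plan only; nothing is submitted yet
python scripts/run_autoexp_container.py \
    --config-ref experiments/megatron_jupiter_speed_test \
    container=juwels --no-submit

# Later: submit and monitor from the generated manifest
python scripts/submit_autoexp.py --manifest outputs/manifests/<plan>.json
python scripts/monitor_autoexp.py --manifest outputs/manifests/<plan>.json
```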
Monitoring behavior lives entirely in YAML. Keep it small, keep it explicit:
```yaml
# config/monitoring/megatron_checkpoint_eval.yaml
log_events:
  - name: checkpoint_saved
    pattern: "saved checkpoint (?P<path>\\S+)"
    pattern_type: regex
    metadata:
      kind: checkpoint
    extract_groups:
      checkpoint_path: path
    actions:
      - class_name: EventAction
        mode: queue
        conditions:
          - class_name: FileExistsCondition
            path: "{checkpoint_path}"
            blocking: true
        action:
          class_name: RunCommandAction
          # does not work yet
          command: "python convert_checkpoint.py {checkpoint_path} {checkpoint_path}/../converted"

state_events:
  - name: stall
    state:
      class_name: StalledState
    actions:
      - class_name: EventAction
        action:
          class_name: RestartAction
          reason: "retry stall"
  - name: success
    state:
      class_name: SuccessState
    actions:
      - class_name: EventAction
        action:
          class_name: LogAction
          message: "run_finished"
```

`log_events` describe detectors (regex/substring/inactivity). `state_events` wire SLURM transitions (stall, timeout, success, etc.) to actions. Because nothing is hidden in code, configs like `config/monitoring/megatron_basic.yaml` act as the canonical reference: fork it, add or remove events, and the controller will follow exactly what the YAML states.
- Use `pattern_type: inactivity` to emit events when the log or additional output paths stop changing for `inactivity_threshold_seconds`.
- Queue actions (`mode: queue`) for sidecar jobs or evaluations; inline actions are for immediate restarts/logging.
- Everything templated by way of `str.format` gets the merged metadata (`job_id`, `checkpoint_path`, etc.), so downstream automation stays simple.
For more context (provenance capture, fake SLURM client, restart CLI), refer to docs/ and the unit/integration tests—they mirror real-world usage.
To update the Megatron backend, first check out the new submodule version and build a new container. Within that container, run the config generation scripts `scripts/generate_megatron_config.py` and `scripts/generate_megatron_dataclass.py`; you might have to adapt the transformer_engine mocks in those scripts.
Also, some containers apparently don't pick up the correct C++ path; you might have to `export CXX=$(which clang++)`, for example on LUMI.
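Put together, a backend bump might look like the sketch below; the submodule path and version tag are placeholders, and invoking the generators with plain `python` is an assumption:

```bash
# Placeholder submodule path and ref
git -C third_party/Megatron-LM fetch origin
git -C third_party/Megatron-LM checkout <new_version_tag>
git add third_party/Megatron-LM

# Rebuild the container as described above, then inside it:
export CXX=$(which clang++)   # e.g. on LUMI, when the default C++ path is wrong
python scripts/generate_megatron_config.py
python scripts/generate_megatron_dataclass.py
```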