Skip to content

NaN or Inf found in input tensor & KeyError: 'loss' #9

@cooelf

Description

@cooelf

Hi, I tried the baseline model but encountered some issues to report for help.

Firstly, there are some mismatched dataset names from the download script and the baseline scripts. Those are,
TriviaQA-web.jsonl.gz -> TriviaQA.jsonl.gz
NaturalQuestionsShort.jsonl.gz -> NaturalQuestions.jsonl.gz

Then I change the names in the sample BERT large script to the actual downloaded ones. But during validating, some warnings are raised followed by an exception.

The exception,
0%| | 0/1 [00:00<?, ?it/s] EM: 69.6104, f1: 77.3335, qas_used_fraction: 1.0000, loss: 3.5823 ||: : 21it [00:10, 2.03it/s] EM: 68.3673, f1: 75.7397, qas_used_fraction: 1.0000, loss: 3.8775 ||: : 42it [00:21, 1.99it/s] EM: 66.5335, f1: 75.1974, qas_used_fraction: 1.0000, loss: 4.1714 ||: : 62it [00:31, 1.97it/s] EM: 66.6667, f1: 75.2976, qas_used_fraction: 1.0000, loss: 4.2777 ||: : 82it [00:42, 1.94it/s] Traceback (most recent call last): File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/runpy.py", line 193, in _run_module_as_main "__main__", mod_spec) File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/runpy.py", line 85, in _run_code exec(code, run_globals) File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/run.py", line 21, in <module> run() File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/run.py", line 18, in run main(prog="allennlp") File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/commands/__init__.py", line 102, in main args.func(args) File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/commands/train.py", line 116, in train_model_from_args args.cache_prefix) File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/commands/train.py", line 160, in train_model_from_file cache_directory, cache_prefix) File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/commands/train.py", line 243, in train_model metrics = trainer.train() File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/training/trainer.py", line 493, in train val_loss, num_batches = self._validation_loss() File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/training/trainer.py", line 430, in _validation_loss loss = self.batch_loss(batch_group, for_training=False) File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/training/trainer.py", line 258, in batch_loss output_dict = training_util.data_parallel(batch_group, self.model, self._cuda_devices) File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/training/util.py", line 336, in data_parallel losses = gather([output['loss'].unsqueeze(0) for output in outputs], used_device_ids[0], 0) File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/training/util.py", line 336, in <listcomp> losses = gather([output['loss'].unsqueeze(0) for output in outputs], used_device_ids[0], 0) KeyError: 'loss'

My running script is,
python -m allennlp.run train MRQA_BERTLarge.jsonnet -s Models/baseline -o "{'dataset_reader': {'sample_size': 75000}, 'validation_dataset_reader': {'sample_size': 1000}, 'train_data_path': 'data/train/SQuAD.jsonl.gz,data/train/NewsQA.jsonl.gz,data/train/HotpotQA.jsonl.gz,data/train/SearchQA.jsonl.gz,data/train/TriviaQA.jsonl.gz,data/train/NaturalQuestions.jsonl.gz', 'validation_data_path': 'data/in-domain/SQuAD.jsonl.gz,data/in-domain/NewsQA.jsonl.gz,data/in-domain/HotpotQA.jsonl.gz,data/in-domain/SearchQA.jsonl.gz,data/in-domain/TriviaQA.jsonl.gz,data/in-domain/NaturalQuestions.jsonl.gz', 'iterator':{'batch_size':6},'trainer': {'cuda_device': [0,1,2,3], 'num_epochs': '2', 'optimizer': {'type': 'bert_adam', 'lr': 3e-05, 'warmup': 0.1, 't_total': '145000'}}}" --include-package mrqa_allennlp

The warning is,
2019-06-19 17:24:01,568 - INFO - allennlp.common.params - trainer.patience = 10 2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.validation_metric = +f1 2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.shuffle = True 2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.num_epochs = 2 2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.cuda_device = [0, 1, 2, 3] 2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.grad_norm = None 2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.grad_clipping = None 2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.learning_rate_scheduler = None 2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.momentum_scheduler = None 2019-06-19 17:24:04,942 - INFO - allennlp.common.params - trainer.optimizer.type = bert_adam 2019-06-19 17:24:04,942 - INFO - allennlp.common.params - trainer.optimizer.parameter_groups = None 2019-06-19 17:24:04,943 - INFO - allennlp.training.optimizers - Number of trainable parameters: 335143963 2019-06-19 17:24:04,943 - INFO - allennlp.common.params - trainer.optimizer.infer_type_and_cast = True 2019-06-19 17:24:04,943 - INFO - allennlp.common.params - Converting Params object to dict; logging of default values will not occur when dictionary parameters are used subsequently. 2019-06-19 17:24:04,944 - INFO - allennlp.common.params - CURRENTLY DEFINED PARAMETERS: 2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.optimizer.t_total = 145000 2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.optimizer.warmup = 0.1 2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.optimizer.lr = 3e-05 2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.num_serialized_models_to_keep = 20 2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.keep_serialized_model_every_num_seconds = None 2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.model_save_interval = None 2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.summary_interval = 100 2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.histogram_interval = None 2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.should_log_parameter_statistics = True 2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.should_log_learning_rate = False 2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.log_batch_size_period = None 2019-06-19 17:24:04,949 - INFO - allennlp.training.trainer - Beginning training. 2019-06-19 17:24:04,949 - INFO - allennlp.training.trainer - Epoch 0/1 2019-06-19 17:24:04,949 - INFO - allennlp.training.trainer - Peak CPU memory usage MB: 46176.068 2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 0 memory usage MB: 2246 2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 1 memory usage MB: 11 2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 2 memory usage MB: 11 2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 3 memory usage MB: 11 2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 4 memory usage MB: 11 2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 5 memory usage MB: 11 2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 6 memory usage MB: 11 2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 7 memory usage MB: 11 2019-06-19 17:24:05,761 - INFO - allennlp.training.trainer - Training 2019-06-20 01:12:28,016 - INFO - allennlp.training.trainer - Validating Warning: NaN or Inf found in input tensor. Warning: NaN or Inf found in input tensor. Warning: NaN or Inf found in input tensor. Warning: NaN or Inf found in input tensor. Warning: NaN or Inf found in input tensor. Warning: NaN or Inf found in input tensor. Warning: NaN or Inf found in input tensor. Warning: NaN or Inf found in input tensor. Warning: NaN or Inf found in input tensor. Warning: NaN or Inf found in input tensor. Warning: NaN or Inf found in input tensor. Warning: NaN or Inf found in input tensor. Warning: NaN or Inf found in input tensor. Warning: NaN or Inf found in input tensor. Warning: NaN or Inf found in input tensor. Warning: NaN or Inf found in input tensor. Warning: NaN or Inf found in input tensor. Warning: NaN or Inf found in input tensor.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions