Description
Hi, I tried the baseline model but ran into a few issues that I'd like to report. First, some dataset names produced by the download script do not match the names used in the baseline scripts:
- `TriviaQA-web.jsonl.gz` -> `TriviaQA.jsonl.gz`
- `NaturalQuestionsShort.jsonl.gz` -> `NaturalQuestions.jsonl.gz`
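For anyone hitting the same mismatch, renaming the downloaded files is an alternative to editing the configs. A small sketch — `rename_datasets` and the direction of the mapping are my own choice, not part of the repo, so flip the mapping if your scripts expect the other side:

```python
from pathlib import Path

# Hypothetical mapping: downloaded name -> name the baseline configs expect.
RENAMES = {
    "TriviaQA-web.jsonl.gz": "TriviaQA.jsonl.gz",
    "NaturalQuestionsShort.jsonl.gz": "NaturalQuestions.jsonl.gz",
}

def rename_datasets(data_dir: str) -> list:
    """Rename any mismatched dataset files found in data_dir and
    return the list of new names that were applied."""
    renamed = []
    for old, new in RENAMES.items():
        src = Path(data_dir) / old
        if src.exists():
            src.rename(Path(data_dir) / new)
            renamed.append(new)
    return renamed
```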
I then changed the names in the sample BERT-Large script to match the actual downloaded files. Training runs, but during validation some warnings are raised, followed by an exception.

The exception:
```
  0%|          | 0/1 [00:00<?, ?it/s]
EM: 69.6104, f1: 77.3335, qas_used_fraction: 1.0000, loss: 3.5823 ||: : 21it [00:10, 2.03it/s]
EM: 68.3673, f1: 75.7397, qas_used_fraction: 1.0000, loss: 3.8775 ||: : 42it [00:21, 1.99it/s]
EM: 66.5335, f1: 75.1974, qas_used_fraction: 1.0000, loss: 4.1714 ||: : 62it [00:31, 1.97it/s]
EM: 66.6667, f1: 75.2976, qas_used_fraction: 1.0000, loss: 4.2777 ||: : 82it [00:42, 1.94it/s]
Traceback (most recent call last):
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/run.py", line 21, in <module>
    run()
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/commands/__init__.py", line 102, in main
    args.func(args)
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/commands/train.py", line 116, in train_model_from_args
    args.cache_prefix)
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/commands/train.py", line 160, in train_model_from_file
    cache_directory, cache_prefix)
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/commands/train.py", line 243, in train_model
    metrics = trainer.train()
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/training/trainer.py", line 493, in train
    val_loss, num_batches = self._validation_loss()
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/training/trainer.py", line 430, in _validation_loss
    loss = self.batch_loss(batch_group, for_training=False)
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/training/trainer.py", line 258, in batch_loss
    output_dict = training_util.data_parallel(batch_group, self.model, self._cuda_devices)
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/training/util.py", line 336, in data_parallel
    losses = gather([output['loss'].unsqueeze(0) for output in outputs], used_device_ids[0], 0)
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/training/util.py", line 336, in <listcomp>
    losses = gather([output['loss'].unsqueeze(0) for output in outputs], used_device_ids[0], 0)
KeyError: 'loss'
```
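Judging from the list comprehension in `data_parallel`, the `KeyError` means at least one GPU replica returned an output dict with no `'loss'` entry during validation — my guess (not confirmed) is a device that received an empty batch when a batch group didn't split evenly across all four GPUs. A simplified, torch-free sketch of a defensive gather, where plain floats stand in for scalar loss tensors and `gather_losses` is a hypothetical name, not an allennlp API:

```python
def gather_losses(outputs):
    """Average the per-replica losses, skipping any replica whose output
    dict has no 'loss' entry (e.g. a device that received an empty
    validation batch), instead of raising KeyError the way the original
    list comprehension in data_parallel does."""
    losses = [out["loss"] for out in outputs if "loss" in out]
    if not losses:
        # Every replica was loss-free; surface that explicitly.
        raise KeyError("no replica produced a 'loss' entry")
    return sum(losses) / len(losses)
```

If an uneven split is indeed the cause, validating on a single device (`'cuda_device': 0`) or picking a batch size that keeps every GPU non-empty might also work around it.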
My running command is:

```bash
python -m allennlp.run train MRQA_BERTLarge.jsonnet -s Models/baseline -o "{'dataset_reader': {'sample_size': 75000}, 'validation_dataset_reader': {'sample_size': 1000}, 'train_data_path': 'data/train/SQuAD.jsonl.gz,data/train/NewsQA.jsonl.gz,data/train/HotpotQA.jsonl.gz,data/train/SearchQA.jsonl.gz,data/train/TriviaQA.jsonl.gz,data/train/NaturalQuestions.jsonl.gz', 'validation_data_path': 'data/in-domain/SQuAD.jsonl.gz,data/in-domain/NewsQA.jsonl.gz,data/in-domain/HotpotQA.jsonl.gz,data/in-domain/SearchQA.jsonl.gz,data/in-domain/TriviaQA.jsonl.gz,data/in-domain/NaturalQuestions.jsonl.gz', 'iterator':{'batch_size':6},'trainer': {'cuda_device': [0,1,2,3], 'num_epochs': '2', 'optimizer': {'type': 'bert_adam', 'lr': 3e-05, 'warmup': 0.1, 't_total': '145000'}}}" --include-package mrqa_allennlp
```
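Since the exception only surfaced after roughly eight hours of training (17:24 to 01:12 in the logs below), it may be worth sanity-checking the comma-separated dataset paths before launching, so that name mismatches like the one above fail fast. `check_data_paths` is a hypothetical helper, not part of the repo:

```python
import os

def check_data_paths(comma_separated):
    """Return the subset of comma-separated file paths that do not exist
    on disk, so typos such as TriviaQA-web vs TriviaQA are caught
    before a multi-hour training run starts."""
    return [p for p in comma_separated.split(",") if not os.path.exists(p)]
```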
The warnings are:

```
2019-06-19 17:24:01,568 - INFO - allennlp.common.params - trainer.patience = 10
2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.validation_metric = +f1
2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.shuffle = True
2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.num_epochs = 2
2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.cuda_device = [0, 1, 2, 3]
2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.grad_norm = None
2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.grad_clipping = None
2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.learning_rate_scheduler = None
2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.momentum_scheduler = None
2019-06-19 17:24:04,942 - INFO - allennlp.common.params - trainer.optimizer.type = bert_adam
2019-06-19 17:24:04,942 - INFO - allennlp.common.params - trainer.optimizer.parameter_groups = None
2019-06-19 17:24:04,943 - INFO - allennlp.training.optimizers - Number of trainable parameters: 335143963
2019-06-19 17:24:04,943 - INFO - allennlp.common.params - trainer.optimizer.infer_type_and_cast = True
2019-06-19 17:24:04,943 - INFO - allennlp.common.params - Converting Params object to dict; logging of default values will not occur when dictionary parameters are used subsequently.
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - CURRENTLY DEFINED PARAMETERS:
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.optimizer.t_total = 145000
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.optimizer.warmup = 0.1
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.optimizer.lr = 3e-05
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.num_serialized_models_to_keep = 20
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.keep_serialized_model_every_num_seconds = None
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.model_save_interval = None
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.summary_interval = 100
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.histogram_interval = None
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.should_log_parameter_statistics = True
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.should_log_learning_rate = False
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.log_batch_size_period = None
2019-06-19 17:24:04,949 - INFO - allennlp.training.trainer - Beginning training.
2019-06-19 17:24:04,949 - INFO - allennlp.training.trainer - Epoch 0/1
2019-06-19 17:24:04,949 - INFO - allennlp.training.trainer - Peak CPU memory usage MB: 46176.068
2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 0 memory usage MB: 2246
2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 1 memory usage MB: 11
2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 2 memory usage MB: 11
2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 3 memory usage MB: 11
2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 4 memory usage MB: 11
2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 5 memory usage MB: 11
2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 6 memory usage MB: 11
2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 7 memory usage MB: 11
2019-06-19 17:24:05,761 - INFO - allennlp.training.trainer - Training
2019-06-20 01:12:28,016 - INFO - allennlp.training.trainer - Validating
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
```