While training a BERT variant, getting IndexError: index out of range in self

While training XLMRobertaForSequenceClassification, I make the following call:

xlm_r_model(input_ids = X_train_batch_input_ids
            , attention_mask = X_train_batch_attention_mask
            , return_dict = False
           )

I get the following error:

Traceback (most recent call last):
  File "<string>", line 3, in <module>
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 1218, in forward
    return_dict=return_dict,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 849, in forward
    past_key_values_length=past_key_values_length,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 132, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/sparse.py", line 160, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 2044, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
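
The message comes from the torch.nn.Embedding lookup at the bottom of the trace: it is raised whenever some input id is greater than or equal to the embedding table's num_embeddings. A minimal sketch (with made-up sizes, not taken from this question) that reproduces the same message:

import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=10, embedding_dim=4)   # valid ids are 0..9
emb(torch.tensor([0, 5, 9]))     # works
emb(torch.tensor([0, 5, 12]))    # id 12 is out of range -> IndexError: index out of range in self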

Here are the details:

  1. Create the model:

     config = XLMRobertaConfig()
     config.output_hidden_states = False
     xlm_r_model = XLMRobertaForSequenceClassification(config=config)
     xlm_r_model.to(device)  # device is device(type='cpu')

  2. Tokenize:

     xlmr_tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-large')
     MAX_TWEET_LEN = 402

     >>> df_1000.info()  # describes the dataframe I have already prepared
     Int64Index: 1000 entries, 29639 to 44633
     Data columns (total 2 columns):
      #   Column  Non-Null Count  Dtype
     ---  ------  --------------  -----
      0   text    1000 non-null   object
      1   class   1000 non-null   int64
     dtypes: int64(1), object(1)
     memory usage: 55.7+ KB

     X_train = xlmr_tokenizer(list(df_1000[:800].text), padding=True, max_length=MAX_TWEET_LEN+5, truncation=True)  # +5: head room for special tokens / separators

     >>> list(map(len, X_train['input_ids']))  # why 105? Shouldn't it be MAX_TWEET_LEN+5 = 407?
     [105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, ...]

     >>> type(train_index)  # describes (for clarity) the training-fold indices I have already prepared
     >>> train_index.size
     640

     X_train_fold_input_ids = np.array(X_train['input_ids'])[train_index]
     X_train_fold_attention_mask = np.array(X_train['attention_mask'])[train_index]

     >>> i  # batch id
     0
     >>> batch_size
     16

     X_train_batch_input_ids = X_train_fold_input_ids[i:i+batch_size]
     X_train_batch_input_ids = torch.tensor(X_train_batch_input_ids, dtype=torch.long).to(device)
     X_train_batch_attention_mask = X_train_fold_attention_mask[i:i+batch_size]
     X_train_batch_attention_mask = torch.tensor(X_train_batch_attention_mask, dtype=torch.long).to(device)

     >>> X_train_batch_input_ids.size()
     torch.Size([16, 105])  # why 105? Shouldn't this be MAX_TWEET_LEN+5 = 407?
     >>> X_train_batch_attention_mask.size()
     torch.Size([16, 105])  # why 105? Shouldn't this be MAX_TWEET_LEN+5 = 407?
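
One thing worth checking at this point (a sketch using the objects above and only standard tokenizer/model attributes) is whether the ids produced by the tokenizer actually fit into the model's embedding table:

emb = xlm_r_model.get_input_embeddings()      # the nn.Embedding that raises the error

print(xlmr_tokenizer.vocab_size)              # roughly 250k for xlm-roberta-large
print(xlm_r_model.config.vocab_size)          # default of a bare XLMRobertaConfig(), much smaller
print(emb.num_embeddings)                     # same as config.vocab_size

print(int(X_train_batch_input_ids.max()))     # any id >= emb.num_embeddings triggers the IndexError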

After this, I make the xlm_r_model(...) call shown at the beginning of this question, and it fails with the error above.

Having laid out all these details, I still cannot see why I get this error. Where am I going wrong?

Original link: https://stackoverflow.com//questions/71984994/while-training-bert-variant-getting-indexerror-index-out-of-range-in-self

Replies

  • Mahesha999 replied:

    According to this post on GitHub, there can be many causes. Below is the list of causes summarized from that thread (as of 24 April 2022; note that the 2nd and 3rd causes were not tested). Quick checks for each cause are sketched after the list:

    1. Vocabulary-size mismatch between the tokenizer and the BERT model. This makes the tokenizer produce ids that the model cannot look up. (ref)
    2. Model and data living on different devices (CPU, GPU, TPU). (ref)
    3. Sequences longer than 512 tokens (the maximum for BERT-like models). (ref)
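
    A quick sketch of how each cause can be checked, using the variable names from the question above (ordinary tokenizer/model/tensor attributes, nothing from the linked post):

    # 1. vocabulary sizes should match
    print(xlmr_tokenizer.vocab_size, xlm_r_model.config.vocab_size)

    # 2. model and batch should live on the same device
    print(next(xlm_r_model.parameters()).device, X_train_batch_input_ids.device)

    # 3. sequence length must not exceed the model's limit (512 here)
    print(X_train_batch_input_ids.shape[1])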

    In my case it was the first cause, a vocabulary-size mismatch, and this is how I fixed it:

    xlmr_tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-large')
    config = XLMRobertaConfig()
    config.vocab_size = xlmr_tokenizer.vocab_size  # set both to the same vocab size
    xlm_r_model = XLMRobertaForSequenceClassification(config=config)  # rebuild the model from the fixed config
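
    After aligning the sizes, the model has to be (re)created from this config, as in the last line above. An alternative that avoids the mismatch altogether (not something the original reply mentions) is to build the config with XLMRobertaConfig.from_pretrained('xlm-roberta-large'), which takes vocab_size directly from the checkpoint.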
    
    2 years ago · 0 comments