Training CamelBERT model for token classification


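I am trying to use a Hugging Face model (CamelBERT) for token classification with the ANERcorp dataset. I supplied the ANERcorp training set to train the model, but I get the following error.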

Error:

Some weights of the model checkpoint at CAMeL-Lab/bert-base-arabic-camelbert-ca were not used when initializing BertForTokenClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at CAMeL-Lab/bert-base-arabic-camelbert-ca and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
03/16/2022 07:31:01 - INFO - utils -   Creating features from dataset file at /content/drive/MyDrive/ANERcorp-CamelLabSplits
03/16/2022 07:31:01 - INFO - utils -   Writing example 0 of 3973
Traceback (most recent call last):
  File "/content/CAMeLBERT/token-classification/run_token_classification.py", line 381, in <module>
    main()
  File "/content/CAMeLBERT/token-classification/run_token_classification.py", line 226, in main
    if training_args.do_train
  File "/content/CAMeLBERT/token-classification/utils.py", line 132, in __init__
    pad_token_label_id=self.pad_token_label_id,
  File "/content/CAMeLBERT/token-classification/utils.py", line 210, in convert_examples_to_features
    label_ids.extend([label_map[label]] +
KeyError: 'B-LOC'

Note: I am using Google Colab to train the model. Code:

DATA_DIR="/content/drive/MyDrive/ANERcorp-CamelLabSplits"
MAX_LENGTH=512
BERT_MODEL="CAMeL-Lab/bert-base-arabic-camelbert-ca"
OUTPUT_DIR="/content/Output"
BATCH_SIZE=32
NUM_EPOCHS=3
SAVE_STEPS=750
SEED=12345

!python /content/CAMeLBERT/token-classification/run_token_classification.py \
--data_dir $DATA_DIR \
--task_type ner \
--labels $DATA_DIR/train.txt \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR \
--max_seq_length $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_device_train_batch_size $BATCH_SIZE \
--save_steps $SAVE_STEPS \
--seed $SEED \
--overwrite_output_dir \
--overwrite_cache \
--do_train \
--do_predict

Original link: https://stackoverflow.com//questions/71493915/training-camelbert-model-for-token-classification

Replies

  • ewz93 commented:

    The script you are using loads the labels from $DATA_DIR/train.txt.

    See https://github.com/CAMeL-Lab/CAMeLBERT/blob/master/token-classification/run_token_classification.py#L105 for what the model expects.

    It then tries to load the list of labels as the first file from the corpus (even before it loads the training data), see https://github.com/CAMeL-Lab/CAMeLBERT/blob/master/token-classification/run_token_classification.py#L183, and puts it into label_map.
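To make the failure mode concrete, here is a minimal sketch of how a lookup table built from a label list can raise the KeyError seen in the traceback. The helper names build_label_map and encode are hypothetical and only illustrate the idea; they are not taken from the CAMeLBERT script.

```python
def build_label_map(labels):
    """Map each label string to an integer id (hypothetical helper)."""
    return {label: i for i, label in enumerate(labels)}


def encode(tags, label_map):
    """Convert a sequence of tag strings to ids; raises KeyError on an unknown tag."""
    return [label_map[tag] for tag in tags]


# A label list that contains the tags used in the data works as expected.
good_map = build_label_map(["O", "B-LOC", "I-LOC"])
print(encode(["B-LOC", "O"], good_map))

# An empty label list (for example, because nothing usable was read)
# fails on the very first tag that is looked up.
empty_map = build_label_map([])
try:
    encode(["B-LOC"], empty_map)
except KeyError as err:
    print("KeyError:", err)
```

Under these assumptions, any tag that is missing from the label list fails in the same way, whether the list is empty or simply does not match the tags in the data.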

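    But this fails for some reason. My assumption is that it does not find anything and label_map is an empty dict, so the first attempt to look up a label in it fails with a KeyError. Possibly your input data does not exist or is not at the expected path (check that you have the right files and the right value for $DATA_DIR). In my experience, relative paths in Google Drive can be tricky. Try something simple to see whether it works, such as os.listdir(DATA_DIR), to check that the directory really contains what you expect.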
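The path check suggested above could be sketched roughly as follows. The file names train.txt, dev.txt and test.txt are an assumption about what the script looks for in the data directory, and check_data_dir is a hypothetical helper, so adjust both to your setup.

```python
import os


def check_data_dir(data_dir, expected=("train.txt", "dev.txt", "test.txt")):
    """Return {file name: True/False}, True meaning the file exists and is non-empty.

    The default file names are an assumption, not confirmed from the script.
    """
    if not os.path.isdir(data_dir):
        raise FileNotFoundError(f"Not a directory: {data_dir}")
    present = set(os.listdir(data_dir))
    return {
        name: name in present
        and os.path.getsize(os.path.join(data_dir, name)) > 0
        for name in expected
    }


if __name__ == "__main__":
    # Path taken from the question; in Colab, Google Drive must be mounted first.
    DATA_DIR = "/content/drive/MyDrive/ANERcorp-CamelLabSplits"
    if os.path.isdir(DATA_DIR):
        print(check_data_dir(DATA_DIR))
    else:
        print("Directory not found:", DATA_DIR)
```

If any entry comes back False, the path or the Drive mount is the first thing to look at before suspecting the labels.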

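    If that is not the problem, then something about the labels themselves may be wrong. Does ANERCorp write its labels in exactly this way (B-LOC etc.)? If it differs (for example B-Location or something else), it would fail as well.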
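One way to test the label hypothesis is to compare the tags that actually occur in the training file with the contents of the file passed to --labels. The sketch below assumes that the data file has one whitespace-separated "token label" pair per line with the label in the last column, and that the label file has one label per line. Both formats are assumptions, and the function names are hypothetical.

```python
def read_label_list(path):
    """Read a label file, assuming one label per line (blank lines skipped)."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def labels_in_data(path):
    """Collect the distinct labels from 'token label' lines (label = last column)."""
    labels = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:  # skip blank sentence separators
                labels.add(parts[-1])
    return labels


def missing_labels(label_file, data_file):
    """Return the labels used in data_file that label_file does not list."""
    known = set(read_label_list(label_file))
    return sorted(labels_in_data(data_file) - known)
```

Calling missing_labels("labels.txt", "train.txt") with your own paths should return an empty list when everything matches. Note that if the file passed as label_file is itself a token/label file, as with --labels $DATA_DIR/train.txt in the command from the question, every tag would be reported as missing under these assumptions, so it may be worth checking which format the script expects for that argument.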
