How to apply tf.keras.preprocessing.text.Tokenizer on tf.data.TextLineDataset?

I am loading a TextLineDataset and I want to apply a tokenizer trained on a file:

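import tensorflow as tf

# filename is the path to the text file (one example per line)
data = tf.data.TextLineDataset(filename)

MAX_WORDS = 20000
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=MAX_WORDS)
# Fit eagerly on plain Python strings decoded from the dataset
tokenizer.fit_on_texts([x.numpy().decode('utf-8') for x in data])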

Now I want to apply this tokenizer on data so that every word is replaced with its encoded value. I tried data.map(lambda x: tokenizer.texts_to_sequences(x)), which fails with OperatorNotAllowedInGraphError: iterating over tf.Tensor is not allowed in Graph execution. Use Eager execution or decorate this function with @tf.function.

Following that advice, when I write the code as:

@tf.function
def fun(x):
    # Still fails: texts_to_sequences iterates over x, which is a symbolic tensor here
    return tokenizer.texts_to_sequences(x)
data.map(lambda x: fun(x))

I get: OperatorNotAllowedInGraphError: iterating over tf.Tensor is not allowed: AutoGraph did convert this function. This might indicate you are trying to use an unsupported feature

So how can I tokenize data?

Original link: https://stackoverflow.com//questions/71679122/how-to-apply-tf-keras-preprocessing-text-tokenizer-on-tf-data-textlinedataset

Answers

  • AloneTogether wrote:

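    The problem is that tf.keras.preprocessing.text.Tokenizer is not meant to be used in graph mode. Check the docs: both fit_on_texts and texts_to_sequences require lists of strings, not tensors. I would recommend using tf.keras.layers.TextVectorization, but if you really want to use the Tokenizer approach, try something like this: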

    import tensorflow as tf
    
    # Write a small sample file so the example is self-contained.
    with open('data.txt', 'w') as f:
      f.write('this is a very important sentence \n')
      f.write('where is my cat actually?\n')
      f.write('fish are everywhere!\n')
    
    # Read back the same file that was just written.
    dataset = tf.data.TextLineDataset(['data.txt'])
    
    # Fit the tokenizer eagerly on plain Python strings.
    tokenizer = tf.keras.preprocessing.text.Tokenizer()
    tokenizer.fit_on_texts([n.numpy().decode("utf-8") for n in dataset])
    
    def tokenize(x):
      # Runs eagerly inside tf.py_function, so x.numpy() is available here.
      return tokenizer.texts_to_sequences([x.numpy().decode("utf-8")])
    
    # tf.py_function returns a list of outputs; [0] selects the single int32 tensor.
    dataset = dataset.map(lambda x: tf.py_function(tokenize, [x], Tout=[tf.int32])[0])
    
    for d in dataset:
      print(d)
    
    Output:
    
    tf.Tensor([2 1 3 4 5 6], shape=(6,), dtype=int32)
    tf.Tensor([ 7  1  8  9 10], shape=(5,), dtype=int32)
    tf.Tensor([11 12 13], shape=(3,), dtype=int32)
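
The IDs printed above are consistent with a word index ranked by frequency: "is" occurs twice and gets ID 1, and the remaining words are numbered in the order they first appear. The following is a simplified pure-Python sketch of that idea, not Keras's actual implementation. The helper names build_word_index and texts_to_ids are hypothetical, and the normalisation (lowercasing, keeping only runs of word characters) is an assumption that only approximates the Tokenizer's default filters.

```python
import re
from collections import Counter


def build_word_index(texts):
    """Map each word to an ID: the most frequent word gets 1, ties keep first-seen order."""
    counts = Counter()
    for text in texts:
        # Simplified normalisation: lowercase, keep only runs of word characters.
        counts.update(re.findall(r"\w+", text.lower()))
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_ids(texts, word_index):
    """Replace every known word with its ID; unknown words are dropped."""
    return [
        [word_index[w] for w in re.findall(r"\w+", text.lower()) if w in word_index]
        for text in texts
    ]


lines = [
    "this is a very important sentence ",
    "where is my cat actually?",
    "fish are everywhere!",
]
word_index = build_word_index(lines)
sequences = texts_to_ids(lines, word_index)
print(sequences)  # [[2, 1, 3, 4, 5, 6], [7, 1, 8, 9, 10], [11, 12, 13]]
```

Because this logic works on Python strings and dictionaries, it has to run eagerly, which is why the Tokenizer call is wrapped in tf.py_function when used inside dataset.map.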
    

    Using the TextVectorization layer looks like this:

    import tensorflow as tf
    
    with open('data.txt', 'w') as f:
      f.write('this is a very important sentence \n')
      f.write('where is my cat actually?\n')
      f.write('fish are everywhere!\n')
    
    # Read back the same file that was just written.
    dataset = tf.data.TextLineDataset(['data.txt'])
    
    # Learn the vocabulary from the dataset, then map each line to integer IDs.
    vectorize_layer = tf.keras.layers.TextVectorization(output_mode='int')
    vectorize_layer.adapt(dataset)
    
    dataset = dataset.map(vectorize_layer)
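
To see what the layer produces, it can help to inspect the learned vocabulary and one mapped batch. The sketch below assumes a TensorFlow 2.x install where tf.keras.layers.TextVectorization is available. It feeds the same three sentences from memory instead of data.txt, and the batching is my own addition rather than part of the answer.

```python
import tensorflow as tf

lines = [
    "this is a very important sentence ",
    "where is my cat actually?",
    "fish are everywhere!",
]
dataset = tf.data.Dataset.from_tensor_slices(lines)

# Learn the vocabulary from batches of raw strings.
vectorize_layer = tf.keras.layers.TextVectorization(output_mode="int")
vectorize_layer.adapt(dataset.batch(2))

# Index 0 is reserved for padding and index 1 for out-of-vocabulary words.
vocab = vectorize_layer.get_vocabulary()
print(vocab[:3])

# The layer runs in graph mode, so it can be passed to map directly.
batched = dataset.batch(3).map(vectorize_layer)
first_batch = next(iter(batched))
print(first_batch.shape)  # shorter sentences are padded with 0
```

Unlike the Tokenizer, the layer operates on tensors, so no tf.py_function wrapper is needed. Its IDs are offset by the two reserved entries, so they will not match the Tokenizer output above.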
    