How does the Tokenizer in TensorFlow deal with out-of-vocabulary tokens if I don't provide oov_token?
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
encoded_docs = tokenizer.texts_to_sequences(X_train)
padded_sequence = pad_sequences(encoded_docs, maxlen=60)
test_tweets = tokenizer.texts_to_sequences(X_test)
test_padded_sequence = pad_sequences(test_tweets, maxlen=60)
I don't get any error from this code even though I did not provide the oov_token argument. I expected an error at test_tweets = tokenizer.texts_to_sequences(X_test), since X_test contains words the tokenizer never saw during fitting. How does TensorFlow handle out-of-vocabulary words when oov_token is not provided?
Accepted answer, by AloneTogether:
By default, OOV words are simply ignored/discarded if oov_token is None:

import tensorflow as tf

tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(['hello world'])
print(tokenizer.word_index)
sequences = tokenizer.texts_to_sequences(['hello friends'])
print(sequences)
{'hello': 1, 'world': 2}
[[1]]
(answered 2 years ago)
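For contrast, here is a sketch of the same toy example with an explicit oov_token (the token string '<OOV>' is just a conventional choice, not required by the API): when oov_token is set, the Tokenizer reserves index 1 for it and maps unseen words to that index instead of silently dropping them.

```python
import tensorflow as tf

# Same toy example as in the answer, but with an explicit oov_token.
# '<OOV>' is an arbitrary, conventional choice of token string.
tokenizer = tf.keras.preprocessing.text.Tokenizer(oov_token='<OOV>')
tokenizer.fit_on_texts(['hello world'])
print(tokenizer.word_index)  # {'<OOV>': 1, 'hello': 2, 'world': 3}

# 'friends' was never seen during fit_on_texts, so it maps to the
# OOV index (1) instead of being discarded.
sequences = tokenizer.texts_to_sequences(['hello friends'])
print(sequences)  # [[2, 1]]
```

Note that the vocabulary indices shift by one compared with the default case, because index 1 is now taken by the OOV token.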