Tokenize text but keep compound hyphenated words together

扎眼的阳光 2 years ago nlp 401

Original title: Tokenize text but keep compound hyphenated words together

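I am trying to clean up text with a preprocessing function. I want to remove all non-alphabetic characters, such as punctuation and digits, but keep compound words that are joined by a hyphen without splitting them (e.g. pre-tender, pre-construction).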

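import re

def preprocess(text):
  # Attempt to handle hyphenated words. Note: in a non-raw string, '\b' is a
  # backspace character, not a regex word boundary, so this rarely matches.
  text = re.sub('\b[A-Za-z]+(?:-+[A-Za-z]+)+\b', '-', text)
  # Replace every non-letter (punctuation, digits, hyphens) with a space
  text = re.sub('[^a-zA-Z]', ' ', text)
  # Collapse runs of whitespace into single spaces
  text = text.split()
  text = " ".join(text)
  return text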

For example, the original text:

"Attended pre-tender meetings"

should be split into

['attended', 'pre-tender', 'meeting']

rather than

['attended', 'pre', 'tender', 'meeting']

Any help would be greatly appreciated!
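
One possible direction, offered as a minimal sketch rather than a definitive answer: instead of deleting unwanted characters, match the tokens to keep. The substitution `[^a-zA-Z]` in the question also matches `-`, so any hyphen still present at that step is turned into a space. The sketch below assumes tokens are plain ASCII letters and that compounds are joined by single hyphens; the names `TOKEN_RE` and `tokenize_keep_hyphens` are illustrative and do not come from the original post.

```python
import re

# Hypothetical pattern, not from the original post: one run of letters,
# optionally followed by more runs that are each joined by a single hyphen.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")


def tokenize_keep_hyphens(text):
    """Return lowercase alphabetic tokens, keeping hyphenated compounds whole."""
    return [match.group(0).lower() for match in TOKEN_RE.finditer(text)]


def preprocess(text):
    """Return a space-joined string like the question's function, but lowercased."""
    return " ".join(tokenize_keep_hyphens(text))


print(tokenize_keep_hyphens("Attended pre-tender meetings, 2 of them!"))
# ['attended', 'pre-tender', 'meetings', 'of', 'them']
```

Punctuation, digits, and stray or doubled hyphens are simply never matched, so they drop out without a separate cleanup pass. Reducing "meetings" to "meeting", as in the expected output above, would need a separate stemming or lemmatization step that this sketch does not cover.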

Original link: https://stackoverflow.com//questions/71659125/tokenize-text-but-keep-compund-hyphenated-words-together
