Replacing acronyms with their full forms in Python
I have a dictionary of acronyms, with the acronyms as keys and their full forms as values.
I want to replace the acronyms in text_list with their full forms to arrive at output_list.
acronym_dict = {
    'QUO': 'Quotation',
    'IN': 'India',
    'SW': 'Software',
    'RE': 'Regular Expression'
}
text_list = [
    'The status QUO has changed',
    'I SWEAR, This is part of IN_SW',
    'The update does not belong to the SW, version, branch',
    'This is a RE_Text'
]
output_list = [
    'The status Quotation has changed',
    'I SWEAR, This is part of India_Software',
    'The update does not belong to the Software, version, branch',
    'This is a Regular Expression_Text'
]
I wrote a method to do this:
import string

def remove_punctuations(text):
    punct_str = string.punctuation  # !"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~
    for punctuation in punct_str:
        text = text.replace(punctuation, ' ')
    return text.strip()

def replace_single_acronym(text, acronym, fullform):
    words = text.split()
    return_words = []
    for w in words:
        if remove_punctuations(w).lower() == acronym.lower():
            return_words.append(w.replace(acronym, fullform))
        else:
            return_words.append(w)
    return " ".join(return_words)

my_op_list = []
for text in text_list:
    for acronym in acronym_dict.keys():
        text = replace_single_acronym(text, acronym, acronym_dict[acronym])
    my_op_list.append(text)
Ideally, output_list and my_op_list should look the same. Instead, it prints the following result (it fails in 2 instances):
['The status Quotation has changed',
'I SWEAR, This is part of IN_SW',
'The update does not belong to the Software, version, branch',
'This is a RE_Text']
In addition, replace_single_acronym is very slow on a corpus of 1000 text_list items.
Can someone help me adjust the method so that it works correctly and efficiently?
Answers
Daweo answered:
You can use re.sub for this task by passing a function as its second argument:
import re

acronym_dict = {
    'QUO': 'Quotation',
    'IN': 'India',
    'SW': 'Software',
    'RE': 'Regular Expression'
}
text_list = [
    'The status QUO has changed',
    'I SWEAR, This is part of IN_SW',
    'The update does not belong to the SW, version, branch',
    'This is a RE_Text'
]

def get_full_name(m):
    return acronym_dict.get(m.group(1), m.group(1))

def replace_acronyms(text):
    return re.sub(r'(?<![A-Z])([A-Z]+)(?![A-Z])', get_full_name, text)

output_list = [replace_acronyms(i) for i in text_list]
print(output_list)
Output:
['The status Quotation has changed', 'I SWEAR, This is part of India_Software', 'The update does not belong to the Software, version, branch', 'This is a Regular Expression_Text']
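Explanation: the pattern I used contains two zero-length assertions and one capturing group. It finds one or more uppercase ASCII letters that are not preceded by an uppercase ASCII letter (negative lookbehind) and not followed by an uppercase ASCII letter (negative lookahead).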
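To make the pattern's behaviour concrete, here is a small sketch that lists what it captures in one of the sample sentences. The `\bIN\b` comparison is an added illustration, not part of the answer: it shows that an ordinary word boundary would miss IN_SW, because the underscore counts as a word character.

```python
import re

# Same pattern as in the answer: a run of uppercase ASCII letters that is
# neither preceded nor followed by another uppercase ASCII letter.
pattern = re.compile(r'(?<![A-Z])([A-Z]+)(?![A-Z])')

sample = 'I SWEAR, This is part of IN_SW'

# With a single capturing group, findall returns the group contents.
# Every complete uppercase run is a candidate, including lone capitals
# such as the "T" in "This".
candidates = pattern.findall(sample)
print(candidates)

# For comparison (illustration only): \b sees no boundary between "N" and
# "_", so a word-boundary pattern does not find IN inside IN_SW.
word_boundary_hits = re.findall(r'\bIN\b', sample)
print(word_boundary_hits)
```

Only the candidates that also appear as keys in acronym_dict end up being replaced; SWEAR is captured as a whole run, so the SW inside it is never considered on its own.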
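get_full_name is the function passed as the second argument of re.sub, so it accepts a single argument, the match object. m.group(1) is the content of the only capturing group in the pattern, that is, the acronym. I use dict's .get method so that if the acronym is among the dict keys, the corresponding value is used, and otherwise the acronym itself is returned, i.e. nothing is changed.
Answered 2 years ago.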
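Since the question also asks about speed on a larger corpus, one possible variation is to compile the pattern once and have it match only the known acronyms, so the callback needs no fallback. This is a sketch under assumptions, not the answer's code: build_replacer is a hypothetical helper name, and whether it is actually faster on a given corpus would need to be measured.

```python
import re

acronym_dict = {
    'QUO': 'Quotation',
    'IN': 'India',
    'SW': 'Software',
    'RE': 'Regular Expression'
}
text_list = [
    'The status QUO has changed',
    'I SWEAR, This is part of IN_SW',
    'The update does not belong to the SW, version, branch',
    'This is a RE_Text'
]

def build_replacer(mapping):
    """Hypothetical helper: return a function that expands the keys of mapping."""
    # Longest keys first, so a longer acronym is tried before a shorter one
    # that happens to be its prefix.
    keys = sorted(mapping, key=len, reverse=True)
    alternation = '|'.join(re.escape(k) for k in keys)
    # Same lookarounds as in the answer, but only known acronyms can match.
    pattern = re.compile(r'(?<![A-Z])(' + alternation + r')(?![A-Z])')

    def replace(text):
        # Every match is guaranteed to be a key, so plain indexing is safe.
        return pattern.sub(lambda m: mapping[m.group(1)], text)

    return replace

replace_acronyms = build_replacer(acronym_dict)
result = [replace_acronyms(t) for t in text_list]
print(result)
```

The pattern is compiled a single time and each text is scanned once, instead of once per acronym as in the loop from the question.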