Replacing acronyms with their full forms in Python

xiaoxingxing python 234

Original title: Replacing acronyms with their full forms in Python

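I have a dictionary of acronyms, with the acronyms as keys and their full forms as values.

I want to replace the acronyms in text_list with their full forms to arrive at output_list: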

acronym_dict = {
    'QUO': 'Quotation',
    'IN': 'India',
    'SW': 'Software',
    'RE': 'Regular Expression'
}

text_list = [
    'The status QUO has changed', 'I SWEAR, This is part of IN_SW',
    'The update does not belong to the SW, version, branch',
    'This is a RE_Text'
]

output_list = [
    'The status Quotation has changed',
    'I SWEAR, This is part of India_Software',
    'The update does not belong to the Software, version, branch',
    'This is a Regular Expression_Text'
]

I wrote a method to do this:

import string
def remove_punctuations(text):
    punct_str = string.punctuation  # !"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~
    for punctuation in punct_str:
        text = text.replace(punctuation, ' ')
    return text.strip()

def replace_single_acronym(text, acronym, fullform):
    words = text.split()
    return_words = []
    for w in words:
        if remove_punctuations(w).lower() == acronym.lower():
            return_words.append(w.replace(acronym, fullform))
        else:
            return_words.append(w)
    return " ".join(return_words)

my_op_list = []
for text in text_list:
    for acronym in acronym_dict.keys():
        text = replace_single_acronym(text, acronym, acronym_dict[acronym])
    my_op_list.append(text)

Ideally, output_list and my_op_list should look the same. Instead, it prints the following result (it fails in 2 cases):

['The status Quotation has changed',
 'I SWEAR, This is part of IN_SW',
 'The update does not belong to the Software, version, branch',
 'This is a RE_Text']
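A quick check with the helper from the question suggests why the two underscore cases are skipped. str.split() splits only on whitespace, so IN_SW stays a single token, and once its punctuation is replaced by spaces it becomes 'IN SW', which equals neither acronym. This is only a diagnostic sketch; the variable names token and cleaned are introduced here for illustration.

```python
import string

def remove_punctuations(text):
    # Same helper as in the question: every punctuation character becomes a space.
    for punctuation in string.punctuation:
        text = text.replace(punctuation, ' ')
    return text.strip()

# The last whitespace-separated token of the failing sentence.
token = 'I SWEAR, This is part of IN_SW'.split()[-1]
cleaned = remove_punctuations(token)

print(token)    # IN_SW
print(cleaned)  # IN SW
# The cleaned token matches neither 'in' nor 'sw', so nothing is replaced.
print(cleaned.lower() == 'in', cleaned.lower() == 'sw')  # False False
```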

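In addition, the replace_single_acronym method is very slow on a corpus of 1000 text_list items.

Can someone help me adjust the method so that it works correctly and efficiently?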

Original link: https://stackoverflow.com//questions/71555412/replacing-acronyms-with-their-full-forms-in-python

Replies

  • Daweo commented:

    You can use re.sub for this task, passing a function as its second argument:

    import re
    acronym_dict = {
        'QUO': 'Quotation',
        'IN': 'India',
        'SW': 'Software',
        'RE': 'Regular Expression'
    }
    
    text_list = [
        'The status QUO has changed', 'I SWEAR, This is part of IN_SW',
        'The update does not belong to the SW, version, branch',
        'This is a RE_Text'
    ]
    def get_full_name(m):
        return acronym_dict.get(m.group(1),m.group(1))
    def replace_acronyms(text):
        return re.sub(r'(?<![A-Z])([A-Z]+)(?![A-Z])', get_full_name, text)
    output_list = [replace_acronyms(i) for i in text_list]
    print(output_list)
    

    Output:

    ['The status Quotation has changed', 'I SWEAR, This is part of India_Software', 'The update does not belong to the Software, version, branch', 'This is a Regular Expression_Text']
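To see which substrings the pattern above actually picks up, the sketch below runs it with findall on one of the sample sentences. It also shows, as a side note not taken from the answer, that a plain `\b` word boundary would not split IN_SW, because the underscore counts as a word character. The names pattern and matches are introduced here for illustration.

```python
import re

# Same pattern as in the answer, compiled once for reuse.
pattern = re.compile(r'(?<![A-Z])([A-Z]+)(?![A-Z])')

# With a single capturing group, findall returns the group contents.
matches = pattern.findall('I SWEAR, This is part of IN_SW')
print(matches)  # ['I', 'SWEAR', 'T', 'IN', 'SW']

# For comparison: \b sees no boundary between a letter and '_',
# so neither IN nor SW is found inside IN_SW.
word_boundary_matches = re.findall(r'\b[A-Z]+\b', 'IN_SW')
print(word_boundary_matches)  # []
```

Runs such as 'I', 'SWEAR' and 'T' are matched too, but they are not dictionary keys, so the replacement function returns them unchanged.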
    

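    Explanation: the pattern uses two zero-length assertions and one capturing group. It finds one or more ASCII uppercase letters that are not preceded by an ASCII uppercase letter (negative lookbehind) and not followed by an ASCII uppercase letter (negative lookahead). get_full_name is the function passed as the second argument to re.sub, so it accepts a single argument, the match object. m.group(1) is the content of the only capturing group in the pattern, which is the candidate acronym. I use the dict's .get method so that if the acronym is present among the dict keys, the corresponding value is used; if it is not, the acronym itself is returned, i.e. nothing is changed.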
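Since the question also asks about speed on a larger corpus, one possible variation (not part of the answer above) is to compile a single pattern once and restrict it to the dictionary keys, so that only known acronyms are ever matched. This is a minimal sketch under that assumption; ACRONYM_RE and expand_acronyms are hypothetical names chosen for illustration, and no timing claim is made here.

```python
import re

acronym_dict = {
    'QUO': 'Quotation',
    'IN': 'India',
    'SW': 'Software',
    'RE': 'Regular Expression'
}

text_list = [
    'The status QUO has changed', 'I SWEAR, This is part of IN_SW',
    'The update does not belong to the SW, version, branch',
    'This is a RE_Text'
]

# Build one alternation from the keys; longer keys are tried first so a
# shorter key cannot pre-empt a longer one. re.escape protects any
# regex metacharacters that might appear in a key.
keys = sorted(acronym_dict, key=len, reverse=True)
alternation = '|'.join(re.escape(k) for k in keys)

# Same lookbehind/lookahead idea as in the answer, compiled once.
ACRONYM_RE = re.compile(rf'(?<![A-Z])(?:{alternation})(?![A-Z])')

def expand_acronyms(text):
    # Every match is guaranteed to be a key, so direct indexing is safe.
    return ACRONYM_RE.sub(lambda m: acronym_dict[m.group(0)], text)

result = [expand_acronyms(t) for t in text_list]
print(result)
```

On the sample data this should yield the same list as the answer's version; whether it is faster on a given corpus would need to be measured, for example with timeit.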

    2 years ago · 0 comments