Error running my spaCy summarization function on a text column in a pandas DataFrame

xiaoxingxing nlp 487

Original title: Error running my spacy summarization function on a text column in pandas dataframe

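Below is a spaCy function for summarization. I am trying to run it over a pandas DataFrame column, but I get an empty column every time. I hope someone can help me figure this out.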

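from heapq import nlargest
from string import punctuation

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

def summarize(text, per):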
    nlp = spacy.load('en_core_web_sm')
    doc= nlp(text)
    tokens=[token.text for token in doc]
    word_frequencies={}
    for word in doc:
        if word.text.lower() not in list(STOP_WORDS):
            if word.text.lower() not in punctuation:
                if word.text not in word_frequencies.keys():
                    word_frequencies[word.text] = 1
                else:
                    word_frequencies[word.text] += 1
    max_frequency=max(word_frequencies.values())
    for word in word_frequencies.keys():
        word_frequencies[word]=word_frequencies[word]/max_frequency
    sentence_tokens= [sent for sent in doc.sents]
    sentence_scores = {}
    for sent in sentence_tokens:
        for word in sent:
            if word.text.lower() in word_frequencies.keys():
                if sent not in sentence_scores.keys():                            
                    sentence_scores[sent]=word_frequencies[word.text.lower()]
                else:
                    sentence_scores[sent]+=word_frequencies[word.text.lower()]
    select_length=int(len(sentence_tokens)*per)
    summary=nlargest(select_length, sentence_scores,key=sentence_scores.get)
    final_summary=[word.text for word in summary]
    summary=''.join(final_summary)
    return summary

It also returns an empty result for a sample text:

text = 'gov charlie crist launched amounts nuclear attack republican politics fox news sunday showdown marco rubio crist labeled rubio house speaker tax raiser forth record tax issues crist singled rubios failed 2007 plan eliminated property taxes floridians exchange increase state sales tax tax swap massive tax increase crist said march 28 2010 senate debate respect speaker youve got tell truth people thats rubio contends tax swap huge net tax cut plan supported gov jeb bush tax cut tax hike lets look months speaker early 2007 rubio proposed fundamental change floridas tax structure proposal scratch property taxes primary residences place state sales tax increased 25 cents dollar subject voter approval house analysis originally said swap save taxpayers total 58 billion year certainly contrary crists claim saved money spent money end year likely depended individual circumstances 2007 st petersburg times ran calculations rubios proposal homeowners renters homeowners family annual income 64280 home value 241100 current property tax tampa 506106 sales taxes paid 951 proposed property tax tampa 0 sales taxes paid 1290 rubios plan homeowners paid 4722 state taxes times contrast renters renters family annual income 46914 current rent 851 sales taxes paid 691 proposed rent 851 sales taxes paid 937 rubios plan renters pay additional 246 year taxes rental property owners pay property taxes meaning rent wouldnt affected talked swap swap owned home wouldnt pay tax anymore crist said debate percent fellow floridians renters applied enjoyed tax increase rubio responded renters opportunity buy exorbitant taxes pay property florida gone conversely rubio pointed increased sales tax bring revenue state nonresident visitors tourists contribute said floridians contribute rubios proposal got seal approval grover norquist president americans tax reform rubio supporter 2007 wrote legislators saying rubios tax swap proposal amounted net tax cut speaker rubios proposal net tax cut vote 
proposal constitute violation taxpayer protection pledge norquist wrote taxpayers florida reap benefits lower tax burden significant spending restraint state local level later house study said sales tax increase generate 93 billion exchange eliminating 158 billion property taxes heres house analysis swap combined tax initiatives tallahassee bunch politicians declare 7 billion net tax savings tax increase rep adam hasner rdelray beach told palm beach post vote proposal saying tax increase swap ultimately killed state senate crist spokeswoman andrea saul noted rubio said tax swap tax increase march 28 2010 debate according transcripts rubio said let tell supposed program raise taxes keeps talking probably largest tax increase floridas history eliminated property taxes sorts people supported jeb bush rubio spokesman alberto martinez said rubio mispoke shocking try distort martinez said based statements surround rubios largest tax increase line reasonable meant decrease crist said rubios tax swap proposal massive tax increase basic level rubios proposal tax increase tax decrease state sales tax property taxes micro level people pay pay macro level different studies said floridians paid 58 billion 65 billion generally leery tax impact projections suggestion rubios plan resulted tax increase statewide certainly massive crist suggests'
summarize(text, 0.05)

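I don't know whether the function itself is wrong or something else is going on, but when I then ran it over the DataFrame column, I got an empty column again: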

df['spacy_summary'] = df['final'].apply(lambda x: summarize(x, 0.05))

So I guess the problem is in the function? Any help is appreciated. Thanks!

Original link: https://stackoverflow.com//questions/71504226/error-running-my-spacy-summarization-function-on-a-text-column-in-pandas-datafra

Replies

    ewz93 commented:

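    Your summarization logic assumes that spaCy can recognize valid sentences, but your sample text does not provide any. spaCy will probably put everything into one long sentence; I don't think the text you are feeding it gets split into multiple sentences. Sentence segmentation needs valid text input with punctuation and so on. Try it with a text made up of multiple sentences that spaCy can recognize.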

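    This combines with the fact that you use int(len(sentence_tokens)*per). The int conversion rounds down to the next smaller whole number, so int(1*0.05) = int(0.05) = 0, i.e. it returns 0 sentences. With per = 0.05, this happens for every text that has fewer than 20 segmented sentences. So change that ratio or use something like max(1, int(len(sentence_tokens)*per)).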
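A quick way to check the rounding behaviour described here, as a minimal sketch. `select_length` is a hypothetical helper name that is not in the original code; it only isolates the rounding step and applies the `max(1, ...)` guard.

```python
def select_length(n_sentences: int, per: float) -> int:
    """Number of sentences to keep: at least one, unless there are none."""
    if n_sentences == 0:
        return 0
    return max(1, int(n_sentences * per))


print(int(1 * 0.05))           # 0: int() truncates, so nothing is selected
print(int(19 * 0.05))          # 0: still below one full sentence
print(int(20 * 0.05))          # 1: first size that yields a sentence
print(select_length(1, 0.05))  # 1: the guard keeps one sentence
```

With the guard in place, a text that spaCy treats as a single sentence would be returned whole instead of as an empty string, which may or may not be the summary you want.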

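    Apart from that, I think the code should generally work, although I did not look at every detail. But I am not sure you know exactly what it does: it summarizes by keeping only the per share of the most representative full sentences; it does not change anything at the word level.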
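To make the "keep whole sentences only" behaviour concrete, here is a small sketch of the same frequency-based scoring on sentences that are already tokenized. It is an illustration under stated assumptions, not the asker's function. The name `summarize_sentences`, the `stop_words` parameter and the demo data are made up for this example. It also lowercases words for both counting and lookup, joins the result with spaces and keeps the original sentence order, all of which differ from the code in the question.

```python
from collections import Counter
from heapq import nlargest


def summarize_sentences(sentences, per, stop_words=frozenset()):
    """Keep the top `per` share of sentences, ranked by summed word frequency.

    `sentences` is a list of token lists; with spaCy it could be built as
    [[t.text for t in sent] for sent in doc.sents].
    """
    def is_content(tok):
        # Skip stop words and tokens without letters or digits (punctuation).
        return tok.lower() not in stop_words and any(c.isalnum() for c in tok)

    # Count lowercased words so the lookups below use the same keys.
    freqs = Counter(t.lower() for s in sentences for t in s if is_content(t))
    if not freqs:
        return ""
    top = max(freqs.values())

    # Score each sentence by the normalized frequencies of its words.
    scores = {
        i: sum(freqs[t.lower()] / top for t in s if is_content(t))
        for i, s in enumerate(sentences)
    }
    keep = max(1, int(len(sentences) * per))  # never select zero sentences
    chosen = sorted(nlargest(keep, scores, key=scores.get))  # original order
    return " ".join(" ".join(sentences[i]) for i in chosen)


demo = [
    ["Taxes", "went", "up", "."],
    ["Sales", "taxes", "and", "property", "taxes", "changed", "."],
    ["Renters", "paid", "more", "."],
]
print(summarize_sentences(demo, 0.05, stop_words={"and", "up"}))
# Sales taxes and property taxes changed .
```

The output is one of the input sentences, unchanged: nothing is rewritten or shortened at the word level.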

    2 years ago, 0 comments