spaCy custom sentence segmenter is not working [closed]

心中带点小风骚 2 years ago nlp 609

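I need to split compound sentences, so I decided to split them on conjunctions and punctuation. To do that I tried to use a custom spaCy sentence segmenter, but it is not working.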

My code:

```
import spacy
from spacy.language import Language

nlp = spacy.load("en_core_web_sm")
doc = nlp("Coffee is good but Coffee wasn't hot enough.")

# custom segmenter
@Language.component("component")
def split_on_breaks(doc):

    punct = ['.', '?', '!']
    cc = [token.text for token in doc if token.dep_ == 'cc']  # conjunction

    for token in doc[:-1]:
        if token.text in punct + cc:
            doc[token.i+1].is_sent_start = True
    return doc

# add the component to the pipeline
nlp.add_pipe("component", before='parser')

# splitting
for setns in doc.sents:
    print(setns)
```

Output:

```
Coffee is good but coffee wasn't hot enough.
```

Expected (split on the conjunction 'but'):

```
Coffee is good.
Coffee wasn't hot enough
```

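I don't know why it is not working. Also, in some examples I have seen people use spaCy's SentenceSegmenter to add the function to the pipeline. I don't know whether that is the right approach.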

```
def split_on_breaks(doc):
    -------
    return ---

sbd = SentenceSegmenter(nlp.vocab, strategy=split_on_breaks)
nlp.add_pipe(sbd, first=True)
```

I tried to import SentenceSegmenter with from spacy.pipeline import SentenceSegmenter, but it raised an error. I don't know whether it has been deprecated.

Original link: https://stackoverflow.com//questions/71435511/spacy-custom-sentence-segmenter-is-not-working


polm23 commented:
If you look at your code, you collect the conjunctions here:
```
cc = [token.text for token in doc if token.dep_ == 'cc']  # conjunction
```
But you never set is_sent_start based on those values. You should have something like this:
```
    for token in doc[:-1]:
        # start a new sentence at each coordinating conjunction
        if token.dep_ == 'cc':
            doc[token.i].is_sent_start = True
```
A few other things to keep in mind:

This won't always work the way you want. If you say "I ate cheese and crackers", the "and" there is also a cc.
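
One way to cut down on such false positives is to split only when the conjunction hangs off a verb or auxiliary. This is a rough heuristic for illustration, not something spaCy provides: the helper name clause_conjunctions is invented, both parses are annotated by hand, and it assumes the cc token is attached to the first conjunct.

```python
import spacy
from spacy.tokens import Doc

CLAUSE_HEAD_POS = {"VERB", "AUX"}  # assumption: clause heads carry one of these tags


def clause_conjunctions(doc):
    """Return indices of cc tokens whose head is a verb or auxiliary."""
    return [
        token.i
        for token in doc
        if token.dep_ == "cc" and token.head.pos_ in CLAUSE_HEAD_POS
    ]


vocab = spacy.blank("en").vocab

# "and" joins two nouns here (hand-annotated parse).
phrase = Doc(
    vocab,
    words=["I", "ate", "cheese", "and", "crackers"],
    pos=["PRON", "VERB", "NOUN", "CCONJ", "NOUN"],
    heads=[1, 1, 1, 2, 2],
    deps=["nsubj", "ROOT", "dobj", "cc", "conj"],
)

# "but" joins two clauses here (hand-annotated parse).
clauses = Doc(
    vocab,
    words=["Coffee", "is", "good", "but", "it", "was", "cold"],
    pos=["NOUN", "AUX", "ADJ", "CCONJ", "PRON", "AUX", "ADJ"],
    heads=[1, 1, 1, 1, 5, 1, 5],
    deps=["nsubj", "ROOT", "acomp", "cc", "nsubj", "conj", "acomp"],
)

noun_hits = clause_conjunctions(phrase)
clause_hits = clause_conjunctions(clauses)
print(noun_hits, clause_hits)
```

It still misfires on coordinated verbs that share a subject ("I ate and drank"), so treat it as a filter to tune, not a fix.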

spaCy won't let you modify the text of a Doc, so it will not add a period or change capitalization the way your "expected" example does.
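
If you want output shaped like the "expected" example, you can build new strings from the sentence spans and leave the Doc alone. A small sketch, where tidy_sentences and the conjunction list are made up for illustration and the sentence starts are set by hand to stand in for whatever a custom component produced:

```python
import spacy
from spacy.tokens import Doc

CONJUNCTIONS = {"and", "but", "or"}  # illustrative list, not exhaustive


def tidy_sentences(doc):
    """Build new strings from doc.sents; the Doc itself is left unchanged."""
    results = []
    for sent in doc.sents:
        # Drop a leading conjunction if the sentence has more than one token.
        if len(sent) > 1 and sent[0].lower_ in CONJUNCTIONS:
            sent = sent[1:]
        text = sent.text
        if not text:
            continue
        text = text[0].upper() + text[1:]  # capitalize the first character
        if text[-1] not in ".?!":
            text += "."  # add a final period when none is present
        results.append(text)
    return results


words = ["Coffee", "is", "good", "but", "Coffee", "was", "n't", "hot", "enough", "."]
spaces = [True, True, True, True, True, False, True, True, False, False]
# Sentence starts set by hand, as a custom component might leave them.
sent_starts = [True, False, False, True, False, False, False, False, False, False]

doc = Doc(spacy.blank("en").vocab, words=words, spaces=spaces, sent_starts=sent_starts)
tidied = tidy_sentences(doc)
print(tidied)
```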

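There is a bug in spaCy when you set sentence boundaries for some tokens but not all of them. It may not affect you at all, but it could come up. If you find the parser ignoring your settings, you can run a sentencizer before your component to avoid the problem.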
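
A sketch of that ordering, with the sentencizer first and the custom component after it. To keep it runnable without a trained model it uses a blank English pipeline, so the hypothetical component split_on_but keys on the token text; it shows the ordering only and does not reproduce the bug itself:

```python
import spacy
from spacy.language import Language


@Language.component("split_on_but")  # hypothetical component name
def split_on_but(doc):
    # Keyed on the token text because this blank pipeline has no parser,
    # so token.dep_ is not available here.
    for token in doc[1:]:
        if token.lower_ == "but":
            token.is_sent_start = True
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")   # gives every token an explicit True/False value
nlp.add_pipe("split_on_but")  # then adds the extra boundaries on top

doc = nlp("Coffee is good but Coffee wasn't hot enough.")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```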

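As a general matter, splitting sentences like this is called "sentence simplification" and is a research topic. You may want to preprocess your text before passing it to spaCy. (We have considered adding functionality for this to spaCy, but have not done any work on it yet.)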
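
As a very naive illustration of the preprocessing route, the raw string can be cut up with a regular expression before spaCy ever sees it. The function name presplit and the conjunction list are assumptions for this sketch, and it will misfire on phrases such as "nothing but trouble":

```python
import re

# Illustrative list only; "and"/"or" are left out because they often join
# noun phrases rather than clauses.
CONJUNCTIONS = ("but", "yet")

# Split on a conjunction surrounded by whitespace, or on the whitespace
# that follows sentence-final punctuation.
_BREAK = re.compile(
    r"\s+(?:" + "|".join(CONJUNCTIONS) + r")\s+|(?<=[.?!])\s+",
    flags=re.IGNORECASE,
)


def presplit(text):
    """Split raw text into rough clause strings before any spaCy processing."""
    return [part.strip() for part in _BREAK.split(text) if part.strip()]


parts = presplit("Coffee is good but Coffee wasn't hot enough. Tea was fine.")
print(parts)
```

Each resulting string can then be processed on its own, for example with nlp.pipe(parts).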
