I'm trying to scrape the most frequent words in a web page and filter out stop words

xiaoxingxing · 2 years ago · nlp · 178

Original title: I’m trying to scrape most frequent words in a web page and filter out stop words

My code scrapes the most frequent words, but as soon as I add code to remove or filter out stop words, my output gets funky. Here is my full code:

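import nltk
import requests
from bs4 import BeautifulSoup
from collections import Counter
from nltk.corpus import stopwords

nltk.download('stopwords')

stopwords = stopwords.words('english')
print(stopwords)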

def start(url):
    worldlist = []
    source_code = requests.get(url).text
    soup = BeautifulSoup(source_code, 'html.parser')
    for each_text in soup.findAll('div', {'class': 'centerPar'}):
        content = each_text.text
        words = content.lower().split()
        for each_word in words:
            worldlist.append(each_word)
        clean_wordlist(worldlist)

def clean_wordlist(wordlist):
    clean_list = []
    for word in wordlist:
        symbols = "!@#$%^&*()_-+={[}]|\\;:\"<>?/., "
        for i in range(len(symbols)):
            word = word.replace(symbols[i], '')
        if len(word) > 0:
            clean_list.append(word)
    filter_list(clean_list)

def filter_list(clean_list):
    filtered_list = []
    for word in clean_list:
        for i in range(len(stopwords)):
            word = word.replace(stopwords[i], '')
        if len(word) > 0:
            filtered_list.append(word)
    create_dict(filtered_list)

def create_dict(filtered_list):
    word_count = {}
    for word in filtered_list:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1
    c = Counter(word_count)
    top = c.most_common(20)
    print(top)

if __name__ == '__main__':
    url = "https://www.pwc.com/us/en/about-us/purpose-and-values.html"
    start(url)

I'm having a problem with the filter_list function.

My output is as follows:

[('n', 82), ('e', 23), ('f', 17), ('r', 14), ('ce', 14), ('wk', 14), ('pwc', 13), ('gl', 13), ('c', 12), ('wh', 12), ('u', 12), ('l', 11), ('cn', 10), ('peple', 10), ('purpe', 9), ('’', 9), ('ke', 7), ('en', 7), ('h', 7), ('p', 7)]

Original link: https://stackoverflow.com//questions/71887718/im-trying-to-scrape-most-frequent-words-in-a-web-page-and-filter-out-stop-words


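ewz93 commented:
You are replacing the stop words inside each token with an empty string.

So if the token is exactly a stop word, its length becomes 0 and it is correctly filtered out. If it contains no substring that is a stop word, it is correctly appended whole.

In all other cases (which is most tokens in any text), the stop-word substrings inside the token are removed, leaving a meaningless string.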

E.g. suppose your word is programming and your stopword list contains short words such as ['it', 'am', 'in', 'i', 'a', ...]. Your replacement results in:

programming -> progrmg
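
The effect is easy to reproduce in isolation. The following is a minimal sketch, assuming a small hand-picked stop-word list (the real NLTK list is much longer) and a hypothetical helper `strip_substrings` that mirrors the replacement loop in the question's filter_list:

```python
# Hypothetical five-word stop list, for illustration only; NLTK's list is longer.
SAMPLE_STOPWORDS = ['it', 'am', 'in', 'i', 'a']

def strip_substrings(word, stopwords):
    """Mimic the buggy loop: delete every stop word that occurs as a substring."""
    for stopword in stopwords:
        word = word.replace(stopword, '')
    return word

print(strip_substrings('programming', SAMPLE_STOPWORDS))  # progrmg
print(repr(strip_substrings('it', SAMPLE_STOPWORDS)))     # '' (an exact stop word vanishes)
print(strip_substrings('pwc', SAMPLE_STOPWORDS))          # pwc (no stop-word substring)
```

Only the second and third cases behave as intended; any word that merely contains a stop word gets mangled like the first.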

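To fix this, check whether the whole word string is in the list of stop-word strings, which skips the temporary modification of the word variable entirely: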
```
def filter_list(clean_list):
    filtered_list = []
    for word in clean_list:
        if word not in stopwords:
            filtered_list.append(word)
    create_dict(filtered_list)
```
Or as a list comprehension:
```
def filter_list(clean_list):
    filtered_list = [word for word in clean_list if word not in stopwords]
    create_dict(filtered_list)
```
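
As a further sketch that goes beyond the answer above, the filtering and counting steps could be combined. Converting the stop-word list to a set makes each membership test constant-time on average, and `Counter` can count the words directly, so the manual dictionary in create_dict would not be needed. The function name `top_words` and the tiny sample inputs are assumptions for illustration; in the question's script the stop words would come from `stopwords.words('english')`.

```python
from collections import Counter

def top_words(words, stopwords, n=20):
    """Return the n most common words that are not stop words."""
    stopword_set = set(stopwords)  # set lookup instead of scanning a list
    kept = [word for word in words if word not in stopword_set]
    return Counter(kept).most_common(n)

# Tiny hand-made input; a real run would pass the cleaned word list
# and the NLTK English stop words instead.
sample_words = ['our', 'purpose', 'is', 'to', 'build', 'trust', 'and', 'trust', 'matters']
sample_stopwords = ['our', 'is', 'to', 'and']
print(top_words(sample_words, sample_stopwords, n=2))  # [('trust', 2), ('purpose', 1)]
```

Returning the result instead of printing it inside the function also makes the step easier to test on its own.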