I'm trying to scrape most frequent words in a web page and filter out stop words
My code scrapes the most frequent words, but as soon as I add code to remove or filter out stop words, my output goes funky. Here is my full code:
import requests
import nltk
from bs4 import BeautifulSoup
from collections import Counter
from nltk.corpus import stopwords

nltk.download('stopwords')
stopwords = stopwords.words('english')
print(stopwords)


def start(url):
    # Download the page and collect every lowercase token from the target divs
    worldlist = []
    source_code = requests.get(url).text
    soup = BeautifulSoup(source_code, 'html.parser')
    for each_text in soup.findAll('div', {'class': 'centerPar'}):
        content = each_text.text
        words = content.lower().split()
        for each_word in words:
            worldlist.append(each_word)
    clean_wordlist(worldlist)


def clean_wordlist(wordlist):
    # Strip punctuation characters from each token
    clean_list = []
    for word in wordlist:
        symbols = "!@#$%^&*()_-+={[}]|\;:\"<>?/., "
        for i in range(len(symbols)):
            word = word.replace(symbols[i], '')
        if len(word) > 0:
            clean_list.append(word)
    filter_list(clean_list)


def filter_list(clean_list):
    # Attempt to remove stop words (this is the step that misbehaves)
    filtered_list = []
    for word in clean_list:
        for i in range(len(stopwords)):
            word = word.replace(stopwords[i], '')
        if len(word) > 0:
            filtered_list.append(word)
    create_dict(filtered_list)


def create_dict(filtered_list):
    # Count occurrences and print the 20 most common words
    word_count = {}
    for word in filtered_list:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1
    c = Counter(word_count)
    top = c.most_common(20)
    print(top)


if __name__ != '__main__':
    pass
else:
    url = "https://www.pwc.com/us/en/about-us/purpose-and-values.html"
    start(url)
I'm having trouble with the filter_list function.
My output looks like this:
[('n', 82), ('e', 23), ('f', 17), ('r', 14), ('ce', 14), ('wk', 14), ('pwc', 13), ('gl', 13), ('c', 12), ('wh', 12), ('u', 12), ('l', 11), ('cn', 10), ('peple', 10), ('purpe', 9), ('’', 9), ('ke', 7), ('en', 7), ('h', 7), ('p', 7)]
Reply
ewz93 commented:
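You are replacing the stop words inside each token with an empty string.
So if the token is exactly a stop word, its length becomes 0 and it is correctly filtered out. If it contains no substring that is a stop word, it is correctly appended unchanged.
In all other cases (which is most tokens in any text), the stop word substrings inside the token are removed, leaving a meaningless string.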
E.g. if your word is programming and your stopword list contains short words such as ['it', 'am', 'in', 'i', 'a', ...], the replacement strips "am" and "in" out of the word and leaves something like: programming –> progrmg
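To make the failure concrete, here is a minimal sketch that reproduces the substring replacement on the example word. It assumes only the five illustrative stop words listed above rather than the full NLTK list, and the helper name strip_substrings is made up for this illustration.

```python
# Illustrative stop words from the example above (an assumption, not the full NLTK list).
sample_stopwords = ['it', 'am', 'in', 'i', 'a']


def strip_substrings(word, stopwords):
    """Mimic the problematic loop: delete every stop word found as a substring."""
    for stopword in stopwords:
        word = word.replace(stopword, '')
    return word


mangled = strip_substrings('programming', sample_stopwords)
print(mangled)                                   # progrmg
print(strip_substrings('it', sample_stopwords))  # empty string: a real stop word vanishes
print('programming' in sample_stopwords)         # False: a whole-word check leaves it alone
```

With the full NLTK list the damage is usually worse, since that list contains even more short entries that can match inside ordinary words.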
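To fix this, check whether the whole word string is in the list of stop word strings, which skips the ad hoc modification of the word variable entirely:

def filter_list(clean_list):
    filtered_list = []
    for word in clean_list:
        if word not in stopwords:
            filtered_list.append(word)
    create_dict(filtered_list)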
Or as a list comprehension:

def filter_list(clean_list):
    filtered_list = [word for word in clean_list if word not in stopwords]
    create_dict(filtered_list)
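As an optional variation, the filtering and counting steps could be folded into one function. This is only a sketch under a few assumptions: top_words is a hypothetical name, the token and stop word lists are made-up sample data instead of the scraped page, and the stop words are converted to a set so each membership test does not scan the whole list. Counter can also count the tokens directly, so the manual dictionary is not needed.

```python
from collections import Counter


def top_words(tokens, stopwords, n=20):
    """Drop tokens that are exactly a stop word, then return the n most common."""
    stopword_set = set(stopwords)  # set lookup instead of scanning a list
    kept = [token for token in tokens if token not in stopword_set]
    return Counter(kept).most_common(n)


# Made-up sample data standing in for the cleaned word list.
tokens = ['our', 'purpose', 'is', 'to', 'build', 'trust',
          'trust', 'people', 'people', 'trust']
sample_stopwords = ['our', 'is', 'to']

result = top_words(tokens, sample_stopwords, n=2)
print(result)  # [('trust', 3), ('people', 2)]
```

In the original script you would pass clean_list and the NLTK stopwords list in place of the sample data.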
2 years ago