How to calculate a value based on the similarity of two strings

心中带点小风骚 2 years ago nlp 176

Original title: How to calculate a value based on the similarity of two strings

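I have two strings (each stored as a list of words), and I want to compute a value based on the similar words that appear in both of them.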

actual = ['I', 'am', 'a', 'student', 'from', 'computer', 'science', 'department']
predicted = ['computer', 'and', 'science', 'department']

Above are the two example strings I want to compare.

Example: compared with the actual string, 3 similar words appear in the predicted string.

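The expected output is the lengths of the actual and predicted strings and the number of similar words that occur in both, which is 3 in this case.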

length of actual = 8
length of predicted = 4
similar word count = 3

Original link: https://stackoverflow.com//questions/71903653/how-to-calculate-a-value-based-on-the-similarity-of-two-strings

Replies


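linuskmr commented
My idea is to convert actual and predicted to sets and then build the intersection of the two. Note, however, that this does not account for repeated occurrences: sets contain no duplicates, so actual = ['computer', 'computer', 'computer'] and predicted = ['computer', 'computer', 'computer'] would give a similar word count of 1.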
```
actual = ['I', 'am', 'a', 'student', 'from', 'computer', 'science', 'department']
predicted = ['computer', 'and', 'science', 'department']

print("length of actual = ", len(actual))
print("length of predicted = ", len(predicted))
print("similar word count = ", len(set(actual).intersection(set(predicted))))
```
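To make the caveat about duplicates concrete, here is a minimal sketch that runs the same set intersection on the repeated-word lists mentioned above. The helper name `unique_overlap` is made up for illustration and is not part of the original answer.

```python
def unique_overlap(actual, predicted):
    """Count distinct words that appear in both lists (duplicates collapse)."""
    return len(set(actual) & set(predicted))


actual = ['computer', 'computer', 'computer']
predicted = ['computer', 'computer', 'computer']

# Each set holds only one 'computer', so the intersection has a single element.
print(unique_overlap(actual, predicted))  # prints 1
```

If repeated words should each count, a set-based approach like this is probably not the right fit.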
2 years ago, 0 comments
BrokenBenchmark commented
linuskmr's answer works well if you want each identical word to count only once in the similarity calculation.

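However, if you want the same word to be counted multiple times, you can use collections.Counter instead of sets. The & operator gives you the words that appear in both lists, each with the minimum of its counts in the two lists. The per-word counts are stored in the counter's values, and we want the total number of common occurrences, so we use .values() and sum():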
```
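from collections import Counter

# Word lists from the question
actual = ['I', 'am', 'a', 'student', 'from', 'computer', 'science', 'department']
predicted = ['computer', 'and', 'science', 'department']

# & keeps each shared word with the smaller of its two counts
result = sum((Counter(actual) & Counter(predicted)).values())
print(result)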
```
This outputs:
```
3
```
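As a rough illustration of how the Counter intersection treats repeated words, the sketch below wraps the same expression in a helper and tries it on a few small lists. The name `multiset_overlap` and the sample inputs are assumptions chosen for this example, not part of the original answer.

```python
from collections import Counter


def multiset_overlap(actual, predicted):
    """Count shared words, where each word contributes
    min(count in actual, count in predicted) occurrences."""
    common = Counter(actual) & Counter(predicted)
    return sum(common.values())


# Three copies on both sides: all three occurrences are counted.
print(multiset_overlap(['computer'] * 3, ['computer'] * 3))  # prints 3

# Three copies versus one: the smaller count (1) is used.
print(multiset_overlap(['computer'] * 3, ['computer']))  # prints 1

# No shared words: the intersection is empty.
print(multiset_overlap(['a', 'b'], ['c']))  # prints 0
```

On the lists from the question, where no word repeats, this gives the same value as the set-based approach.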
2 years ago, 0 comments