In R, to check if phrases occur in free text field
I have a list of phrases like the ones below (in reality there are many more).
phrase_1 <- c("covid19","coronavirus","ill","illness","pandemic")
phrase_2 <- c("e-mail","email","sent","attachment","recipient","data","signature","disclose")
phrase_3 <- c("it","issue","problem","server","network")
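I also have a column in a data frame (df) that contains a lot of free text. Let's call this column "Comments" (df$Comments).
The overall goal is to check whether the phrases occur in that column.
For my purposes, a phrase is deemed to occur in a text when at least 3 of its elements are present. For example, if a text contains the words "covid19", "coronavirus" and "ill", then I say that phrase_1 occurred in that text. If, on the other hand, it only contains "covid19" and "coronavirus", then the phrase did not occur.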
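To make the "at least 3 elements" rule concrete, here is a minimal base R sketch for a single text. The helper name `phrase_occurs` and its `min_hits` argument are hypothetical, and it assumes whole-word matching after lowercasing, which is one possible reading of the rule rather than something stated in the question.

```r
# Hypothetical helper: TRUE when at least `min_hits` distinct words of
# `phrase` appear as whole words in `text`.
phrase_occurs <- function(text, phrase, min_hits = 3) {
  # Lowercase, then split on anything that is not a letter, digit or hyphen
  # (the hyphen is kept so that "e-mail" stays a single token).
  words <- strsplit(tolower(text), "[^a-z0-9-]+")[[1]]
  sum(phrase %in% words) >= min_hits
}

phrase_1 <- c("covid19", "coronavirus", "ill", "illness", "pandemic")

# Three phrase words present, so the phrase counts as occurring
phrase_occurs("covid19 and coronavirus made me ill", phrase_1)

# Only two phrase words present, so it does not
phrase_occurs("covid19 and coronavirus", phrase_1)
```

Matching on whole words is a deliberate choice here: a substring search for a short entry such as "it" would also hit words like "with".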
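My goal is to add 2 columns to df: "Number of phrases" and "Phrases that occurred".
I want the "Number of phrases" column to tell me how many phrases occurred in the text, and the other column to give the names of those phrases (e.g. if the text contains "covid19", "coronavirus", "ill", "it", "issue", "problem", then df$'Number of phrases' = 2 and df$'Phrases that occurred' = "Phrase1", "Phrase3").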
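As an illustration of the two target columns, the following base R sketch builds them for a small made-up data frame. The toy comments, the list names `Phrase1` to `Phrase3`, and the whole-word tokenisation are assumptions made for the example; the real df$Comments would take the place of the toy data.

```r
phrase_list <- list(
  Phrase1 = c("covid19", "coronavirus", "ill", "illness", "pandemic"),
  Phrase2 = c("e-mail", "email", "sent", "attachment", "recipient",
              "data", "signature", "disclose"),
  Phrase3 = c("it", "issue", "problem", "server", "network")
)

# Toy data, invented for this example
df <- data.frame(
  Comments = c("covid19 coronavirus ill it issue problem",
               "covid19 coronavirus",
               "the server had a network issue"),
  stringsAsFactors = FALSE
)

# One vector of lowercase word tokens per comment
tokens <- strsplit(tolower(df$Comments), "[^a-z0-9-]+")

# Logical matrix: one row per comment, one column per phrase,
# TRUE when at least 3 words of that phrase are present
hits <- t(vapply(
  tokens,
  function(w) vapply(phrase_list, function(p) sum(p %in% w) >= 3, logical(1)),
  logical(length(phrase_list))
))
colnames(hits) <- names(phrase_list)

df$`Number of phrases` <- rowSums(hits)
df$`Phrases that occurred` <- apply(
  hits, 1, function(r) paste(colnames(hits)[r], collapse = ", ")
)

df
```

Each comment is tokenised once and each phrase is checked with `%in%`, so there is no explicit loop over individual words.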
Below is the code I have tried, but somehow I think there must be a more efficient way to achieve this.
phrase_1 <- c("covid19","coronavirus","ill","illness","pandemic")
phrase_2 <- c("e-mail","email","sent","attachment","recipient","data","signature","disclose")
phrase_3 <- c("it","issue","problem","server","network")
phrase_list <- list(phrase_1,phrase_2,phrase_3)
phrase.tally.list <- data.frame()
percentage.table <- data.frame()
# Loop over phrases (i), rows of df (j) and words within a phrase (k)
for(i in 1:length(phrase_list)){
  phrase <- phrase_list[[i]]
  for(j in 1:dim(df)[1]){
    x <- 0
    for(k in 1:length(phrase)){
      # grep() does regex substring matching, so "it" also matches inside longer words
      if(length(grep(pattern = phrase[[k]], x = df$Comments[j]))>0){x=x+1}else{x=x+0}
    }
    if(x>2){x=1}else{x=0} ## Phrase Count Threshold
    phrase.tally.list[j,i] <- x
    names(phrase.tally.list)[i] <- paste("Phrase",i,sep=" ")
  }
}
Replies
danlooo commented:
library(tidyverse)

phrase_1 <- c("covid19", "coronavirus", "ill", "illness", "pandemic")
phrase_2 <- c("e-mail", "email", "sent", "attachment", "recipient", "data", "signature", "disclose")
phrase_3 <- c("it", "issue", "problem", "server", "network")

search_corpus <-
  list(
    phrase_1 = phrase_1,
    phrase_2 = phrase_2,
    phrase_3 = phrase_3
  ) %>%
  enframe(value = "word") %>%
  unnest(word)

df <- tibble(
  id = seq(3),
  Comments = c("Lorem ipsum covid19", "Lorem ipsum it issue problem", "Lorem ipsum")
)

df %>%
  separate_rows(Comments, sep = " ") %>%
  inner_join(search_corpus, by = c("Comments" = "word"))
#> # A tibble: 4 × 3
#>      id Comments name
#>   <int> <chr>    <chr>
#> 1     1 covid19  phrase_1
#> 2     2 it       phrase_3
#> 3     2 issue    phrase_3
#> 4     2 problem  phrase_3

df %>%
  separate_rows(Comments, sep = " ") %>%
  inner_join(search_corpus, by = c("Comments" = "word")) %>%
  count(id, name)
#> # A tibble: 2 × 3
#>      id name         n
#>   <int> <chr>    <int>
#> 1     1 phrase_1     1
#> 2     2 phrase_3     3
Created on 2022-03-10 by the reprex package (v2.0.0)
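The answer stops at the per-phrase word counts. One possible way to carry it through to the two columns the question asks for is sketched below. It is not part of danlooo's answer. The `distinct()` step, the `n >= 3` filter and the names `phrase_summary` and `result` are assumptions added for illustration, using the same toy data.

```r
library(tidyverse)

search_corpus <- list(
  phrase_1 = c("covid19", "coronavirus", "ill", "illness", "pandemic"),
  phrase_2 = c("e-mail", "email", "sent", "attachment", "recipient",
               "data", "signature", "disclose"),
  phrase_3 = c("it", "issue", "problem", "server", "network")
) %>%
  enframe(value = "word") %>%
  unnest(word)

df <- tibble(
  id = seq(3),
  Comments = c("Lorem ipsum covid19", "Lorem ipsum it issue problem", "Lorem ipsum")
)

# Per comment: which phrases reach the threshold of 3 distinct words
phrase_summary <- df %>%
  separate_rows(Comments, sep = " ") %>%
  inner_join(search_corpus, by = c("Comments" = "word")) %>%
  distinct(id, Comments, name) %>%   # count a repeated word only once
  count(id, name) %>%
  filter(n >= 3) %>%                 # assumed threshold from the question
  group_by(id) %>%
  summarise(
    `Number of phrases` = n(),
    `Phrases that occurred` = paste(name, collapse = ", "),
    .groups = "drop"
  )

# Join back so comments without any qualifying phrase are kept
result <- df %>%
  left_join(phrase_summary, by = "id") %>%
  mutate(
    `Number of phrases` = replace_na(`Number of phrases`, 0L),
    `Phrases that occurred` = replace_na(`Phrases that occurred`, "")
  )

result
```

With this toy data only the second comment reaches the threshold (three words from phrase_3); the other two rows get a count of 0 and an empty string.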