In R, check whether phrases occur in a free-text field

xiaoxingxing 2 years ago nlp 197

Original title: In R, to check if phrases occur in free text field

I have a list of phrases like the ones below (in reality, there are many more).

phrase_1 <- c("covid19","coronavirus","ill","illness","pandemic")
phrase_2 <- c("e-mail","email","sent","attachment","recipient","data","signature","disclose")
phrase_3 <- c("it","issue","problem","server","network")

I also have a column in a data frame (df) that contains a lot of free text. Let's call this column "Comments" (df$Comments).

The overall goal is to check whether the phrases occur in that column.

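For my purposes, a phrase counts as occurring in a text when at least 3 of its elements are present. For example, if the text contains the words "covid19", "coronavirus" and "ill", then I say that phrase_1 occurs in that text. On the other hand, if it contains only "covid19" and "coronavirus", the phrase does not occur.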
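The rule above could be sketched for a single comment roughly as follows. The helper name `phrase_occurs`, its `min_hits` argument, and the choice to match whole lowercase words (rather than substrings) are assumptions made for illustration, not part of the original question.

```r
# Hypothetical helper: TRUE when at least `min_hits` elements of `phrase`
# appear in `text` as whole words (case-insensitive).
phrase_occurs <- function(text, phrase, min_hits = 3) {
  # Lowercase, then split on anything that is not a letter, digit or hyphen
  tokens <- unlist(strsplit(tolower(text), "[^a-z0-9-]+"))
  sum(phrase %in% tokens) >= min_hits
}

phrase_1 <- c("covid19", "coronavirus", "ill", "illness", "pandemic")

three_hits <- phrase_occurs("covid19 and coronavirus made me ill", phrase_1)
two_hits   <- phrase_occurs("covid19 and coronavirus", phrase_1)
```

Here `three_hits` is TRUE because three elements of phrase_1 are present, while `two_hits` is FALSE because only two are.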

My goal is to add 2 columns to df: "Number of phrases" and "Phrases that occurred".

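I want the "Number of phrases" column to tell me how many phrases occur in the text, and the "Phrases that occurred" column to give the names of those phrases (e.g. if the text contains "covid19", "coronavirus", "ill", "it", "issue", "problem", then df$'Number of phrases' = 2 and df$'Phrases that occurred' = "Phrase1", "Phrase3").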

Below is the code I have tried, but I suspect there must be a more efficient way to do this.

phrase_1 <- c("covid19","coronavirus","ill","illness","pandemic")
phrase_2 <- c("e-mail","email","sent","attachment","recipient","data","signature","disclose")
phrase_3 <- c("it","issue","problem","server","network")

phrase_list <- list(phrase_1,phrase_2,phrase_3)

phrase.tally.list <- data.frame()
percentage.table <-data.frame()

for (i in seq_along(phrase_list)) {
  phrase <- phrase_list[[i]]
  for (j in seq_len(nrow(df))) {
    x <- 0
    # Count how many elements of this phrase are found in comment j
    for (k in seq_along(phrase)) {
      if (grepl(pattern = phrase[[k]], x = df$Comments[j])) x <- x + 1
    }
    # Phrase count threshold: flag the phrase when at least 3 elements match
    phrase.tally.list[j, i] <- if (x > 2) 1 else 0
    names(phrase.tally.list)[i] <- paste("Phrase", i, sep = " ")
  }
}
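A side note on the matching in the loop above: grep() and grepl() match substrings by default, so a short element such as "it" is also found inside longer words. If whole-word matching is what is wanted (an assumption, since the question does not say), wrapping the pattern in `\\b` word boundaries is one option, sketched here with made-up sentences:

```r
# Substring matching: "it" is found inside "with"
substring_hit <- grepl("it", "I agree with this")

# Word-boundary matching: only the standalone word counts
boundary_miss <- grepl("\\bit\\b", "I agree with this")
boundary_hit  <- grepl("\\bit\\b", "Is it working?")
```

`substring_hit` and `boundary_hit` are TRUE, while `boundary_miss` is FALSE.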

Original link: https://stackoverflow.com//questions/71424333/in-r-to-check-if-phrases-occur-in-free-text-field

danlooo commented

library(tidyverse)

phrase_1 <- c("covid19", "coronavirus", "ill", "illness", "pandemic")
phrase_2 <- c("e-mail", "email", "sent", "attachment", "recipient", "data", "signature", "disclose")
phrase_3 <- c("it", "issue", "problem", "server", "network")

search_corpus <- list(
  phrase_1 = phrase_1,
  phrase_2 = phrase_2,
  phrase_3 = phrase_3
) %>%
  enframe(value = "word") %>%
  unnest(word)


df <- tibble(
  id = seq(3),
  Comments = c("Lorem ipsum covid19",
     "Lorem ipsum it issue problem", "Lorem ipsum")
)

df %>%
  separate_rows(Comments, sep = " ") %>%
  inner_join(search_corpus, by = c("Comments" = "word"))
#> # A tibble: 4 × 3
#>      id Comments name    
#>   <int> <chr>    <chr>   
#> 1     1 covid19  phrase_1
#> 2     2 it       phrase_3
#> 3     2 issue    phrase_3
#> 4     2 problem  phrase_3

df %>%
  separate_rows(Comments, sep = " ") %>%
  inner_join(search_corpus, by = c("Comments" = "word")) %>%
  count(id, name)
#> # A tibble: 2 × 3
#>      id name         n
#>   <int> <chr>    <int>
#> 1     1 phrase_1     1
#> 2     2 phrase_3     3

Created on 2022-03-10 by the reprex package (v2.0.0)
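The counts above stop one step short of the two columns the question asks for. A possible continuation, offered as a sketch rather than as part of the original answer, applies the threshold of 3 and joins the summary back onto df. The example comments, the `id` column, the `hits` column name, and the lowercase, punctuation-stripping tokenization are all assumptions; a real df without an id could get one with `mutate(id = row_number())`.

```r
library(dplyr)
library(tidyr)
library(tibble)

phrases <- list(
  Phrase1 = c("covid19", "coronavirus", "ill", "illness", "pandemic"),
  Phrase2 = c("e-mail", "email", "sent", "attachment", "recipient",
              "data", "signature", "disclose"),
  Phrase3 = c("it", "issue", "problem", "server", "network")
)

# Long lookup table: one row per (phrase, word) pair
search_corpus <- enframe(phrases, name = "phrase", value = "word") %>%
  unnest(word)

# Made-up example data
df <- tibble(
  id = 1:3,
  Comments = c(
    "Covid19, coronavirus and ill staff; it is an issue, a server problem.",
    "Lorem ipsum it issue problem",
    "Lorem ipsum covid19"
  )
)

phrase_hits <- df %>%
  mutate(token = tolower(Comments)) %>%
  # One row per token; split on anything but letters, digits and hyphens
  separate_rows(token, sep = "[^a-z0-9-]+") %>%
  distinct(id, token) %>%                       # count each word once per comment
  inner_join(search_corpus, by = c("token" = "word")) %>%
  count(id, phrase, name = "hits") %>%
  filter(hits >= 3) %>%                         # phrase count threshold
  group_by(id) %>%
  summarise(
    `Number of phrases` = dplyr::n(),
    `Phrases that occurred` = paste(phrase, collapse = ", "),
    .groups = "drop"
  )

# Comments with no qualifying phrase get 0 and an empty string
result <- df %>%
  left_join(phrase_hits, by = "id") %>%
  mutate(
    `Number of phrases` = coalesce(`Number of phrases`, 0L),
    `Phrases that occurred` = coalesce(`Phrases that occurred`, "")
  )
```

With these three example comments, `result` has 2, 1 and 0 in "Number of phrases", and "Phrase1, Phrase3", "Phrase3" and an empty string in "Phrases that occurred". Matching on whole tokens also avoids counting "it" inside words such as "with".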
