In R, checking whether phrases occur in a free-text field


Original title: In R, to check if phrases occur in free text field

I have a list of phrases like the following (in reality, there are many more).

phrase_1 <- c("covid19","coronavirus","ill","illness","pandemic")
phrase_2 <- c("e-mail","email","sent","attachment","recipient","data","signature","disclose")
phrase_3 <- c("it","issue","problem","server","network")

I also have a column in a data frame (df) that contains a lot of free text. Let's call this column "Comments" (df$Comments).

The overall goal is to check whether the phrases occur in that column.

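For my purposes, a phrase counts as occurring in a text when at least 3 of its elements are present. For example, if the text contains the words "covid19", "coronavirus", and "ill", then I say that phrase_1 occurs in that text. On the other hand, if it contains only "covid19" and "coronavirus", the phrase does not occur.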
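To make the threshold rule concrete, here is a minimal sketch for a single text. The helper name `phrase_occurs` and its `min_hits` argument are hypothetical, and the sketch assumes words are separated by whitespace and that matching should be case-insensitive and on whole words; punctuation handling is left out.

```r
# Hypothetical helper: TRUE when at least `min_hits` distinct elements of
# `phrase` appear as whole words in `text`.
phrase_occurs <- function(text, phrase, min_hits = 3) {
  # Lower-case the text and split it on runs of whitespace.
  words <- unique(strsplit(tolower(text), "\\s+")[[1]])
  # Count how many phrase elements are among the words of the text.
  sum(phrase %in% words) >= min_hits
}

phrase_1 <- c("covid19", "coronavirus", "ill", "illness", "pandemic")

three_hits <- phrase_occurs("covid19 coronavirus ill", phrase_1)  # three elements present
two_hits   <- phrase_occurs("covid19 coronavirus", phrase_1)      # only two elements present
```

Under these assumptions the first call meets the threshold and the second does not.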

My goal is to add 2 columns to df: "Number of phrases" and "Phrases that occurred".

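The "Number of phrases" column should record how many phrases occurred in the text, and the other column should list the names of those phrases. For example, if the text contains "covid19", "coronavirus", "ill", "it", "issue", "problem", then df$'Number of phrases' = 2 and df$'Phrases that occurred' = "Phrase1", "Phrase3".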

Below is the code I have tried, but I think there must be a more efficient way to achieve this.

phrase_1 <- c("covid19","coronavirus","ill","illness","pandemic")
phrase_2 <- c("e-mail","email","sent","attachment","recipient","data","signature","disclose")
phrase_3 <- c("it","issue","problem","server","network")

phrase_list <- list(phrase_1,phrase_2,phrase_3)

phrase.tally.list <- data.frame()
percentage.table <-data.frame()

for(i in 1:length(phrase_list)){
  phrase <- phrase_list[[i]]
  for(j in 1:dim(df)[1]){
    x <- 0
    for(k in 1:length(phrase)){
      if(length(grep(pattern = phrase[[k]], x = df$Comments[j]))>0){x=x+1}else{x=x+0}
    }
    if(x>2){x=1}else{x=0} ## Phrase Count Threshold
    phrase.tally.list[j,i]<- x
    names(phrase.tally.list)[i]<-paste("Phrase",i,sep=" ")
  }
}
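One thing worth noting about the loop above: `grep()` treats its pattern as a regular expression and matches substrings, so a short element such as "it" is also found inside longer words. The following sketch shows the effect and one possible workaround using word boundaries; the sample sentences are made up for illustration.

```r
# "it" is found inside "with", so a plain pattern over-counts.
substring_match <- grepl("it", "problem with the server")

# Word boundaries (\\b) restrict the match to the standalone word.
boundary_no_match <- grepl("\\bit\\b", "problem with the server", perl = TRUE)
boundary_match    <- grepl("\\bit\\b", "it has a server problem", perl = TRUE)
```

Here `substring_match` is TRUE even though the word "it" never appears on its own, while the boundary version is FALSE for the first sentence and TRUE for the second.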



Original link: https://stackoverflow.com//questions/71424333/in-r-to-check-if-phrases-occur-in-free-text-field

Replies

  • danlooo commented:
    library(tidyverse)
    
    phrase_1 <- c("covid19", "coronavirus", "ill", "illness", "pandemic")
    phrase_2 <- c("e-mail", "email", "sent", "attachment", "recipient", "data", "signature", "disclose")
    phrase_3 <- c("it", "issue", "problem", "server", "network")
    
    search_corpus <- list(
      phrase_1 = phrase_1,
      phrase_2 = phrase_2,
      phrase_3 = phrase_3
    ) %>%
      enframe(value = "word") %>%
      unnest(word)
    
    
    df <- tibble(
      id = seq(3),
      Comments = c("Lorem ipsum covid19",
         "Lorem ipsum it issue problem", "Lorem ipsum")
    )
    
    df %>%
      separate_rows(Comments, sep = " ") %>%
      inner_join(search_corpus, by = c("Comments" = "word"))
    #> # A tibble: 4 × 3
    #>      id Comments name    
    #>   <int> <chr>    <chr>   
    #> 1     1 covid19  phrase_1
    #> 2     2 it       phrase_3
    #> 3     2 issue    phrase_3
    #> 4     2 problem  phrase_3
    
    df %>%
      separate_rows(Comments, sep = " ") %>%
      inner_join(search_corpus, by = c("Comments" = "word")) %>%
      count(id, name)
    #> # A tibble: 2 × 3
    #>      id name         n
    #>   <int> <chr>    <int>
    #> 1     1 phrase_1     1
    #> 2     2 phrase_3     3
    

    Created on 2022-03-10 by the reprex package (v2.0.0)
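The counts above stop short of the two columns the question asks for. A possible way to finish, sketched under a few assumptions: the column names `hits`, `n_phrases`, and `phrases` are hypothetical, a word is counted once per comment, and the threshold of 3 comes from the question.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Same lookup table as in the answer: one row per (phrase name, word).
search_corpus <- list(
  phrase_1 = c("covid19", "coronavirus", "ill", "illness", "pandemic"),
  phrase_2 = c("e-mail", "email", "sent", "attachment", "recipient",
               "data", "signature", "disclose"),
  phrase_3 = c("it", "issue", "problem", "server", "network")
) %>%
  enframe(value = "word") %>%
  unnest(word)

df <- tibble(
  id = seq(3),
  Comments = c("Lorem ipsum covid19",
               "Lorem ipsum it issue problem", "Lorem ipsum")
)

phrase_summary <- df %>%
  separate_rows(Comments, sep = " ") %>%
  distinct(id, Comments) %>%                 # count each word once per comment
  inner_join(search_corpus, by = c("Comments" = "word")) %>%
  count(id, name, name = "hits") %>%         # matched words per comment and phrase
  filter(hits >= 3) %>%                      # threshold from the question
  group_by(id) %>%
  summarise(
    n_phrases = n(),
    phrases = paste(name, collapse = ", "),
    .groups = "drop"
  )

# Join back so comments without any qualifying phrase are kept with a count of 0.
result <- df %>%
  left_join(phrase_summary, by = "id") %>%
  mutate(n_phrases = coalesce(n_phrases, 0L))
```

With the example data, only the second comment reaches the threshold (for phrase_3), so the other rows end up with a count of 0 and a missing `phrases` value.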
