Fill in word that letter is located in

Question

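I am working with keystroke data and need to find the word that each keystroke belongs to. Because there can be invisible keystrokes (such as Shift) and deleted keystrokes, this is not a trivial problem where I can just iterate over the index of the keystrokes and look up the word. Instead, I need to find the space-delimited word that each keystroke produced. I do have the final text and the existing text available, and I should be able to leverage them. I have tried solutions using fill(), lag(), and cumsum(), but none of them worked.

I have a data frame like the one below, which I group by experiment_id: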

x <- tibble(
  experiment_id = rep(c('1a','1b'),each=12),
  keystroke = rep(c('a','SPACE','SHIFT','b','e','DELETE','a','d','SPACE','m','a','n'),2),
  existing_text = rep(c('a','a ','a ','a B','a Be','a B','a Ba','a Bad','a Bad ',
                    'a Bad m','a Bad ma','a Bad man'),2),
  final_text = 'a Bad man'
)

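The additional column should look like the following, where SPACE belongs to the word it follows (the preceding word), and DELETE and the deleted keystrokes are part of the final word: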

within_word = c('a','a','BeDELETEad','BeDELETEad','BeDELETEad','BeDELETEad','BeDELETEad','BeDELETEad','BeDELETEad','man','man','man')

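Is there a way to derive this?

Edit for further help: In the comments below an answer, @Onyambu mentioned that there is a simpler solution using the keystroke column. I have found that existing_text is not always reliable in my larger, more complex data. I would strongly prefer a solution that relies mainly on keystroke. I have also added a complication in the form of deletions.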

Answer 1

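library(tidyverse)
# Note: the output below comes from the original 10-row example (no experiment_id, no DELETE).
x %>%
  # ww: the text each keystroke added (existing_text minus the previous existing_text)
  mutate(ww = str_remove(existing_text, fixed(lag(existing_text, default = ".")))) %>%
  # start a new group at every space and again right after it
  group_by(grp = cumsum(ww == ' ' | lag(ww == ' ', default = FALSE))) %>%
  mutate(within_word = str_c(ww, collapse = ''),
         within_word = na_if(within_word, ' '))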

# A tibble: 10 x 6
# Groups:   grp [5]
   keystroke existing_text final_text ww      grp within_word
   <chr>     <chr>         <chr>      <chr> <int> <chr>      
 1 a         "a"           a Bad man  "a"       0 a          
 2 SPACE     "a "          a Bad man  " "       1 NA         
 3 SHIFT     "a "          a Bad man  ""        2 Bad        
 4 b         "a B"         a Bad man  "B"       2 Bad        
 5 a         "a Ba"        a Bad man  "a"       2 Bad        
 6 d         "a Bad"       a Bad man  "d"       2 Bad        
 7 SPACE     "a Bad "      a Bad man  " "       3 NA         
 8 m         "a Bad m"     a Bad man  "m"       4 man        
 9 a         "a Bad ma"    a Bad man  "a"       4 man        
10 n         "a Bad man"   a Bad man  "n"       4 man

Answer 2

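Here are two approaches:

The first uses only the information in existing_text for the grouping, and constructs the within_word column from this grouping and keystroke.

The second approach uses only the information in keystroke.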

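First approach: grouping based on existing_text and content based on keystroke:

We take three steps:

First, we compute the grouping with strsplit, where we look for a space \s that is preceded by a word character \w. We need to correct the values for "SHIFT", since they should count toward the word after "SPACE".

Second, we replace "SHIFT" (and all other similar function keys that the sample data does not contain) with "".

Third, we collapse the strings with paste0(..., collapse = "").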
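To make the first step concrete, here is a minimal base-R sketch of how the split counts words. The vector existing_text below is a hypothetical sample taken from a few rows of the question's data, not the full column.

```r
# A few states of the text, as in the question's existing_text column
existing_text <- c("a", "a ", "a B", "a Bad ", "a Bad m")

# Split on a whitespace character that is preceded by a word character.
# strsplit() drops a trailing empty piece, so "a " still counts as one word.
word_counts <- lengths(strsplit(existing_text, "(?<=\\w)\\s", perl = TRUE))

print(word_counts)
```

A trailing space therefore does not start a new group, which is why the SPACE keystroke ends up in the same group as the word before it.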

library(tidyverse)

x %>%

  # step 1: construct grouping (backslashes must be doubled inside an R string)
  mutate(word_grp = lengths(strsplit(existing_text, "(?<=\\w)\\s", perl = TRUE)) %>% 
           if_else(keystroke == "SHIFT", lead(., default = last(.)), .)) %>%
  group_by(experiment_id, word_grp) %>% 

  # step 2 & 3: first replace keys like "SHIFT" with "", then collapse with `paste0`
  mutate(within_word = str_replace_all(keystroke, c("SHIFT" = "", "SPACE" = "")) %>% 
           paste0(., collapse = ""))

#> # A tibble: 24 x 6
#> # Groups:   experiment_id, word_grp [6]
#>    experiment_id keystroke existing_text final_text word_grp within_word
#>    <chr>         <chr>     <chr>         <chr>         <int> <chr>      
#>  1 1a            a         "a"           a Bad man         1 a          
#>  2 1a            SPACE     "a "          a Bad man         1 a          
#>  3 1a            SHIFT     "a "          a Bad man         2 beDELETEad 
#>  4 1a            b         "a B"         a Bad man         2 beDELETEad 
#>  5 1a            e         "a Be"        a Bad man         2 beDELETEad 
#>  6 1a            DELETE    "a B"         a Bad man         2 beDELETEad 
#>  7 1a            a         "a Ba"        a Bad man         2 beDELETEad 
#>  8 1a            d         "a Bad"       a Bad man         2 beDELETEad 
#>  9 1a            SPACE     "a Bad "      a Bad man         2 beDELETEad 
#> 10 1a            m         "a Bad m"     a Bad man         3 man        
#> # … with 14 more rows

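Second approach: based only on the information in keystroke.

Here is an approach that uses only the information in keystroke. However, if we only want to use the data in keystroke, things get more laborious.

Here is a short explanation of the steps below: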

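Step 1a: data cleaning
We need to clean the data in keystroke so that it can be used for the new column within_word. This means two things: (a) we need to replace every keystroke that should not be printed in within_word with "". Before that, (b) we need to change the keystroke that follows such a key according to the key's function. In the case of SHIFT, this means we need to apply toupper to the keystroke that comes right after it. For your sample data this is quite simple, since SHIFT is the only key we need to handle. In your real data, however, there may be many other similar keys, such as ALT or ^, so we would need to repeat step 1a for each of them. Ideally, we would come up with a function that takes the name of the key and the function it applies to the following keystroke. Note that we do not include "SPACE" in this step yet, because we need it in step 2.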
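One way such a helper could look is sketched below in base R. The name apply_modifier and its arguments are hypothetical and not part of the answer's code; the sketch assumes a modifier key only affects the single keystroke directly after it.

```r
# Hypothetical helper: apply `fn` to every keystroke that directly follows
# the modifier `key`, then blank out the modifier itself.
apply_modifier <- function(keys, key, fn) {
  # TRUE where the previous keystroke is the modifier
  after_key <- c(FALSE, head(keys == key, -1))
  keys[after_key] <- fn(keys[after_key])
  # The modifier should not be printed in the result
  keys[keys == key] <- ""
  keys
}

keys <- c("a", "SPACE", "SHIFT", "b", "e", "DELETE", "a", "d")
cleaned <- apply_modifier(keys, "SHIFT", toupper)
print(cleaned)
```

Calling it once per modifier key (for example with a different fn for ALT) would stand in for repeating step 1a by hand.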

To see how many keys you need to handle in your real data, we can filter for those keystrokes that do not change existing_text. In your sample data this is only SHIFT:

# get all keystrokes that don't change the existing_text directly
x %>% 
  select(keystroke, existing_text) %>% 
  filter(existing_text == lag(existing_text, default = ""))

#> # A tibble: 2 x 2
#>   keystroke existing_text
#>   <chr>     <chr>        
#> 1 SHIFT     "a "         
#> 2 SHIFT     "a "

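Step 2: create groups
We need to create the word grouping for within_word. This is the most complex step. Below we first look for rows where within_word == "SPACE" and the subsequent row is != "SPACE". We apply data.table::rleid to the result to get a run-length id for this variable. Finally, we need to subtract 1 for those rows where within_word == "SPACE".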
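The grouping logic can be traced on a single experiment with base R alone. This is only a sketch: it swaps data.table::rleid for an equivalent cumsum over value changes, and the vector keys is assumed to be the already cleaned keystrokes from step 1a.

```r
# Cleaned keystrokes of one experiment (SHIFT already replaced by "")
keys <- c("a", "SPACE", "", "B", "e", "DELETE", "a", "d", "SPACE", "m", "a", "n")

# Next keystroke; the last row falls back to the first one, as in the answer
nxt <- c(keys[-1], keys[1])

# TRUE on a SPACE that is followed by something other than SPACE
is_boundary <- keys == "SPACE" & nxt != "SPACE"

# Run-length id: increase the counter whenever the flag changes value
run_id <- cumsum(c(TRUE, is_boundary[-1] != is_boundary[-length(is_boundary)]))

# A SPACE belongs to the word before it, so move it back one group
word_grp <- ifelse(keys == "SPACE", run_id - 1L, run_id)
print(word_grp)
```

The group ids are not consecutive (1, 3, 5), which does not matter because they are only used as grouping keys.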

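Step 3: data prep before the final step
This is basically the same as step 1a: we need to replace "SPACE" with "" because we do not want it to appear in our result. However, since we needed this column in step 2, we have to finish the data cleaning in this step.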

Step 4: collapse the strings in within_word
Finally, we group by experiment_id and word_grp and collapse the strings in within_word with paste0(..., collapse = "").
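As a small illustration of the collapse, the sketch below uses base R's ave() in place of the grouped mutate. The two vectors are hypothetical values for the first five rows of one experiment after steps 1 to 3.

```r
# Group ids and cleaned keystrokes for the first five rows
word_grp <- c(1L, 1L, 3L, 3L, 3L)
cleaned  <- c("a", "", "", "B", "e")

# Paste all pieces of a group together and repeat the result on every row
within_word <- ave(cleaned, word_grp,
                   FUN = function(s) paste0(s, collapse = ""))
print(within_word)
```

Every row of a group receives the full collapsed string, which is the shape the question asks for.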

library(tidyverse)

x %>%

  # step 1a: data cleaning (upper-case the keystroke right after SHIFT, then drop SHIFT)
  mutate(within_word = if_else(lag(keystroke, default = first(keystroke)) == "SHIFT",
                               toupper(keystroke),
                               keystroke) %>%
                          str_replace_all(., c("SHIFT" = ""))) %>%  
 
  # step 1b to 1n: repeat step 1a for other keys like ALT, ^ etc. 

  # step 2: create groups
  group_by(experiment_id) %>% 
  mutate(word_grp = data.table::rleid(
      within_word == "SPACE" & lead(within_word, default = first(keystroke)) != "SPACE"
    ) %>% if_else(within_word == "SPACE", . - 1L, .)) %>% 

  # step 3: data prep before final step
  ungroup() %>% 
  mutate(within_word = str_replace(within_word, "SPACE", "")) %>%
 
  # step 4: collapse
  group_by(experiment_id, word_grp) %>% 
  mutate(within_word = paste0(within_word, collapse = ""))

#> # A tibble: 24 x 6
#> # Groups:   experiment_id, word_grp [6]
#>    experiment_id keystroke existing_text final_text within_word word_grp
#>    <chr>         <chr>     <chr>         <chr>      <chr>          <int>
#>  1 1a            a         "a"           a Bad man  a                  1
#>  2 1a            SPACE     "a "          a Bad man  a                  1
#>  3 1a            SHIFT     "a "          a Bad man  BeDELETEad         3
#>  4 1a            b         "a B"         a Bad man  BeDELETEad         3
#>  5 1a            e         "a Be"        a Bad man  BeDELETEad         3
#>  6 1a            DELETE    "a B"         a Bad man  BeDELETEad         3
#>  7 1a            a         "a Ba"        a Bad man  BeDELETEad         3
#>  8 1a            d         "a Bad"       a Bad man  BeDELETEad         3
#>  9 1a            SPACE     "a Bad "      a Bad man  BeDELETEad         3
#> 10 1a            m         "a Bad m"     a Bad man  man                5
#> # … with 14 more rows

Created on 2021-12-23 by the reprex package (v0.3.0)

nlp

r

tidyverse