Changing cell values in a row conditioned on values of another row

Question

I'm trying to modify a data frame with several thousand rows, each of which looks like one of the following variants:

<table>
  <tr>
    <th> a </th>
    <th> b </th>
    <th> c </th>
  </tr>
  <tr>
    <td>  x and  y </td>
    <td> NA </td>
    <td> NA </td>
  </tr>
  <tr>
    <td>  a;  b </td>
    <td> NA </td>
    <td> NA </td>
  </tr>
  <tr>
    <td>  j </td>
    <td> NA </td>
    <td> NA </td>
  </tr>
</table>

and change it to:

    <table>
      <tr>
        <th> a </th>
        <th> b </th>
        <th> c </th>
      </tr>
      <tr>
        <td>  x and  y </td>
        <td>  x </td>
        <td>  y </td>
      </tr>
      <tr>
        <td>  a;  b </td>
        <td>  a </td>
        <td>  b </td>
      </tr>
      <tr>
        <td>  j </td>
        <td>  j </td>
        <td> NA </td>
      </tr>
    </table>

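Here is my current code for doing this (I count the dollar signs, since that is the only consistent way to determine the number of transactions):

(The data is stored as a data.table, in case that matters.)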

  df$b[(str_count(df$a, pattern = "\\$") == 2)] = unlist(strsplit(df$a, " and "))[1]
  df$c[(str_count(df$a, pattern = "\\$") == 2)] = unlist(strsplit(df$a, " and "))[2]
  df$b[str_count(df$a, pattern = "\\$") < 2] = df$a

Right now, instead of the expected result, I get:

<table>
  <tr>
    <th> a </th>
    <th> b </th>
    <th> c </th>
  </tr>
  <tr>
    <td>  x and  y </td>
    <td>  x </td>
    <td>  y </td>
  </tr>
  <tr>
    <td>  a;  b </td>
    <td>  x</td>
    <td>  y</td>
  </tr>
  <tr>
    <td>  j </td>
    <td>  j </td>
    <td> NA </td>
  </tr>
</table>

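Does anyone know how to fix this? I think it has to do with strsplit() taking the first row of the subset and applying it to every row in the subset, but I don't know how to change it so that it works properly.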
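The suspicion above can be checked with a minimal base R sketch. The sample values are hypothetical stand-ins for the real column, and the variable names are illustrative only. It shows that `unlist(strsplit(...))[1]` yields a single scalar, which R then recycles across every selected row, whereas extracting the first piece of each list element keeps the result row-wise.

```r
# Hypothetical sample data; the values are assumptions for illustration.
a <- c("x and y", "a; b", "j")

parts <- strsplit(a, " and ")   # a list with one character vector per row
flat  <- unlist(parts)          # flattened to: "x" "y" "a; b" "j"

# Indexing the flattened vector returns one scalar, which R recycles
# across every row selected on the left-hand side of an assignment.
first_only <- flat[1]

# Taking the first piece of each list element keeps one value per row.
first_each <- vapply(parts, function(p) p[1], character(1))
```

Here `first_only` is just the first piece of the first row, which is consistent with every matching row receiving the same value.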

Answer 1

Don't try to write code to parse HTML; just call an HTML parser:

library(rvest)
library(tidyverse)

stage1 <- 
  "<table>
  <tr>
    <th> a </th>
    <th> b </th>
    <th> c </th>
  </tr>
  <tr>
    <td>  x and  y </td>
    <td> NA </td>
    <td> NA </td>
  </tr>
  <tr>
    <td>  a;  b </td>
    <td> NA </td>
    <td> NA </td>
  </tr>
  <tr>
    <td>  j </td>
    <td> NA </td>
    <td> NA </td>
  </tr>
</table>" %>% 
  rvest::minimal_html() %>% 
  rvest::html_node("table") %>% 
  rvest::html_table() %>% 
  as_tibble()

stage1

# A tibble: 3 x 3
  a           b     c    
  <chr>       <lgl> <lgl>
1  x and  y   NA    NA   
2  a;  b      NA    NA   
3  j          NA    NA   

Now clean up stage1 using separate and a regular expression:

stage1 %>% 
  select(a) %>% 
  separate(col = "a", into = c("b", "c"), 
           sep = "(?ix) \\s* (and|;) \\s*",   # Perl-style regex: case-insensitive (i), whitespace in the pattern ignored (x).
           remove = FALSE, 
           fill = "right")


  a           b     c    
  <chr>       <chr> <chr>
1  x and  y    x     y   
2  a;  b       a     b   
3  j           j    NA   

Answer 2

You can use str_split_fixed from stringr:

stringr::str_split_fixed(df$a, '\\s*(;|and)\\s*', 2)

#      [,1] [,2]
# [1,] " x" " y"
# [2,] " a" " b"
# [3,] " j" ""
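To get from that matrix back to the layout the question asks for, the two columns still have to be written into b and c, and the empty padding turned into NA. A minimal sketch, assuming a plain data.frame with hypothetical sample values (a data.table user may prefer `:=` for the assignment):

```r
library(stringr)

# Hypothetical sample data mirroring the question's layout.
df <- data.frame(a = c("x and y", "a; b", "j"),
                 b = NA_character_,
                 c = NA_character_,
                 stringsAsFactors = FALSE)

# Split each value of `a` into at most two pieces, dropping the
# whitespace around the separator.
pieces <- str_split_fixed(df$a, "\\s*(;|and)\\s*", 2)

# str_split_fixed pads missing pieces with "", so convert those to NA.
pieces[pieces == ""] <- NA_character_

df$b <- pieces[, 1]
df$c <- pieces[, 2]
```

Note that this pattern also matches "and" inside longer words, so if the real values can contain such words, a stricter separator such as `"\\s+and\\s+|\\s*;\\s*"` may be safer.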

Tags: r, strsplit, data.table