Remove paired rows where lag == 0 and calculate % change using dplyr and chaining

Question

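I'm working with a very large tibble, and I want to calculate the percent growth of these tables over time (first entry to last entry, not max to min). Eventually I'd also like to move any tables with zero change into their own list/tibble, and remove them from the original output table.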

An example of the dataset looks like this:

  date    tbl_name    row_cnt
2/12/2019  first       247
6/5/2019   first       247
4/24/2019  second    3617138
6/5/2019   second    3680095
3/1/2019   third    62700321
6/5/2019   third    63509189
4/24/2019  fourth       2
6/5/2019   fourth       2
...          ...       ...

The expected output would be two tables, which would look like this:

tbl_name   pct_change
second       1.74
third        1.29
...          ...


tbl_name
 first
 fourth
  ...

So far, I've been able to arrange the observations, group them, and successfully filter the first and last instance of each group:

test_df <- df %>% 
  arrange(l.fully_qualf_tbl_nm) %>% 
  group_by(l.fully_qualf_tbl_nm) %>%
  filter(row_number()==1 | row_number()==n()) %>%
  mutate(pct_change = ((l.row_cnt/lag(l.row_cnt) - 1) * 100)) %>% 
  select(l.run_dt, l.fully_qualf_tbl_nm, pct_change) %>% 
  drop_na(pct_change)

However, my calculation

mutate(pct_change = ((l.row_cnt/lag(l.row_cnt) - 1) * 100)) %>%

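does not produce the correct results. I took my pct-change calculation from another SO post discussing %-change, but I'm getting different numbers from my hand calculations.

For example, I get "second = 3.61", but by hand (and in Excel) I get 1.74. I also get "third = 0.831" instead of 1.29. My guess is that I'm not correctly specifying that I only want the calculation done within each group (each pair of two rows). Should I be calculating the lag separately, or am I just using lag() incorrectly?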

Next, I think the new table would be created with something like

if return value of filter(row_number()==1 | row_number()==n()) %>% == 0, append to list/table

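But honestly, I have no idea how to do that. I'm wondering whether I should just write a separate function and assign its result to a new variable.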

Answer 1

df <- read.table(
  header = TRUE, 
  stringsAsFactors = FALSE,
  text = " date    tbl_name    row_cnt
2/12/2019  first       247
6/5/2019   first       247
4/24/2019  second    3617138
6/5/2019   second    3680095
3/1/2019   third    62700321
6/5/2019   third    63509189
4/24/2019  fourth       2
6/5/2019   fourth       2")

# Wrapping in parentheses assigns the output to test_df and also prints it
(test_df <- df %>% 
    group_by(tbl_name) %>%
    mutate(pct_change = ((row_cnt/lag(row_cnt) - 1) * 100)) %>% 
    ungroup() %>%
    filter(!is.na(pct_change)) %>%  # Filter after pct_change calc, since we want to 
                                    # include change from 1:2  and from n-1:n
    select(tbl_name, row_cnt, pct_change))

# A tibble: 4 x 3
  tbl_name  row_cnt pct_change
  <chr>       <int>      <dbl>
1 first         247       0   
2 second    3680095       1.74
3 third    63509189       1.29
4 fourth          2       0
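
As a quick check on the grouping question, here is a minimal sketch of how lag() behaves with and without group_by(). It uses only the "second" and "third" rows from the sample data; the column name prev is just for illustration. It is not meant to reproduce the 3.61 figure from the question, only to show what the grouped calculation should give.

```r
library(dplyr)

# Two tables from the sample data, two rows each
df_small <- tibble(
  tbl_name = c("second", "second", "third", "third"),
  row_cnt  = c(3617138, 3680095, 62700321, 63509189)
)

# Without grouping, lag() reaches across table boundaries:
# the first "third" row gets the last "second" value as its predecessor
ungrouped <- df_small %>%
  mutate(prev = lag(row_cnt))

# With grouping, lag() restarts in each table,
# so the first row of every table gets NA
grouped <- df_small %>%
  group_by(tbl_name) %>%
  mutate(prev = lag(row_cnt),
         pct_change = (row_cnt / prev - 1) * 100) %>%
  ungroup()

print(ungrouped)
print(grouped)
```

In the grouped result the percent changes work out to (3680095 / 3617138 - 1) * 100, which is about 1.74, and (63509189 / 62700321 - 1) * 100, which is about 1.29, matching the hand calculation.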

To split it into two tables, something like this should work:

first_tbl <- test_df %>% filter(pct_change != 0) # or "pct_change > 0" for pos growth
second_tbl <- test_df %>% filter(pct_change == 0)
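
If some tables have more than two rows, the lag() approach gives one change per consecutive pair rather than a single first-to-last figure. Below is a rough sketch of an alternative that summarises each table with first() and last() and then builds the two tables in the shape asked for. It assumes the dates are in month/day/year format, and the names first_to_last, changed and unchanged are made up for this example.

```r
library(dplyr)

df <- read.table(
  header = TRUE,
  stringsAsFactors = FALSE,
  text = " date    tbl_name    row_cnt
2/12/2019  first       247
6/5/2019   first       247
4/24/2019  second    3617138
6/5/2019   second    3680095
3/1/2019   third    62700321
6/5/2019   third    63509189
4/24/2019  fourth       2
6/5/2019   fourth       2")

# One row per table: percent change from the earliest to the latest entry
first_to_last <- df %>%
  mutate(date = as.Date(date, format = "%m/%d/%Y")) %>%  # assumed m/d/Y format
  arrange(tbl_name, date) %>%                            # sort so first()/last() follow time
  group_by(tbl_name) %>%
  summarise(pct_change = (last(row_cnt) / first(row_cnt) - 1) * 100)

# Tables whose row count changed, with the percentage rounded for display
changed <- first_to_last %>%
  filter(pct_change != 0) %>%
  mutate(pct_change = round(pct_change, 2))

# Tables with no change: keep only the names
unchanged <- first_to_last %>%
  filter(pct_change == 0) %>%
  select(tbl_name)

print(changed)
print(unchanged)
```

Note that summarise() returns the groups sorted by tbl_name, so the row order differs from the original data.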

r

pipe

dplyr

tibble