传播一个 data.frame 具有重复的列
Spread a data.frame with repetitive column
我有一个很大的 data.frame 正试图传播。玩具示例如下所示。
data = data.frame(date = rep(c("2019", "2020"), 2), ticker = c("SPY", "SPY", "MSFT", "MSFT"), value = c(1, 2, 3, 4))
head(data)
date ticker value
1 2019 SPY 1
2 2020 SPY 2
3 2019 MSFT 3
4 2020 MSFT 4
我想传播它,所以 data.frame 看起来像这样。
spread(data, key = ticker, value = value)
date MSFT SPY
1 2019 3 1
2 2020 4 2
但是,当我在实际 data.frame 上执行此操作时,出现错误。
Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 18204 rows:
* 30341, 166871
* 30342, 166872
* 30343, 166873
* 30344, 166874
* 30345, 166875
* 30346, 166876
* 30347, 166877
* 30348, 166878
* 30349, 166879
* 30350, 166880
* 30351, 166881
* 30352, 166882
下面是我的data.frame
的头尾
head(df)
ref.date ticker weeklyReturn
<date> <chr> <dbl>
1 2008-02-01 SPY NA
2 2008-02-04 SPY NA
3 2008-02-05 SPY NA
4 2008-02-06 SPY NA
5 2008-02-07 SPY NA
6 2008-02-08 SPY -0.0478
tail(df)
ref.date ticker weeklyReturn
<date> <chr> <dbl>
1 2020-02-12 MDYV 0.00293
2 2020-02-13 MDYV 0.00917
3 2020-02-14 MDYV 0.0179
4 2020-02-18 MDYV 0.0107
5 2020-02-19 MDYV 0.00422
6 2020-02-20 MDYV 0.00347
如评论中所述,对于相同的日期标记组合,您有多个值。您需要定义如何处理它。
这里有一个代表:
library(tidyr)
library(dplyr)
# your data is more like:
data = data.frame(
date = c(2019, rep(c("2019", "2020"), 2)),
ticker = c("SPY", "SPY", "SPY", "MSFT", "MSFT"),
value = c(8, 1, 2, 3, 4))
# With two values for same date-ticker combination
data
#> date ticker value
#> 1 2019 SPY 8
#> 2 2019 SPY 1
#> 3 2020 SPY 2
#> 4 2019 MSFT 3
#> 5 2020 MSFT 4
# Results in error
data %>%
spread(ticker, value)
#> Error: Each row of output must be identified by a unique combination of keys.
#> Keys are shared for 2 rows:
#> * 1, 2
# New pivot_wider() Creates list-columns for duplicates
data %>%
pivot_wider(names_from = ticker, values_from = value,)
#> Warning: Values in `value` are not uniquely identified; output will contain list-cols.
#> * Use `values_fn = list(value = list)` to suppress this warning.
#> * Use `values_fn = list(value = length)` to identify where the duplicates arise
#> * Use `values_fn = list(value = summary_fun)` to summarise duplicates
#> # A tibble: 2 x 3
#> date SPY MSFT
#> <fct> <list> <list>
#> 1 2019 <dbl [2]> <dbl [1]>
#> 2 2020 <dbl [1]> <dbl [1]>
# Otherwise, decide yourself how to summarise duplicates with mean() for instance
data %>%
group_by(date, ticker) %>%
summarise(value = mean(value, na.rm = TRUE)) %>%
spread(ticker, value)
#> # A tibble: 2 x 3
#> # Groups: date [2]
#> date MSFT SPY
#> <fct> <dbl> <dbl>
#> 1 2019 3 4.5
#> 2 2020 4 2
由 reprex 包 (v0.3.0) 创建于 2020-02-22
您可以使用 dplyr
和 tidyr
包。要消除该错误,您必须首先对每个组的值求和。
data %>%
group_by(date, ticker) %>%
summarise(value = sum(value)) %>%
pivot_wider(names_from = ticker, values_from = value)
# date MSFT SPY
# <fct> <dbl> <dbl>
# 1 2019 3 1
# 2 2020 4 2
我有一个很大的 data.frame 正试图传播。玩具示例如下所示。
data = data.frame(date = rep(c("2019", "2020"), 2), ticker = c("SPY", "SPY", "MSFT", "MSFT"), value = c(1, 2, 3, 4))
head(data)
date ticker value
1 2019 SPY 1
2 2020 SPY 2
3 2019 MSFT 3
4 2020 MSFT 4
我想传播它,所以 data.frame 看起来像这样。
spread(data, key = ticker, value = value)
date MSFT SPY
1 2019 3 1
2 2020 4 2
但是,当我在实际 data.frame 上执行此操作时,出现错误。
Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 18204 rows:
* 30341, 166871
* 30342, 166872
* 30343, 166873
* 30344, 166874
* 30345, 166875
* 30346, 166876
* 30347, 166877
* 30348, 166878
* 30349, 166879
* 30350, 166880
* 30351, 166881
* 30352, 166882
下面是我的data.frame
的头尾head(df)
ref.date ticker weeklyReturn
<date> <chr> <dbl>
1 2008-02-01 SPY NA
2 2008-02-04 SPY NA
3 2008-02-05 SPY NA
4 2008-02-06 SPY NA
5 2008-02-07 SPY NA
6 2008-02-08 SPY -0.0478
tail(df)
ref.date ticker weeklyReturn
<date> <chr> <dbl>
1 2020-02-12 MDYV 0.00293
2 2020-02-13 MDYV 0.00917
3 2020-02-14 MDYV 0.0179
4 2020-02-18 MDYV 0.0107
5 2020-02-19 MDYV 0.00422
6 2020-02-20 MDYV 0.00347
如评论中所述,对于相同的日期标记组合,您有多个值。您需要定义如何处理它。
这里有一个代表:
library(tidyr)
library(dplyr)
# your data is more like:
data = data.frame(
date = c(2019, rep(c("2019", "2020"), 2)),
ticker = c("SPY", "SPY", "SPY", "MSFT", "MSFT"),
value = c(8, 1, 2, 3, 4))
# With two values for same date-ticker combination
data
#> date ticker value
#> 1 2019 SPY 8
#> 2 2019 SPY 1
#> 3 2020 SPY 2
#> 4 2019 MSFT 3
#> 5 2020 MSFT 4
# Results in error
data %>%
spread(ticker, value)
#> Error: Each row of output must be identified by a unique combination of keys.
#> Keys are shared for 2 rows:
#> * 1, 2
# New pivot_wider() Creates list-columns for duplicates
data %>%
pivot_wider(names_from = ticker, values_from = value,)
#> Warning: Values in `value` are not uniquely identified; output will contain list-cols.
#> * Use `values_fn = list(value = list)` to suppress this warning.
#> * Use `values_fn = list(value = length)` to identify where the duplicates arise
#> * Use `values_fn = list(value = summary_fun)` to summarise duplicates
#> # A tibble: 2 x 3
#> date SPY MSFT
#> <fct> <list> <list>
#> 1 2019 <dbl [2]> <dbl [1]>
#> 2 2020 <dbl [1]> <dbl [1]>
# Otherwise, decide yourself how to summarise duplicates with mean() for instance
data %>%
group_by(date, ticker) %>%
summarise(value = mean(value, na.rm = TRUE)) %>%
spread(ticker, value)
#> # A tibble: 2 x 3
#> # Groups: date [2]
#> date MSFT SPY
#> <fct> <dbl> <dbl>
#> 1 2019 3 4.5
#> 2 2020 4 2
由 reprex 包 (v0.3.0) 创建于 2020-02-22
您可以使用 dplyr
和 tidyr
包。要消除该错误,您必须首先对每个组的值求和。
data %>%
group_by(date, ticker) %>%
summarise(value = sum(value)) %>%
pivot_wider(names_from = ticker, values_from = value)
# date MSFT SPY
# <fct> <dbl> <dbl>
# 1 2019 3 1
# 2 2020 4 2