Columns of unequal length with headers buried within the columns
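I have some data in which every second column corresponds to a specific time. Each time period has a 'buy' and a 'sell' position, and each position has two factors (see below). However, the columns are of unequal length, so the 'sell' marker starts on a different row in each column (buried among the values).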
time, time1, time, time2, time, time3
buy, , buy, , buy,
factor1, 1, factor1, 2, factor1, 3
factor2, 4, factor2, 5, factor2, 6
factor1, 7, factor1, 8, factor1, 9
factor2, 10, factor2, 11, factor2, 12
factor1, 13, sell, , factor1, 14
factor2, 15, factor1, 16, factor2, 17
sell, , factor2, 18, factor1, 19
factor1, 20, , , factor2, 21,
factor2, 22, , , sell,
, , , , factor1, 23
, , , , factor2, 24
, , , , factor1, 25
, , , , factor2, 26
Ultimately, I would like the table to be structured as follows.
time, position, factor, value
time1, buy, factor1, 1
time1, buy, factor2, 4
time1, buy, factor1, 7
time1, buy, factor2, 10
time1, buy, factor1, 13
time1, buy, factor2, 15
time1, sell, factor1, 20
time1, sell, factor2, 22
time2, buy, factor1, 2
time2, buy, factor2, 5
time2, buy, factor1, 8
time2, buy, factor2, 11
time2, sell, factor1, 16
time2, sell, factor2, 18
time3, buy, factor1, 3
time3, buy, factor2, 6
time3, buy, factor1, 9
time3, buy, factor2, 12
time3, buy, factor1, 14
time3, buy, factor2, 17
time3, buy, factor1, 19
time3, buy, factor2, 21
time3, sell, factor1, 23
time3, sell, factor2, 24
time3, sell, factor1, 25
time3, sell, factor2, 26
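I was able to extract the indices and then build separate 'buy' and 'sell' lists in R. But I'm not sure this is the simplest approach (I have many such files and would prefer a fast, automated method). I'm also open to doing the conversion in Python rather than R.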
# For each column find the index of buy, sell (and the corresponding empty cell)
idx = apply(data, 2, function(x) which(x %in% c("buy","sell",""))[1:3] )
# NA indicates that the empty cell is the last
idx[is.na(idx)] = nrow(data)
i = 0
buy = list( apply(idx, 2, function(x) {
  i <<- i+1
  data[seq(x[1]+1,x[2]),i]
}) )
i = 0
sell = list( apply(idx, 2, function(x) {
  i <<- i+1
  data[seq(x[2]+1,x[3]),i]
}) )
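I decided to first stack the 3 pairs of columns into a single long-format data set. I then fill the position column by carrying the last known value forward (tidyr::fill) and filter out the junk rows by keeping only the rows that have a numeric value.
Here is a working example: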
library(dplyr)
library(tidyr)
str <- "
time, time1, time, time2, time, time3
buy, , buy, , buy,
factor1, 1, factor1, 2, factor1, 3
factor2, 4, factor2, 5, factor2, 6
factor1, 7, factor1, 8, factor1, 9
factor2, 10, factor2, 11, factor2, 12
factor1, 13, sell, , factor1, 14
factor2, 15, factor1, 16, factor2, 17
sell, , factor2, 18, factor1, 19
factor1, 20, , , factor2, 21
factor2, 22, , , sell,
, , , , factor1, 23
, , , , factor2, 24
, , , , factor1, 25
, , , , factor2, 26
"
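strfile <- textConnection(str)
raw <- read.table(strfile, header = FALSE, sep = ",", stringsAsFactors = FALSE)
# Stack the three (label, value) column pairs into one long data frame
dt <- do.call(rbind, lapply(1:3, function(x) {
  p <- raw[, c(x * 2 - 1, x * 2)]
  names(p) <- c('factor', 'value')
  p$time <- x  # index of the column pair (1, 2, 3)
  p
}))
dt %>%
  mutate(factor = trimws(factor),
         # marker rows ('buy'/'sell') set the position; all other rows get NA
         position = if_else(factor %in% c('buy', 'sell'), factor, NA_character_),
         # non-numeric cells (headers, blanks) become NA, with a coercion warning
         value = as.numeric(value)) %>%
  fill(position) %>%      # carry the last seen position forward
  filter(!is.na(value))   # drop header, marker and empty rows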
Result:
factor value time position
1 factor1 1 1 buy
2 factor2 4 1 buy
3 factor1 7 1 buy
4 factor2 10 1 buy
5 factor1 13 1 buy
6 factor2 15 1 buy
7 factor1 20 1 sell
8 factor2 22 1 sell
9 factor1 2 2 buy
10 factor2 5 2 buy
11 factor1 8 2 buy
12 factor2 11 2 buy
13 factor1 16 2 sell
14 factor2 18 2 sell
15 factor1 3 3 buy
16 factor2 6 3 buy
17 factor1 9 3 buy
18 factor2 12 3 buy
19 factor1 14 3 buy
20 factor2 17 3 buy
21 factor1 19 3 buy
22 factor2 21 3 buy
23 factor1 23 3 sell
24 factor2 24 3 sell
25 factor1 25 3 sell
26 factor2 26 3 sell
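Since the question also allows Python, the same carry-forward idea can be sketched with only the standard library. This is a minimal sketch rather than a definitive implementation: the names `RAW` and `to_long` are made up for illustration, the sample text is copied from the question, and it assumes that values are integers and that the only position markers are 'buy' and 'sell'.

```python
import csv
import io

# Sample data copied from the question (hypothetical variable name).
RAW = """\
time, time1, time, time2, time, time3
buy, , buy, , buy,
factor1, 1, factor1, 2, factor1, 3
factor2, 4, factor2, 5, factor2, 6
factor1, 7, factor1, 8, factor1, 9
factor2, 10, factor2, 11, factor2, 12
factor1, 13, sell, , factor1, 14
factor2, 15, factor1, 16, factor2, 17
sell, , factor2, 18, factor1, 19
factor1, 20, , , factor2, 21,
factor2, 22, , , sell,
, , , , factor1, 23
, , , , factor2, 24
, , , , factor1, 25
, , , , factor2, 26
"""


def to_long(text):
    """Reshape side-by-side (label, value) column pairs into long rows.

    Returns a list of (time, position, factor, value) tuples.
    """
    # Parse the text and strip the padding around every cell.
    rows = [[cell.strip() for cell in row]
            for row in csv.reader(io.StringIO(text)) if row]
    n_pairs = len(rows[0]) // 2  # the header row defines the number of pairs

    out = []
    for p in range(n_pairs):
        time = None      # taken from the 'time' row of this pair
        position = None  # last seen 'buy'/'sell' marker, carried forward
        for row in rows:
            # Short rows are treated as if padded with empty cells.
            label = row[2 * p] if 2 * p < len(row) else ""
            value = row[2 * p + 1] if 2 * p + 1 < len(row) else ""
            if label == "time":
                time = value
            elif label in ("buy", "sell"):
                position = label
            elif label and value:
                # Assumption: every value is an integer.
                out.append((time, position, label, int(value)))
    return out


result = to_long(RAW)
print("time, position, factor, value")
for record in result:
    print(", ".join(str(field) for field in record))
```

Because each column pair is scanned independently, the differing row at which 'sell' starts does not matter. For many files, the same function could be applied to the text of each file in turn and the resulting lists concatenated.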