R: taking list objects into a table
I have a list that contains many objects. I want to create a table from the attributes of this list.
head(cascade.list)
$`444424960908754944`
screen_name tweet_id tweet_created_at retweet_screen_name
NerSeref 6.028628e+17 2015-05-25 11:44:24 Lasthowen
DURULMA_ZAMANI 6.028631e+17 2015-05-25 11:45:32 Lasthowen
ssari75 6.028647e+17 2015-05-25 11:52:10 Lasthowen
saintserif2009 6.028672e+17 2015-05-25 12:01:48 Lasthowen
Hejinilim 6.028721e+17 2015-05-25 12:21:13 Lasthowen
$`407136916317171712`
screen_name tweet_id tweet_created_at retweet_screen_name
isa_sakar 6.072663e+17 2015-06-06 15:22:18 cavurizmir
canfeda1923 6.072666e+17 2015-06-06 15:23:34 cavurizmir
Apolloniuss_58 6.072669e+17 2015-06-06 15:24:47 cavurizmir
I need to create a table that contains the following:
table
retweet_screen_name screen_name length life(seconds)
Lasthowen Hejinilim 5 2209
cavurizmir Apolloniuss_58 3 149
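- The first column is the name in retweet_screen_name (it is repeated, so one copy is enough).
- The second column is the last screen_name of the list object.
- The third column is the length (number of rows) of the list object.
- The fourth column is the time difference between the first and the last tweet_created_at of the list object.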
I used this function, which solved half of the problem:
get.summary <- function(i){
curr.frame = cascade.list[[i]]
return(c(unique(curr.frame$retweet_screen_name),curr.frame$screen_name[nrow(curr.frame)],
unique(curr.frame$retweet_created_at), curr.frame$tweet_created_at[nrow(curr.frame)],
nrow(curr.frame)))
}
and this code:
cdf=data.frame(t(sapply(1:length(cascade.list),get.summary)))
It creates a data frame in which all the variables end up in a single row:
V1 V2
c("EastanbulTimes", "onuryasercan", "2010-12-20 15:18:22", "2015-05-19 18:28:25", "1") c("Lasthowen", "Apolloniuss_58", "2013-12-01 08:19:39", "2015-06-06 15:24:47", "3")
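I need to fix the data frame structure: it should have 6 columns and as many rows as the list has elements. I also need to add the time variable.
Thanks in advance for any suggestions.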
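Because cascade.list is a list of data frames with the same columns, you can bind them together into one dataset and then perform the aggregation you need. An implementation with data.table: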
# make a list of the dataframes (see below for the used dataframes)
dflist <- list(df1,df2)
# bind the dataframes together into one datatable (which is an enhanced dataframe)
library(data.table)
DT <- rbindlist(dflist)
With the resulting data.table, you can now compute the required summary as follows:
DT[, .(screen_name = screen_name[.N],
length = .N,
life_in_seconds = difftime(tweet_created_at[.N], tweet_created_at[1], units="secs")),
by = .(retweet_screen_name)]
This results in:
retweet_screen_name screen_name length life_in_seconds
1: Lasthowen Hejinilim 5 2209 secs
2: cavurizmir Apolloniuss_58 3 149 secs
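Explanation:
- .N is a special data.table symbol that gives you the total number of rows in a group (or in the whole data.table when no grouping is used).
- screen_name[.N] gives you the last screen_name, because it is indexed by the number of rows in the group and therefore returns the last observation of each group. Likewise, screen_name[1] gives you the first observation of each group.
- difftime is more or less self-explanatory. With units you can specify how the time difference is expressed. See ?difftime for the possibilities.
- With by = you specify which columns are used to determine the groups.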
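The points above can be tried on a small, self-contained example. This is only a sketch: cascade_list and its contents are made-up stand-ins for the real cascade.list, and cascade_id is a hypothetical column name. It also uses the idcol argument of rbindlist to keep the list names, on the assumption that two cascades might share the same retweet_screen_name, in which case grouping by the list name would be the safer choice.

```r
library(data.table)

# Hypothetical stand-in for cascade.list: a named list of small data frames
cascade_list <- list(
  a = data.frame(screen_name = c("u1", "u2", "u3"),
                 tweet_created_at = as.POSIXct(c("2015-05-25 11:44:24",
                                                 "2015-05-25 11:45:32",
                                                 "2015-05-25 12:21:13"),
                                               tz = "UTC"),
                 retweet_screen_name = "Lasthowen",
                 stringsAsFactors = FALSE),
  b = data.frame(screen_name = c("u4", "u5"),
                 tweet_created_at = as.POSIXct(c("2015-06-06 15:22:18",
                                                 "2015-06-06 15:24:47"),
                                               tz = "UTC"),
                 retweet_screen_name = "cavurizmir",
                 stringsAsFactors = FALSE)
)

# idcol stores the name of each list element in a new column
DT <- rbindlist(cascade_list, idcol = "cascade_id")

DT[, .N]                   # without grouping: number of rows in the whole table
DT[, .N, by = cascade_id]  # with grouping: number of rows per group

# One summary row per list element
res <- DT[, .(retweet_screen_name = retweet_screen_name[1],
              screen_name = screen_name[.N],
              length = .N,
              life_in_seconds = as.numeric(difftime(tweet_created_at[.N],
                                                    tweet_created_at[1],
                                                    units = "secs"))),
          by = cascade_id]
print(res)
```

Wrapping difftime in as.numeric drops the "secs" label, which may be convenient if the column is used in further calculations.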
A similar operation can be done with dplyr:
library(dplyr)
newdf <- bind_rows(dflist)
newdf %>% group_by(retweet_screen_name) %>%
summarise(screen_name = last(screen_name),
length = n(),
life_in_seconds = difftime(last(tweet_created_at), first(tweet_created_at), units="secs"))
Data used:
df1 <- structure(list(screen_name = structure(c(3L, 1L, 5L, 4L, 2L), .Label = c("DURULMA_ZAMANI", "Hejinilim", "NerSeref", "saintserif2009", "ssari75"), class = "factor"), tweet_id = c(6.028628e+17, 6.028631e+17, 6.028647e+17, 6.028672e+17, 6.028721e+17), tweet_created_at = structure(c(1432547064, 1432547132, 1432547530, 1432548108, 1432549273), class = c("POSIXct", "POSIXt"), tzone = ""), retweet_screen_name = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "Lasthowen", class = "factor")), .Names = c("screen_name", "tweet_id", "tweet_created_at", "retweet_screen_name"), class = "data.frame", row.names = c(NA, -5L))
df2 <- structure(list(screen_name = structure(c(3L, 2L, 1L), .Label = c("Apolloniuss_58", "canfeda1923", "isa_sakar"), class = "factor"), tweet_id = c(6.072663e+17, 6.072666e+17, 6.072669e+17), tweet_created_at = structure(c(1433596938, 1433597014, 1433597087), class = c("POSIXct", "POSIXt"), tzone = ""), retweet_screen_name = structure(c(1L, 1L, 1L), .Label = "cavurizmir", class = "factor")), .Names = c("screen_name", "tweet_id", "tweet_created_at", "retweet_screen_name"), class = "data.frame", row.names = c(NA, -3L))
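For completeness, the approach from the question can probably be kept in base R with a small change. The sketch below assumes the problem comes from c(), which combines everything into a single vector of one type; returning a one-row data frame per list element keeps each column's type. The names cascade_list and get_summary and the sample values are illustrative, not taken from the real data.

```r
# Hypothetical stand-in for cascade.list
cascade_list <- list(
  a = data.frame(screen_name = c("u1", "u2", "u3"),
                 tweet_created_at = as.POSIXct(c("2015-05-25 11:44:24",
                                                 "2015-05-25 11:45:32",
                                                 "2015-05-25 12:21:13"),
                                               tz = "UTC"),
                 retweet_screen_name = "Lasthowen",
                 stringsAsFactors = FALSE),
  b = data.frame(screen_name = c("u4", "u5"),
                 tweet_created_at = as.POSIXct(c("2015-06-06 15:22:18",
                                                 "2015-06-06 15:24:47"),
                                               tz = "UTC"),
                 retweet_screen_name = "cavurizmir",
                 stringsAsFactors = FALSE)
)

# Build a one-row data frame per list element so column types are preserved
get_summary <- function(d) {
  n <- nrow(d)
  data.frame(
    retweet_screen_name = as.character(d$retweet_screen_name[1]),
    screen_name = as.character(d$screen_name[n]),
    length = n,
    life_in_seconds = as.numeric(difftime(d$tweet_created_at[n],
                                          d$tweet_created_at[1],
                                          units = "secs")),
    stringsAsFactors = FALSE
  )
}

# Stack the one-row summaries into a single data frame
cdf <- do.call(rbind, lapply(cascade_list, get_summary))
rownames(cdf) <- NULL
print(cdf)
```

For a long list, the data.table or dplyr versions are likely to be faster than calling rbind on many small data frames.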