Grouping occurrences of a string into one row

tl;dr Is there a way to combine a large number of values into a single column without the values being truncated?


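I am working in RStudio with a data frame that contains 48,178 entries. The data frame has two columns: the first contains unique numeric values, the other contains repeated strings.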

----------
id    name
1     forest
2     forest
3     park
4     riverbank
.
.
.
.
.
48178   water
----------

I want to group all entries by the unique values in column 2. I used ddply (from the plyr package) to get the result. I now have the following derived table:

----------
type         V1
forest       forest,forest,forest
park         park,park,park,park
riverbank    riverbank,riverbank,
water        water,water,water,water
----------

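However, when I apply the str function to the derived data frame, I find that the column contains truncated values rather than every instance of each string.

The output of str is: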

'data.frame':   4 obs. of  2 variables:
 $ type: chr  "forest" "park" "riverbank" "water"
 $ V1  : chr  "forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,f"| __truncated__ "park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,pa"| __truncated__ "riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverba"| __truncated__ "water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,w"| __truncated__

How can I group identical strings together and put them into one row without truncation?

If all you want is the number of occurrences, why not simply use table?

df <- read.table(header = TRUE, text = "id    name
1     forest
2     forest
3     park
4     riverbank")
df
# Count how many times each name occurs
df1 <- as.data.frame(table(df$name))

# If for some reason you want the repetitions themselves, then:
# SIMPLIFY = FALSE keeps a list even when all counts happen to be equal
x <- mapply(rep, as.character(df1$Var1), df1$Freq,
            SIMPLIFY = FALSE, USE.NAMES = FALSE)
y <- sapply(x, paste, collapse = ",")
data.frame(type = df1$Var1, V1 = y)

Try the base R split() function and store the result in a list:

new.list <- split(df, f = df$name)

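This splits the data frame into several data frames, one per group, which you can access with square brackets. Because each record stays in its own cell, the strings are never merged, so nothing gets truncated.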
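For illustration, here is a minimal sketch of how the resulting list might be used. It assumes the grouping column is called name, as in the question's sample data, and the four-row toy data frame is only a stand-in for the real one.

```r
# Toy stand-in for the question's data frame (assumed column names: id, name)
df <- data.frame(id = 1:4,
                 name = c("forest", "forest", "park", "riverbank"),
                 stringsAsFactors = FALSE)

# One data frame per distinct name; list elements are named after the groups
new.list <- split(df, f = df$name)

# Access a single group with double square brackets
new.list[["forest"]]

# Number of rows (occurrences) in each group
sapply(new.list, nrow)
```

Each list element keeps the original id values, so no information is lost by grouping this way.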

Your strings are not really truncated; only the way str displays them is truncated:

size <- 48000
df <- data.frame(1:size, 
                 type=sample(c("forest", "park", "riverbank", "water" ), 
                             size, replace = TRUE), 
                 stringsAsFactors = FALSE)

res <- by(df$type , df$type, paste, collapse=",")


str(res)
 'by' chr [1:4(1d)] "forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,f"| __truncated__ ...
 - attr(*, "dimnames")=List of 1
  ..$ df$type: chr [1:4] "forest" "park" "riverbank" "water"
 - attr(*, "call")= language by.default(data = df$type, INDICES = df$type, FUN = paste, collapse = ",")


lengths( strsplit(res, ','))
   forest      park riverbank     water 
    11993     12017     11953     12037 

sum(lengths( strsplit(res, ',')))
[1] 48000
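Another way to convince yourself, sketched here on smaller toy data with a fixed seed (the size of 1000 and the use of tapply are my choices for illustration, not part of the original data), is to compare the stored string lengths with what the counts predict: each group should hold count * nchar(name) characters plus count - 1 commas.

```r
# Toy data, same shape as above but smaller and reproducible
set.seed(1)
size <- 1000
df <- data.frame(id = 1:size,
                 type = sample(c("forest", "park", "riverbank", "water"),
                               size, replace = TRUE),
                 stringsAsFactors = FALSE)

# One comma-separated string per type
joined <- tapply(df$type, df$type, paste, collapse = ",")

# Expected length per group: all the names plus the separating commas
counts <- table(df$type)
n <- as.vector(counts)
expected_chars <- n * nchar(names(counts)) + (n - 1)

# Actual length of what is stored
stored_chars <- nchar(as.vector(joined))

all(stored_chars == expected_chars)
```

If the last line returns TRUE, every instance is present in the stored strings and only the printed summary was shortened.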

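Expanding on HubertL's answer: the str() function is doing exactly what it is designed to do, but it is probably the wrong tool for what you intend to do.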

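From the (rather limited) information you provided in your question, it seems you have already achieved what you were looking for, namely concatenating all strings of the same type.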

However, you seem to have stumbled over the output of the str() function.

See the help page ?str

From the Description section:

Compactly display the internal structure of an R object, a diagnostic function and an alternative to summary (and to some extent, dput). Ideally, only one line for each ‘basic’ structure is displayed. It is especially well suited to compactly display the (abbreviated) contents of (possibly nested) lists. The idea is to give reasonable output for any R object.

str() has a parameter nchar.max, which defaults to 128.

nchar.max maximal number of characters to show for character strings. Longer strings are truncated, see longch example below.

The longch example in the Examples section illustrates the effect of this parameter:

nchar(longch <- paste(rep(letters,100), collapse = ""))
#[1] 2600
str(longch)
# chr "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvw"| __truncated__
str(longch, nchar.max = 52)
# chr "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxy"| __truncated__
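Applied to a value like the ones in your derived table, this could look as follows. This is only a sketch: long_cell is a made-up stand-in for one cell of the V1 column, and the margin of 10 added to nchar.max is an arbitrary allowance for the surrounding quotes.

```r
# Hypothetical stand-in for one cell of the grouped V1 column
long_cell <- paste(rep("forest", 500), collapse = ",")
nchar(long_cell)

# Default display: abbreviated because the string exceeds nchar.max (128)
str(long_cell)

# Raising nchar.max above the string length displays the whole value
str(long_cell, nchar.max = nchar(long_cell) + 10)

# To look at just a piece of the actual value, substr() may be handier
substr(long_cell, 1, 40)
```

In both str() calls the object itself is identical; only the amount of text printed changes.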

Maximum length of a character string

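According to ?"Memory-limits", the number of bytes in a character string is limited to 2^31 - 1 ~ 2*10^9. Given the number of rows in the data frame and the length of name, the concatenated strings will not exceed 0.6*10^6 bytes, which is nowhere near the limit.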
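The estimate can be checked with a rough worst-case calculation. The sketch below assumes that "riverbank" (9 characters) is the longest name, which is only true of the four values shown in the question, and that every one of the 48,178 rows ended up in a single group.

```r
# Worst case: all rows share the longest name and are joined by commas
n_rows <- 48178L
longest_name <- "riverbank"   # assumption: longest value among those shown

upper_bound <- n_rows * nchar(longest_name) + (n_rows - 1L)
upper_bound                         # 481779

# Compare with the per-string limit of 2^31 - 1 bytes
.Machine$integer.max
upper_bound < .Machine$integer.max  # TRUE
```

Even this deliberately pessimistic bound stays below 0.6*10^6, several orders of magnitude under the limit.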