How to split an ASCII file by integers/digits?

Question

Suppose I have an ASCII text file that looks like this:

I would like to split it by integer so that it becomes

v1 v2 v3 v4 v5
1  2  3  4  5

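That is, each integer becomes its own variable. I know I can use read.fwf in R, but since my dataset has nearly 500 variables, is there a better way to split the integers into their own columns than writing widths=c(1,) and repeating "1" 500 times?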

I have also tried importing the ASCII file into Excel and SPSS, but neither lets me insert variable breaks at a fixed one-digit spacing.

Answer 1

You can read one line as-is to determine the width of the file, then pass that width to read_fwf. Using tidyverse functions:

library(readr)
library(stringr)

path <- "path_to_data.txt" # your path

# read only the first line, as raw text, so that a long run of digits
# is not parsed as a number (which would lose leading zeros and precision)
first_line <- read_lines(path, n_max = 1)
filewidth <- str_length(first_line) # width of first row

# use fwf with specified number of columns
df <- read_fwf(path, fwf_widths(rep(1, filewidth)))

Answer 2

Here is an option using read.fwf(), which was your original choice.

# for the example only, a two line source with different line lengths
input <- textConnection("12345\n6789")

df1 <- read.fwf(input, widths = rep(1, 500))

ncol(df1)
# [1] 500

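But assuming you actually have fewer than 500 columns (as you said, and as is the case in this example), you can drop the extra columns in which every value is NA, as follows. This uses the longest row to determine how many columns to keep.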

# keep a column if at least one row has a value in it
df1 <- df1[, apply(!is.na(df1), 2, any)]

df1
#   V1 V2 V3 V4 V5
# 1  1  2  3  4  5
# 2  6  7  8  9 NA

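However, if missing values are not acceptable, use all() instead, so that the shortest row determines how many columns to keep.

# keep a column only if every row has a value in it
df1 <- df1[, apply(!is.na(df1), 2, all)]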

df1
#   V1 V2 V3 V4
# 1  1  2  3  4
# 2  6  7  8  9

Of course, if you know the exact line length and every line has the same length, just use widths = rep(1, x) with x set to that known length.
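
If the line length is not known in advance, one possible way to obtain x is to measure the lines before calling read.fwf(). The following is only a minimal base R sketch, assuming the file is small enough to read in full with readLines(); the temporary file, its two sample lines, and the name df2 are made up for illustration.

```r
# hypothetical sample data written to a temporary file (illustration only)
tmp <- tempfile(fileext = ".txt")
writeLines(c("12345", "6789"), tmp)

# x = number of characters in the longest line of the file
x <- max(nchar(readLines(tmp)))

# one single-character column per position; shorter lines are filled with NA
df2 <- read.fwf(tmp, widths = rep(1, x))

dim(df2)
```

Using max() keeps every digit of the longest line; swapping in min() would instead truncate all lines to the shortest one.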

Answer 3

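If you are using Excel 2010 or later, you can import the file with Power Query (also known as Get & Transform). When you edit the query, there is an option to split a column by a specified number of characters.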

This tool is built into Excel 2016 and is available as a free Microsoft add-in for Excel 2010 and 2013.
