How to pop out non-matching column names in a series of csv files?

Question

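I am reading multiple csv files (20 files) and combining them into a single data frame. When I check them manually by eye, the column names look identical. However, for some reason I get the following error.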

Error in match.names(clabs, names(xi)) : names do not match previous names

Here is the code I wrote:

fnames <- list.files("C:/Users/code",pattern='^La') # getting all the files from directory. Update it as required
csv <- lapply(fnames,read.csv)  # reading all the files
source_DF <- do.call(rbind, lapply(csv, '[', 1:8)) # This is the line where it throws error

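Note that I use 1:8 because R sometimes reads an uneven number of columns. For example, all my csv files have only 8 columns, but when they are read in, some end up with 12 columns and some even with 50. I subset with 1:8 to avoid that. Any other way of reading only the first 8 columns is also welcome.

How can I find out which csv file has this naming problem, and which column is causing it?

Any help resolving this error would be much appreciated.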

Answer 1

I would use a loop here and check each set of names against the previous one:

dfs <- list(
  data.frame(foo = 1, bar = 2),
  data.frame(foo = 2, bar = 2),
  data.frame(foo = 3, baz = 2),
  data.frame(foo = 4, bar = 2)
)

for (i in seq_len(length(dfs) - 1)) {
  different <- names(dfs[[i]]) != names(dfs[[i + 1]])
  if (any(different)) {
    message("Names of column(s) ", paste(which(different), collapse = ", "),
            " in data frame ", i + 1, " differ from the previous ones.")
  }
}
#> Names of column(s) 2 in data frame 3 differ from the previous ones.
#> Names of column(s) 2 in data frame 4 differ from the previous ones.

Or, if you only want to store the mismatches:

mismatches <- list(integer())
for (i in seq_len(length(dfs) - 1)) {
  different <- names(dfs[[i]]) != names(dfs[[i + 1]])
  mismatches[[i + 1]] <- which(different)
}

str(mismatches)
#> List of 4
#>  $ : int(0) 
#>  $ : int(0) 
#>  $ : int 2
#>  $ : int 2

Created on 2019-09-05 by the reprex package (v0.3.0.9000)
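The loop above compares each data frame with its predecessor and assumes every data frame has the same number of columns. The following is a minimal sketch of a variant, not part of the original answer, that compares every data frame against the first one, tolerates differing column counts, and reports results by file name. The file names and the `extra` column are hypothetical; in the question the names would come from `fnames`.

```r
# Hypothetical file names standing in for the question's `fnames`.
dfs <- list(
  "La_01.csv" = data.frame(foo = 1, bar = 2),
  "La_02.csv" = data.frame(foo = 2, bar = 2),
  "La_03.csv" = data.frame(foo = 3, baz = 2),
  "La_04.csv" = data.frame(foo = 4, bar = 2, extra = 0)
)

# Assumption: the first file carries the expected column names.
reference <- names(dfs[[1]])

# For each data frame, find the column positions whose names differ from
# the reference. Indexing past the end of a name vector yields NA, so a
# missing or surplus column compares as NA and is counted as a mismatch.
mismatches <- lapply(dfs, function(df) {
  n <- max(length(reference), ncol(df))
  different <- reference[seq_len(n)] != names(df)[seq_len(n)]
  which(different | is.na(different))
})

# Keep only the files that have at least one mismatching column.
mismatches <- Filter(length, mismatches)
str(mismatches)
```

Comparing against a fixed reference flags only the files that deviate from it, whereas comparing neighbours also flags the file that follows an odd one, as data frame 4 is flagged in the output above.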

Answer 2

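One way to check is to subset the first 8 columns of each data frame, get the names that are common to all data frames, and then use setdiff to find out whether any column names do not match.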

list_df <- lapply(csv, '[', 1:8)
cols <- Reduce(intersect, lapply(list_df, names))
lapply(list_df, function(x) setdiff(names(x), cols))

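If all your column names are the same, you should get character(0) as the output for each data frame. If there is any mismatch, setdiff will show the names of the offending columns.

Another quick check: is length(cols) equal to 8? If it is smaller, at least one of the 8 column names is not shared by every file.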
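To see which file each result belongs to, the list can be named by file before taking the differences. This is only a sketch with made-up stand-ins for `fnames` and `csv` (the file names and the columns `id`, `value` and `Value` are hypothetical), and it skips the `'[', 1:8` subsetting because the toy data frames have just two columns.

```r
# Stand-ins for the question's `fnames` and `csv`; all names are hypothetical.
fnames <- c("La_01.csv", "La_02.csv", "La_03.csv")
csv <- list(
  data.frame(id = 1, value = 2),
  data.frame(id = 2, value = 3),
  data.frame(id = 3, Value = 4)  # differs only in capitalisation
)

# Name each data frame by its file so the output identifies the file.
list_df <- setNames(csv, fnames)

# Column names present in every file.
cols <- Reduce(intersect, lapply(list_df, names))

# Per file, the names that are not shared by every file;
# files with no such names are dropped from the result.
odd_names <- Filter(length, lapply(list_df, function(x) setdiff(names(x), cols)))
str(odd_names)
```

Because the comparison is against the names common to every file, files using the majority spelling are listed too. Reading the entries side by side shows which file is the odd one out.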

r

lapply

dataframe

sapply

rbind