Drop rows after criteria

Question

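I have some data to clean up, and I noticed that I have 150 files in which the later rows are subsets of the earlier rows. Is there a way to drop everything after a certain criterion appears? I am not sure how to write out sample data in code for this, so I have listed an example of the data as text below. I want to drop "Section 2" and all rows below it.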

Name,Age,Address
Section 1,,
Abby,10,1 Baker St
Alice,12,3 Main St
Becky,13,156 F St
Ben,14,2 18th St
Cameron,15,4 Journey Road
Danny,16,123 North Ave
Eric,17,325 Hill Blvd
,,
Section 2,,
Abby,10,1 Baker St
Alice,12,3 Main St
Becky,13,156 F St
Ben,14,2 18th St
,,
Section 3,,
Becky,13,156 F St
Ben,14,2 18th St
Cameron,15,4 Journey Road
Danny,16,123 North Ave
,,
Section 5,,
Alice,12,3 Main St
Becky,13,156 F St
Ben,14,2 18th St
Cameron,15,4 Journey Road
Danny,16,123 North Ave
Eric,17,325 Hill Blvd

Expected output

Name,Age,Address
Section 1,,
Abby,10,1 Baker St
Alice,12,3 Main St
Becky,13,156 F St
Ben,14,2 18th St
Cameron,15,4 Journey Road
Danny,16,123 North Ave
Eric,17,325 Hill Blvd

Answer 1

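Here, I "read" your data by calling strsplit with the newline character as the separator. If you are doing this from a file, you can use readLines instead.

I use grep to find the line number that contains "Section 2" and use it to subset raw_data, stopping two lines earlier so that the ",," separator row is dropped as well. I then remove the lines that start with "Section", paste0(..., collapse = "\n") what is left, and parse it with read.table using sep = "," and header = TRUE, as if I had read only that section with read.csv.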

raw_data <- strsplit(split = "\n", "Name,Age,Address
                  Section 1,,
                  Abby,10,1 Baker St
                  Alice,12,3 Main St
                  Becky,13,156 F St
                  Ben,14,2 18th St
                  Cameron,15,4 Journey Road
                  Danny,16,123 North Ave
                  Eric,17,325 Hill Blvd
                  ,,
                  Section 2,,
                  Abby,10,1 Baker St
                  Alice,12,3 Main St
                  Becky,13,156 F St
                  Ben,14,2 18th St
                  ,,
                  Section 3,,
                  Becky,13,156 F St
                  Ben,14,2 18th St
                  Cameron,15,4 Journey Road
                  Danny,16,123 North Ave
                  ,,
                  Section 5,,
                  Alice,12,3 Main St
                  Becky,13,156 F St
                  Ben,14,2 18th St
                  Cameron,15,4 Journey Road
                  Danny,16,123 North Ave
                  Eric,17,325 Hill Blvd")

section2_idx <- grep('Section 2', raw_data[[1]])

raw_data_clean <- trimws(raw_data[[1]][1:(section2_idx-2)])

allsect_idx <- grep('^Section', raw_data_clean)

# Only drop section header lines if any were found
if (length(allsect_idx) > 0) 
   raw_data_clean <- raw_data_clean[-allsect_idx]

read.table(text = paste0(raw_data_clean, collapse="\n"), sep=",", header = TRUE)

#>      Name Age        Address
#> 1    Abby  10     1 Baker St
#> 2   Alice  12      3 Main St
#> 3   Becky  13       156 F St
#> 4     Ben  14      2 18th St
#> 5 Cameron  15 4 Journey Road
#> 6   Danny  16  123 North Ave
#> 7    Eric  17  325 Hill Blvd

Created on 2020-12-06 by the reprex package (v0.3.0)
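
The steps in this answer could be wrapped in a small function so that the same logic can be reused on any vector of lines. The following is only a sketch: the name drop_after_marker and its marker argument are hypothetical, and it assumes that separator rows consist solely of commas and that section headers start with "Section", as in the sample data.

```r
# Hypothetical helper: keep the lines above the first line containing `marker`,
# drop separator and section header rows, then parse the rest as CSV.
drop_after_marker <- function(lines, marker = "Section 2") {
  lines <- trimws(lines)
  hit <- grep(marker, lines, fixed = TRUE)
  if (length(hit) > 0) {
    # Keep everything strictly above the first match
    lines <- lines[seq_len(hit[1] - 1)]
  }
  # Assumption: separator rows contain only commas, headers start with "Section"
  keep <- !grepl("^,*$", lines) & !grepl("^Section", lines)
  read.csv(text = paste(lines[keep], collapse = "\n"), header = TRUE)
}

sample_lines <- c(
  "Name,Age,Address",
  "Section 1,,",
  "Abby,10,1 Baker St",
  "Alice,12,3 Main St",
  ",,",
  "Section 2,,",
  "Abby,10,1 Baker St"
)

result <- drop_after_marker(sample_lines)
print(result)
```

Note that a plain "Section 2" marker would also match a line such as "Section 20", so a stricter pattern may be needed if the files contain that many sections.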

Answer 2

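Here is a made-up example that avoids typing in your starting data.

mixed_data has 500 elements, each of which is a string containing two commas. If your data looks like your example, there is no need to split the strings.

Create an empty vector to hold just one of each value. Then loop through the whole mixed vector and add each entry that has not been seen yet to that vector. This example yielded 444 unique items in one_of_each from the original 500 items in mixed_data.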

set.seed(101)

a <- sample(LETTERS,500, replace = TRUE)
b <- sample(letters,500, replace = TRUE)
d <- sample(c(1:3),500, replace = TRUE)

mixed_data <- paste0(a,",",b,",",d)
head(mixed_data)

one_of_each <- c()  #starts empty

for (i in seq_along(mixed_data)) {

  if (!(mixed_data[i] %in% one_of_each)) {
    one_of_each <- c(one_of_each, mixed_data[i])  # if not found, then add
  }

}
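
As a side note, the loop above should give the same result as base R's unique(), which keeps the first occurrence of each value in its original order. The sketch below rebuilds the same made-up data and compares the two approaches; it is meant as an illustration, not as a replacement for the answer's method.

```r
set.seed(101)

a <- sample(LETTERS, 500, replace = TRUE)
b <- sample(letters, 500, replace = TRUE)
d <- sample(c(1:3), 500, replace = TRUE)
mixed_data <- paste0(a, ",", b, ",", d)

# Loop version from the answer
one_of_each <- c()
for (i in seq_along(mixed_data)) {
  if (!(mixed_data[i] %in% one_of_each)) {
    one_of_each <- c(one_of_each, mixed_data[i])
  }
}

# Vectorised equivalents: both keep the first occurrence of each string
via_unique <- unique(mixed_data)
via_duplicated <- mixed_data[!duplicated(mixed_data)]

identical(one_of_each, via_unique)
#> [1] TRUE
```

The vectorised form also avoids growing a vector inside a loop, which tends to matter once the input gets large.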

Answer 3

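Assuming your text file is called temp.txt, you can read it in with readLines, find the line that contains 'Section 2', and read only the lines above it. Subtracting 2 also skips the ",," separator row directly above that line.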

tmp <- readLines('temp.txt')
inds <- grep('Section 2', tmp) - 2
data <- read.csv(text = paste0(tmp[1:inds], collapse = '\n'))
data
#       Name Age        Address
#1 Section 1  NA               
#2      Abby  10     1 Baker St
#3     Alice  12      3 Main St
#4     Becky  13       156 F St
#5       Ben  14      2 18th St
#6   Cameron  15 4 Journey Road
#7     Danny  16  123 North Ave
#8      Eric  17  325 Hill Blvd
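
Since the question mentions 150 files, the same idea could be applied to every file in a folder with lapply. This is a minimal sketch under a few assumptions: read_until_marker, demo_dir, and the .csv extension are hypothetical, and each file is assumed to have exactly one ",," separator line directly above the marker, as in the sample data. The two throwaway files exist only to make the example runnable.

```r
# Hypothetical helper: read only the rows above the first line containing `marker`.
read_until_marker <- function(path, marker = "Section 2") {
  tmp <- readLines(path)
  hit <- grep(marker, tmp, fixed = TRUE)
  # Assumption: one ",," separator line sits right above the marker.
  # If the marker is absent, keep the whole file.
  last <- if (length(hit) > 0) hit[1] - 2 else length(tmp)
  read.csv(text = paste0(tmp[seq_len(last)], collapse = "\n"))
}

# Demo with two throwaway files; in practice, point list.files() at the real folder.
demo_dir <- file.path(tempdir(), "drop_rows_demo")
dir.create(demo_dir, showWarnings = FALSE)
example <- c("Name,Age,Address", "Section 1,,", "Abby,10,1 Baker St",
             ",,", "Section 2,,", "Abby,10,1 Baker St")
writeLines(example, file.path(demo_dir, "a.csv"))
writeLines(example, file.path(demo_dir, "b.csv"))

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
cleaned <- lapply(files, read_until_marker)
names(cleaned) <- basename(files)
```

Each element of cleaned is then a data frame for one file. As in the output above, the "Section 1" row is still present and would need to be filtered out separately.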

