Best way to add incremental numbers to a column in a .csv file
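I have a dataset (in the form of a .csv file) with a number of columns, one of which contains "genres" (of TV programmes). There are several other columns (one for programme title, one for episode number, one for synopsis, etc.). I'd like to create a new column that numbers each entry for "genre" consecutively. For example, the first instance of Documentary should be followed by "1", the second by "2", and so on. Then, when a new genre appears, the count should start again from "1". In case that isn't clear, this is what I mean: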
Documentary, 1
Documentary, 2
Documentary, 3
Documentary, 4
Drama, 1
Drama, 2
Drama, 3
Drama, 4
Drama, 5
Sport, 1
Sport, 2
Sport, 3
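If relevant, the number of times a genre appears varies. There are also several hundred .csv files I need to apply this to, so adding this data manually is not an option!
I was wondering whether anyone could advise me how to go about this? I'm not the most data-savvy person, so simple methods are appreciated! I've learnt a bit about R and suspect you could do this by writing a script involving an if/else loop (e.g. if the next field contains the same content as the previous one, add 1, otherwise start again from 1; excuse the poor syntax, but you get the idea!). I visualise this data in Tableau and noticed they now have Tableau Prep; perhaps it could be done there? Any solutions welcome!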
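There are a number of ways to do this in R. Below is one approach using functions from the tidyverse suite of packages. We first group by genre and then add a column that simply counts from 1 to the number of rows in that genre. I've provided two options for what the new column could look like, depending on your needs.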
library(tidyverse)

# Fake data
set.seed(2)
dat <- data.frame(genre = sample(c("Drama", "Comedy", "Sport", "Documentary"), 20, replace = TRUE))

# Add columns to number rows within each genre
dat <- dat %>%
  group_by(genre) %>%
  mutate(count = 1:n(),
         count2 = paste0(genre, ", ", 1:n()))

dat
genre count count2
1 Drama 1 Drama, 1
2 Sport 1 Sport, 1
3 Sport 2 Sport, 2
4 Drama 2 Drama, 2
5 Documentary 1 Documentary, 1
6 Documentary 2 Documentary, 2
7 Drama 3 Drama, 3
8 Documentary 3 Documentary, 3
9 Comedy 1 Comedy, 1
10 Sport 3 Sport, 3
11 Sport 4 Sport, 4
12 Drama 4 Drama, 4
13 Documentary 4 Documentary, 4
14 Drama 5 Drama, 5
15 Comedy 2 Comedy, 2
16 Documentary 5 Documentary, 5
17 Documentary 6 Documentary, 6
18 Drama 6 Drama, 6
19 Comedy 3 Comedy, 3
20 Drama 7 Drama, 7
If you want the data sorted, you can do that too, for example:
dat %>% arrange(genre, count)
genre count count2
1 Comedy 1 Comedy, 1
2 Comedy 2 Comedy, 2
3 Comedy 3 Comedy, 3
4 Documentary 1 Documentary, 1
5 Documentary 2 Documentary, 2
6 Documentary 3 Documentary, 3
7 Documentary 4 Documentary, 4
8 Documentary 5 Documentary, 5
9 Documentary 6 Documentary, 6
10 Drama 1 Drama, 1
11 Drama 2 Drama, 2
12 Drama 3 Drama, 3
13 Drama 4 Drama, 4
14 Drama 5 Drama, 5
15 Drama 6 Drama, 6
16 Drama 7 Drama, 7
17 Sport 1 Sport, 1
18 Sport 2 Sport, 2
19 Sport 3 Sport, 3
20 Sport 4 Sport, 4
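For readers who would rather not load extra packages, the same numbering can be sketched in base R. This is only an illustration on a made-up `genre` vector; `per_genre` and `per_run` are names chosen here and do not come from the answer above. `per_genre` numbers rows within each genre wherever they occur, while `per_run` restarts whenever the genre differs from the previous row, which is closer to the if/else idea in the question.

```r
# Made-up example data; "Documentary" reappears at the end on purpose
genre <- c("Documentary", "Documentary", "Drama", "Drama", "Drama",
           "Sport", "Documentary")

# Number rows within each genre, wherever they occur in the file
per_genre <- ave(seq_along(genre), genre, FUN = seq_along)

# Restart the count every time the genre differs from the previous row
per_run <- sequence(rle(genre)$lengths)

data.frame(genre, per_genre, per_run)
```

If each file is already sorted by genre, the two columns come out identical, so the choice only matters when a genre can reappear later in the file.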
library(dplyr)
library(tidyr)
df <- data.frame(genre = c("Documentary", "Documentary", "Documentary", "Sport", "Sport", "Drama"), rating = c(2,2,4,4,6,6))
df %>% group_by(genre) %>% mutate(id = row_number()) %>% unite(genre_number, c("genre", "id"), sep = " ")
# A tibble: 6 x 2
genre_number rating
<chr> <dbl>
1 Documentary 1 2
2 Documentary 2 2
3 Documentary 3 4
4 Sport 1 4
5 Sport 2 6
6 Drama 1 6
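Note that `unite()` drops its input columns by default, so `genre` and `id` disappear from the result above. If the original `genre` column should stay alongside the new one, a possible variation is to pass `remove = FALSE`. This is a sketch on the same toy data rather than part of the answer itself, and `out` is just an illustrative name.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  genre  = c("Documentary", "Documentary", "Documentary", "Sport", "Sport", "Drama"),
  rating = c(2, 2, 4, 4, 6, 6)
)

out <- df %>%
  group_by(genre) %>%
  mutate(id = row_number()) %>%   # 1, 2, 3, ... within each genre
  ungroup() %>%
  unite(genre_number, genre, id, sep = " ", remove = FALSE)  # keep genre and id

out
```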
Edit: To handle your batch of files, you can wrap any of this in a function and apply it to a list of files.
library(dplyr)
library(tidyr)

# Number the rows within each genre and merge genre + number into one column
number_genres <- function(x) {
  x %>%
    group_by(genre) %>%
    mutate(id = row_number()) %>%
    unite(genre_number, c("genre", "id"), sep = " ")
}

dir <- "C:/Documents/test" # location of your .csv files
# pattern is a regular expression; full.names = TRUE so read.csv can find the files
filenames <- list.files(path = dir, pattern = "\\.csv$", full.names = TRUE)
data_list <- lapply(filenames, read.csv) # reads your files
names(data_list) <- filenames # names your list with the respective csv paths
numbered <- lapply(data_list, number_genres) # apply your function to your data_list
# writes the data back to the same paths (this overwrites the original files)
invisible(lapply(seq_along(numbered), function(i) {
  write.csv(numbered[[i]], file = names(numbered)[i], row.names = FALSE)
}))
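Since several hundred files are involved, it may be safer to write the numbered copies into a separate folder rather than over the originals. The following is a minimal base R sketch under that assumption; `add_genre_id`, `number_all`, the `genre_id` column, and the temporary folders are all hypothetical names used for illustration, and it assumes every file has a column called `genre`.

```r
# Hypothetical helper: add a per-genre counter to one data frame (base R only)
add_genre_id <- function(x) {
  x$genre_id <- ave(seq_along(x$genre), x$genre, FUN = seq_along)
  x
}

# Hypothetical helper: number every .csv in in_dir and write copies to out_dir
number_all <- function(in_dir, out_dir) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  paths <- list.files(in_dir, pattern = "\\.csv$", full.names = TRUE)
  for (p in paths) {
    x <- read.csv(p, stringsAsFactors = FALSE)
    write.csv(add_genre_id(x), file.path(out_dir, basename(p)), row.names = FALSE)
  }
  invisible(basename(paths))
}

# Try it on a throwaway folder before pointing it at real data
in_dir  <- file.path(tempdir(), "csv_in")
out_dir <- file.path(tempdir(), "csv_out")
dir.create(in_dir, showWarnings = FALSE)
write.csv(data.frame(genre = c("Drama", "Drama", "Sport"), title = c("a", "b", "c")),
          file.path(in_dir, "example.csv"), row.names = FALSE)
number_all(in_dir, out_dir)
read.csv(file.path(out_dir, "example.csv"))
```

Once the output looks right on the test folder, `in_dir` and `out_dir` can be replaced with the real locations.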