Reading a chunked file into a data frame
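I'm new to pandas/R, and I'm not quite sure how to read this data into pandas or R for analysis.
Currently, I'm thinking I could use readr's read_chunkwise or pandas's chunksize, but that may not be what I need. Is this really something that can be solved easily with a for loop, or by using purrr to iterate over all the elements?
Data: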
wine/name: 1981 Château de Beaucastel Châteauneuf-du-Pape
wine/wineId: 18856
wine/variant: Red Rhone Blend
wine/year: 1981
review/points: 96
review/time: 1160179200
review/userId: 1
review/userName: Eric
review/text: Olive, horse sweat, dirty saddle, and smoke. This actually got quite a bit more spicy and expressive with significant aeration. This was a little dry on the palate first but filled out considerably in time, lovely, loaded with tapenade, leather, dry and powerful, very black olive, meaty. This improved considerably the longer it was open. A terrific bottle of 1981, 96+ and improving. This may well be my favorite vintage of Beau except for perhaps the 1990.
wine/name: 1995 Château Pichon-Longueville Baron
wine/wineId: 3495 wine/variant: Red Bordeaux Blend
wine/year: 1995
review/points: 93
review/time: 1063929600
review/userId: 1
review/userName: Eric
review/text: A remarkably floral nose with violet and chambord. On the palate this is super sweet and pure with a long, somewhat searing finish. My notes are very terse, but this was a lovely wine.
Currently, this is my function, but I'm running into an error:
convertchunkfile <- function(df){
  for(i in 1:length(df)){
    #While the length of any line is not 0, process it with the following loop
    while(nchar(df[[i]]) != 0){
      case_when(

        #When data at x index == wine/name, then extract the data after that clause
        #Wine Name parsing
        cleandf$WineName[[i]] <- df[i] == str_sub(df[1],0, 10) ~ str_trim(substr(df[1], 11, nchar(df[1]))),
        #Wine ID parsing
        cleandf$WineID[[i]] <- df[i] == str_sub(df[2],0,11) ~ str_trim(substr(df[2], 13, nchar(df[1])))
        #same format for other attributes
      )
    }
  }
}
Error in cleandf$BeerName[[i]] <- df[i] == str_sub(df[1], 0, 10) ~ str_trim(substr(df[1], :
more elements supplied than there are to replace
Edit:
After working through a few issues, I think this may be the best solution, borrowing from @hereismyname's solution:
#Use Bash's iconv to force convert the file in OS X
iconv -c -t UTF-8 cellartracker-clean.txt > cellartracker-iconv.txt
#Check number of lines within the file
wc -l cellartracker-iconv.txt
20259950 cellartracker-iconv.txt
#Verify new encoding of the file
file -I cellartracker-iconv.txt
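For anyone doing the clean-up step from Python instead of the shell, here is a rough sketch of the same idea: copy the file while dropping bytes that cannot be decoded. It assumes the file is supposed to be UTF-8 in the first place, which the iconv call above does not guarantee. The function name reencode_to_utf8 and the temporary file names are made up for illustration.

```python
import codecs
import os
import tempfile


def reencode_to_utf8(src_path, dst_path, block_size=1 << 20):
    """Copy src_path to dst_path, silently dropping bytes that are not valid UTF-8."""
    # An incremental decoder copes with multi-byte characters split across blocks.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="ignore")
    with open(src_path, "rb") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for block in iter(lambda: src.read(block_size), b""):
            dst.write(decoder.decode(block))
        # Flush whatever the decoder is still holding at end of file.
        dst.write(decoder.decode(b"", final=True))


# Tiny demonstration on a temporary file; the byte \xff is never valid in UTF-8.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "raw.txt")      # hypothetical file names
    dst = os.path.join(tmp, "clean.txt")
    with open(src, "wb") as fh:
        fh.write(b"wine/name: Ch\xc3\xa2teau\xff\n")
    # A block size of 14 deliberately splits the two-byte character in half.
    reencode_to_utf8(src, dst, block_size=14)
    with open(dst, "rb") as fh:
        cleaned = fh.read()
```

Reading in fixed-size blocks keeps memory use flat, which matters for a file with roughly 20 million lines.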
ReadEmAndWeep <- function(file, chunk_size) {
  f <- function(chunk, pos) {
    data_frame(text = chunk) %>%
      filter(text != "") %>%
      separate(text, c("var", "value"), ":", extra = "merge") %>%
      mutate(
        chunk_id = rep(1:(nrow(.) / 9), each = 9),
        value = trimws(value)
      ) %>%
      spread(var, value)
  }
  read_lines_chunked(file, DataFrameCallback$new(f), chunk_size = chunk_size)
}
#Final Function call to read in the file
dataframe <- ReadEmAndWeep(file, chunk_size = 100000)
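A fixed line-count chunk only lines up with record boundaries when every record has the same number of lines. As a hedged alternative for the pandas side, the sketch below streams records one at a time and groups them into batches, so a record is never cut in half. It assumes blank lines separate records and that every non-blank line looks like `key: value`; the names iter_records and iter_batches are hypothetical helpers, not part of pandas.

```python
from itertools import islice


def iter_records(lines):
    """Yield one dict per review; a blank line ends the current record."""
    record = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            if record:
                yield record
                record = {}
            continue
        # Split on the first colon only, so colons inside the value survive.
        key, _, value = line.partition(":")
        record[key.strip()] = value.strip()
    if record:  # the file may not end with a blank line
        yield record


def iter_batches(records, batch_size):
    """Group an iterator of records into lists of at most batch_size items."""
    records = iter(records)
    while True:
        batch = list(islice(records, batch_size))
        if not batch:
            return
        yield batch


sample = [
    "wine/name: 1981 Château de Beaucastel Châteauneuf-du-Pape",
    "wine/wineId: 18856",
    "",
    "wine/name: 1995 Château Pichon-Longueville Baron",
    "wine/wineId: 3495",
]
batches = list(iter_batches(iter_records(sample), batch_size=1))
```

With a real file, passing the open file object as `lines` would work the same way, and each batch (a list of dicts) could be handed to `pd.DataFrame` before being processed or appended to disk.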
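Here is some code that reads these records into a pandas.DataFrame. The records are structured like YAML records, so this code takes advantage of that fact. Blank lines serve as record separators.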
import pandas as pd
import collections
import yaml

def read_records(lines):
    # keep track of the columns in an ordered set
    columns = collections.OrderedDict()

    record = []
    records = []
    for line in lines:
        if line:
            # gather each line of text until a blank line
            record.append(line)

            # keep track of the columns seen in an ordered set
            columns[line.split(':')[0].strip()] = None

        # if the line is empty and we have a record, then convert it
        elif record:
            # use yaml to convert the lines into a dict
            # (safe_load, because yaml.load needs an explicit Loader in PyYAML 6)
            records.append(yaml.safe_load('\n'.join(record)))
            record = []

    # record last record
    if record:
        records.append(yaml.safe_load('\n'.join(record)))

    # return a pandas dataframe from the list of dicts
    return pd.DataFrame(records, columns=list(columns.keys()))
Test code:
print(read_records(data))
Results:
                                 wine/name  wine/wineId  \
0  1981 Château de Beaucastel Châteaune...        18856
1     1995 Château Pichon-Longueville Baron         3495

         wine/variant  wine/year  review/points  review/time  review/userId  \
0     Red Rhone Blend       1981             96   1160179200              1
1  Red Bordeaux Blend       1995             93   1063929600              1

  review/userName                                        review/text
0            Eric  Olive, horse sweat, dirty saddle, and smoke. T...
1            Eric  A remarkably floral nose with violet and chamb...
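One caveat with leaning on YAML: values are interpreted by YAML's own rules, so a review text that contains a colon followed by a space would not load as a plain string. A minimal alternative sketch, assuming every non-blank line is `key: value`, splits on the first colon only and converts the numeric columns afterwards. The helper name records_to_frame and the review text in the sample are hypothetical, invented only to show the colon case.

```python
import pandas as pd


def records_to_frame(lines):
    """Build a DataFrame of strings from blank-line-separated 'key: value' lines."""
    records, record = [], {}
    for line in lines:
        if line.strip():
            # Only the first colon separates key from value.
            key, _, value = line.partition(":")
            record[key.strip()] = value.strip()
        elif record:
            records.append(record)
            record = {}
    if record:
        records.append(record)
    return pd.DataFrame(records)


lines = [
    "wine/name: 1995 Château Pichon-Longueville Baron",
    "wine/year: 1995",
    "review/points: 93",
    # Hypothetical review text, made up to include a colon inside the value.
    "review/text: Hypothetical note: contains a colon",
]
frame = records_to_frame(lines)

# Everything is a string at this point; convert the columns known to be numeric.
numeric_cols = ["wine/year", "review/points"]
frame[numeric_cols] = frame[numeric_cols].apply(pd.to_numeric, errors="coerce")
```

Keeping everything as strings first and converting selected columns explicitly avoids surprises from automatic type guessing, at the cost of listing the numeric columns by hand.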
Test data:
data = [x.strip() for x in """
wine/name: 1981 Château de Beaucastel Châteauneuf-du-Pape
wine/wineId: 18856
wine/variant: Red Rhone Blend
wine/year: 1981
review/points: 96
review/time: 1160179200
review/userId: 1
review/userName: Eric
review/text: Olive, horse sweat, dirty saddle, and smoke. This actually got quite a bit more spicy and expressive with significant aeration. This was a little dry on the palate first but filled out considerably in time, lovely, loaded with tapenade, leather, dry and powerful, very black olive, meaty. This improved considerably the longer it was open. A terrific bottle of 1981, 96+ and improving. This may well be my favorite vintage of Beau except for perhaps the 1990.

wine/name: 1995 Château Pichon-Longueville Baron
wine/wineId: 3495
wine/variant: Red Bordeaux Blend
wine/year: 1995
review/points: 93
review/time: 1063929600
review/userId: 1
review/userName: Eric
review/text: A remarkably floral nose with violet and chambord. On the palate this is super sweet and pure with a long, somewhat searing finish. My notes are very terse, but this was a lovely wine.
""".split('\n')[1:-1]]
Here's a fairly idiomatic way to do it in R:
library(readr)
library(tidyr)
library(dplyr)

out <- data_frame(text = read_lines(the_text)) %>%
  filter(text != "") %>%
  separate(text, c("var", "value"), ":", extra = "merge") %>%
  mutate(
    chunk_id = rep(1:(nrow(.) / 9), each = 9),
    value = trimws(value)
  ) %>%
  spread(var, value)
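For comparison, the same long-to-wide idea can be sketched in pandas: split each line into a variable and a value, number the records, then pivot. Like the R code, this assumes every record has exactly the same number of fields (3 in this cut-down sample, 9 in the real data). The variable names are illustrative, and chunk_id starts at 0 here rather than 1.

```python
import pandas as pd

lines = [
    "wine/name: 1981 Château de Beaucastel Châteauneuf-du-Pape",
    "wine/year: 1981",
    "review/points: 96",
    "",
    "wine/name: 1995 Château Pichon-Longueville Baron",
    "wine/year: 1995",
    "review/points: 93",
]

# Drop blank lines, like filter(text != "").
text = pd.Series(lines)
text = text[text != ""].reset_index(drop=True)

# Split on the first colon only, like separate(..., extra = "merge").
parts = text.str.split(":", n=1, expand=True)
long = pd.DataFrame({
    "var": parts[0].str.strip(),
    "value": parts[1].str.strip(),
})

# Assumption: a fixed number of fields per record (the R code uses 9).
fields_per_record = 3
long["chunk_id"] = long.index // fields_per_record

# One row per record, one column per variable, like spread(var, value).
wide = long.pivot(index="chunk_id", columns="var", values="value")
```

As with `spread`, the resulting columns come out in alphabetical order, and all values are still strings.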
Here's the approach I would suggest:
y <- readLines("your_file")
y <- unlist(strsplit(gsub("(wine\\/|review\\/)", "~~~\\1", y), "~~~", TRUE))
library(data.table)
dcast(fread(paste0(y[y != ""], collapse = "\n"), header = FALSE)[
  , rn := cumsum(V1 == "wine/name")], rn ~ V1, value.var = "V2")
The only assumption is that the first line for each new wine starts with wine/name. Blank lines and the like don't matter.
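The same two steps can be sketched in Python: first break apart lines that hold more than one field, then start a new record whenever wine/name appears. It shares the limitation of the gsub pattern, namely that a review text containing "wine/" or "review/" would be split as well. The function names split_fields and group_records are hypothetical.

```python
import re


def split_fields(lines):
    """Yield 'key: value' pieces, splitting lines that hold more than one field."""
    for line in lines:
        # Mark every field start, then split on the marker (mirrors the gsub step).
        marked = re.sub(r"(wine/|review/)", r"~~~\1", line)
        for piece in marked.split("~~~"):
            piece = piece.strip()
            if piece:  # blank lines and empty fragments are ignored
                yield piece


def group_records(lines):
    """Start a new dict at every wine/name field (mirrors the cumsum step)."""
    records = []
    for piece in split_fields(lines):
        key, _, value = piece.partition(":")
        key = key.strip()
        if key == "wine/name" or not records:
            records.append({})
        records[-1][key] = value.strip()
    return records


lines = [
    "wine/name: 1981 Château de Beaucastel Châteauneuf-du-Pape",
    "wine/wineId: 18856",
    "wine/name: 1995 Château Pichon-Longueville Baron",
    "wine/wineId: 3495 wine/variant: Red Bordeaux Blend",
    "wine/year: 1995",
]
records = group_records(lines)
```

The resulting list of dicts can be passed to `pd.DataFrame`; a field that is missing for one wine simply ends up as a missing value in that row.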
Here are two datasets for you to try.
Replace "your_file" in the first line of code with url1 or url2 to try it out.
url1 <- "https://gist.githubusercontent.com/mrdwab/3db1f2d6bf75e9212d9e933ad18d2865/raw/7376ae59b201d57095f849cab079782efb8ac827/wines1.txt"
url2 <- "https://gist.githubusercontent.com/mrdwab/3db1f2d6bf75e9212d9e933ad18d2865/raw/7376ae59b201d57095f849cab079782efb8ac827/wines2.txt"
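Note that the second dataset is missing the wine/variant: value for the first wine.
It would probably be even better to do the gsub in awk or something similar and run fread directly on that.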