Load pandas dataframe with chunksize determined by column variable
If my csv file is too large to load into memory with pandas (35 GB in this case), I know I can process the file in chunks using chunksize.
However, I want to know whether it is possible to change the chunk size based on values in a column.
I have an ID column, and each ID has several rows of information, like this:
ID, Time, x, y
sasd, 10:12, 1, 3
sasd, 10:14, 1, 4
sasd, 10:32, 1, 2
cgfb, 10:02, 1, 6
cgfb, 10:13, 1, 3
aenr, 11:54, 2, 5
tory, 10:27, 1, 3
tory, 10:48, 3, 5
etc...
I don't want the rows for one ID to be split across different chunks. For example, chunks of size 4 would be processed as:
ID, Time, x, y
sasd, 10:12, 1, 3
sasd, 10:14, 1, 4
sasd, 10:32, 1, 2
cgfb, 10:02, 1, 6
cgfb, 10:13, 1, 3 <-- this extra row is included so the chunk of 4 doesn't split the ID
ID, Time, x, y
aenr, 11:54, 2, 5
tory, 10:27, 1, 3
tory, 10:48, 3, 5
...
Is this possible?
If not, perhaps the csv library with a for loop could be used:
x = 0
curid = None
for line in file:
    x += 1
    if x > 1000000 and curid != line[0]:
        break
    curid = line[0]
    # code to append the line to a dataframe
Although I know this would only create one chunk, and the for loop would take a very long time to run.
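For reference, the csv-library route sketched above can be made memory-friendly with itertools.groupby, which groups consecutive rows sharing an ID. A minimal sketch, assuming the file has a header row and rows with the same ID are contiguous as in the sample (the file name and the processing step are placeholders):

import csv
from itertools import groupby
from operator import itemgetter

with open("data.csv", newline="") as f:  # placeholder file name
    reader = csv.reader(f)
    header = next(reader)  # skip the "ID, Time, x, y" header row
    # groupby yields one group per run of rows sharing the same ID (column 0)
    for current_id, rows in groupby(reader, key=itemgetter(0)):
        block = list(rows)  # only one ID's rows are held in memory at a time
        # placeholder: process block, e.g. build a small dataframe from it

Because groupby only materializes one run at a time, memory use is bounded by the largest single-ID group; groups could then be accumulated until a target chunk size is reached, closing each chunk only at an ID boundary.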
If you iterate through the csv file line by line, you can yield the chunks with a generator that depends on any column.
Working example:
import pandas as pd

def iter_chunk_by_id(file):
    csv_reader = pd.read_csv(file, iterator=True, chunksize=1, header=None)
    first_chunk = csv_reader.get_chunk()
    current_id = first_chunk.iloc[0, 0]
    chunk = pd.DataFrame(first_chunk)
    for row in csv_reader:
        if current_id == row.iloc[0, 0]:
            # same ID: keep accumulating rows into the current chunk
            chunk = pd.concat([chunk, row])
            continue
        # the ID changed: emit the finished chunk and start a new one
        current_id = row.iloc[0, 0]
        yield chunk
        chunk = pd.DataFrame(row)
    yield chunk
## data.csv ##
# 1, foo, bla
# 1, off, aff
# 2, roo, laa
# 2, jkl, xds
# 3, asd, fds
# 3, qwe, tre
# 3, tre, yxc
chunk_iter = iter_chunk_by_id("data.csv")
for chunk in chunk_iter:
    print(chunk)
    print("_____")
Output:
   0     1     2
0  1   foo   bla
1  1   off   aff
_____
   0     1     2
2  2   roo   laa
3  2   jkl   xds
_____
   0     1     2
4  3   asd   fds
5  3   qwe   tre
6  3   tre   yxc
_____
I built arbitrary chunk sizes on top of the answer provided by @elcombato. I actually had a similar use case, and processing row by row made my program unbearably slow.
import numpy as np
import pandas as pd

def iter_chunk_by_id(file_name, chunk_size=10000):
    """Generator to read the csv in chunks of user_id records.
    Each next() call on the generator yields a df for one user."""
    csv_reader = pd.read_csv(file_name, compression='gzip', iterator=True,
                             chunksize=chunk_size, header=0, on_bad_lines='skip')
    chunk = pd.DataFrame()
    for part in csv_reader:
        # the id is the first '|'-separated field of 'col_name'
        part[['id', 'everything_else']] = part['col_name'].str.split('|', n=1, expand=True)
        # row positions (within this part) where the id differs from the previous row
        hits = np.flatnonzero(part['id'].astype(float).diff().dropna())
        if not len(hits):
            # all ids in this part are the same: keep accumulating
            chunk = pd.concat([chunk, part[['col_name']]])
        else:
            start = 0
            for hit in hits:
                new_id = hit + 1  # index of the first row of the next id
                chunk = pd.concat([chunk, part[['col_name']].iloc[start:new_id, :]])
                yield chunk
                chunk = pd.DataFrame()
                start = new_id
            chunk = part[['col_name']].iloc[start:, :]
    yield chunk
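A usage sketch for the generator above (the file name and process() are placeholders; it assumes a gzipped csv with a 'col_name' column whose first '|'-separated field is a numeric id):

for user_chunk in iter_chunk_by_id("events.csv.gz", chunk_size=50000):  # placeholder path
    process(user_chunk)  # placeholder for the per-user work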