How to speed up searching/filtering in a Python dataframe formed using .txt files?

Question

I have multiple .txt files. I have imported and combined them to form a pandas dataframe using:

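import glob
import os

import numpy as np
import pandas as pd

all_files = glob.glob(os.path.join(path, "*.txt"))

np_array_list = []
for file in all_files:
    df = pd.read_table(file, index_col = None, header = 0)
    # as_matrix() was removed in pandas 1.0; on current versions use df.to_numpy()
    np_array_list.append(df.as_matrix())

comb_np_array = np.vstack(np_array_list)
big_frame = pd.DataFrame(comb_np_array)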

Importing about 20 files and forming one dataframe takes around 19 seconds. Is there a faster way?

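Secondly, once the dataframe is formed, it contains about 8 million rows. I need to filter the rows using a condition on the values in the 5th column:

"values of length 12 that start with '26'"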

I achieved this with the following code.

big_frame.columns = ["One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight"]

big_frame['Five'] = big_frame['Five'].astype('str')

mask = (big_frame['Five'].str.len() == 12) & (big_frame['Five'].str.startswith('26'))

big_frame = big_frame.loc[mask]

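Filtering for all the values that meet my criteria takes forever. I verified the code with just one .txt file, and it completes all the processing in ~3 seconds.

But I need to process all the files as quickly as possible. Is there a better way to do this?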

Answer 1

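When building the dataframe, you seem to be creating a dataframe, converting it to a matrix, and then converting it back. See whether using the pandas DataFrame.append function is faster (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html).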
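A minimal sketch of that idea, skipping the NumPy round trip and stacking the frames directly. It uses pd.concat rather than DataFrame.append, because append was removed in pandas 2.0. The helper name load_frames, the tab separator, and the assumption that every file has one header row and the same eight columns are illustrative choices, not something the question confirms.

```python
import glob
import os

import pandas as pd

# Column names taken from the question; assumed identical for every file.
COLUMNS = ["One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight"]


def load_frames(path, pattern="*.txt"):
    """Read every matching file and stack the results into one DataFrame.

    Hypothetical helper: assumes tab-separated files with one header row.
    """
    all_files = sorted(glob.glob(os.path.join(path, pattern)))
    # header=0 together with names= replaces each file's own header row.
    frames = [
        pd.read_csv(f, sep="\t", header=0, names=COLUMNS) for f in all_files
    ]
    # One concat at the end; raises ValueError if no files matched.
    return pd.concat(frames, ignore_index=True)
```

Collecting the frames in a list and concatenating once avoids copying the accumulated data on every iteration. Whether it beats the 19 seconds from the question depends on the files, so it is worth timing on the real data.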

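On the second topic: would it help to create a length column and a first-two-characters column, and then filter on those? Also, is one condition much more selective than the other? This sounds more like a memory management problem than a computational one.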
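The helper-column suggestion could look roughly like the sketch below. The function names and the "_len" / "_prefix" column suffixes are made up for illustration, and the default length 12 and prefix '26' come from the question.

```python
import pandas as pd


def add_helper_columns(df, column="Five"):
    """Return a copy of df with the length and two-character prefix of `column`."""
    text = df[column].astype(str)  # convert once, reuse for both helpers
    out = df.copy()
    out[column + "_len"] = text.str.len()
    out[column + "_prefix"] = text.str[:2]
    return out


def filter_by_helpers(df, column="Five", length=12, prefix="26"):
    """Keep rows whose helper columns match the wanted length and prefix."""
    mask = (df[column + "_len"] == length) & (df[column + "_prefix"] == prefix)
    return df[mask]
```

For example, filter_by_helpers(add_helper_columns(big_frame)) would apply the question's criteria. The helper columns cost extra memory, so they probably only pay off if the same frame is filtered more than once.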

Answer 2

One possible solution is to filter each file first and then concat the results together, but performance depends on the actual data:

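import glob
import os

import pandas as pd

all_files = glob.glob(os.path.join(path, "*.txt"))

dfs = []
for file in all_files:
    # read_table matches the question's tab-separated input (read_csv defaults to commas)
    df = pd.read_table(file, index_col = None, header = 0)
    df.columns = ["One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight"]
    # cast to str first, as in the question, so the .str accessor also works on numeric data
    five = df['Five'].astype(str)
    mask = (five.str.len() == 12) & (five.str.startswith('26'))
    dfs.append(df[mask])

big_frame = pd.concat(dfs, ignore_index=True)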
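If memory turns out to be the bottleneck, the same filter-first idea can be pushed one step further by reading each file in chunks. This is only a sketch under assumptions: the helper name filter_file_in_chunks and the chunk size are hypothetical, the files are assumed to be tab-separated with one header row, and column Five is read as text so no later cast is needed.

```python
import pandas as pd

# Column names taken from the question.
COLUMNS = ["One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight"]


def filter_file_in_chunks(file, chunksize=500000):
    """Read one tab-separated file in chunks and keep only the matching rows."""
    reader = pd.read_csv(
        file,
        sep="\t",
        header=0,
        names=COLUMNS,
        dtype={"Five": str},  # parse the filter column as text up front
        chunksize=chunksize,  # makes read_csv return an iterator of DataFrames
    )
    kept = []
    for chunk in reader:
        five = chunk["Five"]
        # na=False treats missing values as non-matching
        mask = (five.str.len() == 12) & five.str.startswith("26", na=False)
        kept.append(chunk[mask])
    if not kept:
        # no chunks were produced, e.g. the file held only a header row
        return pd.DataFrame(columns=COLUMNS)
    return pd.concat(kept, ignore_index=True)
```

The per-file results can then be combined with pd.concat exactly as in the loop above. Only the rows that pass the filter stay in memory, which matters more as the share of matching rows gets smaller.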

python

numpy

python-import

python-2.7

pandas