Speed up Python loop with DataFrame / BigQuery
This loop currently takes almost 3 hours to run on my desktop at 5 GHz (OC). How would I go about speeding it up?
df = pd.DataFrame(columns=['clientId', 'url', 'count'])
idx = 0
for row in rows:
    df.loc[idx] = pd.Series({'clientId': row.clientId, 'url': row.pagePath, 'count': row.count})
    idx += 1
rows is JSON data stored in a (BigQuery) RowIterator.
<google.cloud.bigquery.table.RowIterator object at 0x000001ADD93E7B50>
<class 'google.cloud.bigquery.table.RowIterator'>
The JSON data looks like this:
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-us/index.html', 45), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-us/contact.html', 65), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-au/index.html', 64), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-au/products.html', 56), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-us/employees.html', 54), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-us/contact/cookies.html', 44), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-au/careers.html', 91), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-ca/careers.html', 42), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-us/contact.html', 44), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/', 115), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/suppliers', 51), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-us/search.html', 60), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-au/careers.html', 50), {'clientId': 0, 'pagePath': 1, 'count': 2})
I ran across the to_dataframe() method in BigQuery. Extremely fast. It cut 3 hours down to 3 seconds.
df = query_job.result().to_dataframe()
google.cloud.bigquery.table.RowIterator
Downloading BigQuery data to pandas using the BigQuery Storage API
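For reference, a minimal sketch of the full flow this implies; the client setup and the SQL string here are placeholders, not from the original post:

from google.cloud import bigquery

# Placeholder query; the actual SQL is not shown in the post.
client = bigquery.Client()
query_job = client.query(
    "SELECT clientId, pagePath, count FROM `project.dataset.table`"
)

# to_dataframe() materializes the result set in bulk instead of row by row;
# if the google-cloud-bigquery-storage package is installed, recent client
# versions use the BigQuery Storage API to download the rows in parallel.
df = query_job.result().to_dataframe()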
That's not how you're supposed to use pandas DataFrames. A DataFrame represents data vertically, meaning each column is a Series under the hood, which uses a fixed-size numpy array (although columns of the same data type have their arrays stored adjacent to one another).
Every time you append a new row to the DataFrame, every column's array is resized (i.e., reallocated), and that in itself is expensive. You do this for every row, meaning you carry out n iterations of array reallocation for every column of each unique data type, which is terribly inefficient. On top of that, you also create a pd.Series for each row, which causes even more allocations and serves no purpose given that the DataFrame represents data vertically.
You can verify this by looking at the id of the columns:
>>> import pandas as pd
>>> df = pd.DataFrame(columns=['clientId', 'url', 'count'])
# Look at the ID of the DataFrame and the columns
>>> id(df)
1494628715776
# These are the IDs of the empty Series for each column
>>> id(df['clientId']), id(df['url']), id(df['count'])
(1494628789264, 1494630670400, 1494630670640)
# Assigning a series at an index that didn't exist before
>>> df.loc[0] = pd.Series({'clientId': 123, 'url': 123, 'count': 100})
# ID of the dataframe remains the same
>>> id(df)
1494628715776
# However, the underlying Series objects are different (newly allocated)
>>> id(df['clientId']), id(df['url']), id(df['count'])
(1494630712656, 1494630712176, 1494630712272)
By adding new rows iteratively, you re-create new Series objects on every iteration, and that is why it is slow. The pandas docs warn about this under the .append() method (the method is deprecated, but the argument still holds): https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html#pandas.DataFrame.append
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
You're better off iterating and appending into a data structure better suited to dynamic-sized operations, such as a native Python list, before calling pd.DataFrame on it (a list-based sketch follows the one-liner below). For a simple case like this, though, you can pass a generator straight into the pd.DataFrame call:
# No need to specify columns since you provided the dictionary with the keys
df = pd.DataFrame({'clientId': row.clientId, 'url': row.pagePath, 'count': row.count} for row in rows)
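And a minimal sketch of the list-based version the docs recommend, assuming the same rows iterable as above (not shown in the original answer):

# Grow a plain Python list, which reallocates cheaply and amortized,
# then build the DataFrame once from the collected records.
records = []
for row in rows:
    records.append({'clientId': row.clientId, 'url': row.pagePath, 'count': row.count})
df = pd.DataFrame(records)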
Demonstrating the difference in a Jupyter notebook:
def reallocating_way(rows):
    df = pd.DataFrame(columns=['clientId', 'url', 'count'])
    for idx, row in enumerate(rows):
        df.loc[idx] = pd.Series({'clientId': row.clientId, 'url': row.pagePath, 'count': row.count})
    return df

def better_way(rows):
    return pd.DataFrame({'clientId': row.clientId, 'url': row.pagePath, 'count': row.count} for row in rows)
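# Row below is a hypothetical stub, not shown in the original answer:
# a minimal object exposing the three attributes the benchmark
# functions access (clientId, pagePath, count).
class Row:
    def __init__(self):
        self.clientId = 'xxxxxxxxxx.xxxxxxxxxx'
        self.pagePath = '/en-us/index.html'
        self.count = 45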
# Making an arbitrary list of 1000 rows
rows = [Row() for _ in range(1000)]
%timeit reallocating_way(rows)
%timeit better_way(rows)
2.45 s ± 118 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.8 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# Making an arbitrary list of 10000 rows
rows = [Row() for _ in range(10000)]
%timeit reallocating_way(rows)
%timeit better_way(rows)
27.3 s ± 1.88 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
12.4 ms ± 142 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
More than 1000x faster for 1000 rows, and more than 2000x faster for 10000 rows.