Reading bigint (int8) column data from Redshift without scientific notation using Pandas

Question

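I am reading data from Redshift using Pandas. I have a bigint (int8) column that comes back in exponential (scientific) notation. I tried the approaches below, but they all truncate the data.

A sample value from that column is 635284328055690862. It is read as 6.352843e+17.

I tried to convert it to int64 in Python.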

import numpy as np
df["column_name"] = df["column_name"].astype(np.int64)

The output in this case is 635284328055690880. I am losing data here: the trailing digits are rounded off.

Expected output: 635284328055690862

I get the same result even if I do this.

pd.set_option('display.float_format', lambda x: '%.0f' % x)

Output: 635284328055690880

Expected output: 635284328055690862

This seems to be normal Pandas behavior. I even tried creating a DataFrame from a list and still got the same result.

import pandas as pd
import numpy as np

pd.set_option('display.float_format', lambda x: '%.0f' % x)
sample_data = [[635284328055690862, 758364950923147626], [np.nan, np.nan], [1, 3]]  # np.NaN was removed in NumPy 2.0
df = pd.DataFrame(sample_data)


Output:
0 635284328055690880 758364950923147648
1                nan                nan
2                  1                  3

I noticed that we run into this problem whenever the DataFrame contains nan.
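
The effect of a missing value on the inferred dtype can be checked directly. The following is a minimal sketch that reuses the sample value from the question; the names `no_nan` and `with_nan` are illustrative only.

```python
import numpy as np
import pandas as pd

big = 635284328055690862

# Without missing values, pandas infers int64 and the value stays exact.
no_nan = pd.DataFrame([[big], [1]])
# With a NaN in the column, pandas falls back to float64.
with_nan = pd.DataFrame([[big], [np.nan], [1]])

print(no_nan[0].dtype)       # int64
print(with_nan[0].dtype)     # float64
print(int(no_nan[0][0]))     # 635284328055690862
print(int(with_nan[0][0]))   # 635284328055690880
```

The value is already changed when the column is built as float64, which is why a later `astype(np.int64)` or a display option cannot bring the original digits back.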

I am using the code below to fetch the data from Redshift.

from sqlalchemy import create_engine 
import pandas as pd  
connstr = 'redshift+psycopg2://<username>:<password>@<cluster_name>/<db_name>' 
engine = create_engine(connstr) 
with engine.connect() as conn, conn.begin():     
    df = pd.read_sql('''select * from schema.table_name''', conn)
print(df)

Please help me fix this. Thanks in advance.

Answer 1

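This happens because the standard integer dtypes provide no way to represent missing data. Since the floating-point dtypes do provide nan, the old way of handling this was to convert numeric columns with missing data to float.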
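
Why the float conversion changes the value can be seen from the precision of float64 itself. This is a small sketch using the sample value from the question: float64 has a 53-bit significand, so at this magnitude neighbouring representable values are 128 apart.

```python
import numpy as np

value = 635284328055690862

# Integers above 2**53 are not all exactly representable in float64.
print(value > 2**53)                      # True
# Gap between neighbouring float64 values at this magnitude.
print(np.spacing(np.float64(value)))      # 128.0

as_float = float(value)
# The nearest representable float64 is a multiple of 128.
print(int(as_float))                      # 635284328055690880
# Casting back to int64 keeps the rounded value, not the original.
print(int(np.int64(as_float)) == value)   # False
```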

To correct this, pandas introduced the nullable integer data type. If you are doing something as simple as reading a CSV, you can specify this type explicitly in the call to read_csv, like so:

>>> pandas.read_csv('sample.csv', dtype="Int64")
             column_a  column_b
0  635284328055690880     45564
1                <NA>        45
2                   1      <NA>
3                   1         5

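But the problem remains! Even though 635284328055690862 can be represented as a 64-bit integer, at some point pandas still passes the value through a floating-point conversion step, which changes it. This is odd, and may even be worth raising with the pandas developers.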
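
For comparison, building the nullable array directly from Python ints does keep the exact value, which suggests the loss is specific to the parsing path rather than to the Int64 dtype. A minimal sketch, assuming the values are already available as Python `int` objects:

```python
import pandas as pd

# Python ints go straight into the nullable Int64 array; pd.NA marks the gap.
arr = pd.array([635284328055690862, pd.NA, 1], dtype="Int64")
s = pd.Series(arr, name="column_a")

print(s.dtype)             # Int64
print(int(s[0]))           # 635284328055690862
print(s.isna().tolist())   # [False, True, False]
```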

The best workaround I see in this case is to use the "object" dtype, like so:

>>> pandas.read_csv('sample.csv', dtype="object")
             column_a column_b
0  635284328055690862    45564
1                 NaN       45
2                   1      NaN
3                   1        5

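This preserves the exact value of the large integer and also allows NaN values. However, because these are now arrays of Python objects, performance will take a significant hit for compute-intensive tasks. Also, on closer inspection, these turn out to be Python str objects, so we still need another conversion step. To my surprise, there is no straightforward way to do it. This is the best I could come up with: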

def col_to_intNA(col):
    return {ix: pandas.NA if pandas.isnull(v) else int(v)
            for ix, v in col.to_dict().items()}

sample = {col: col_to_intNA(sample[col])
          for col in sample.columns}
sample = pandas.DataFrame(sample, dtype="Int64")

This gives the desired result:

>>> sample
             column_a  column_b
0  635284328055690862     45564
1                <NA>        45
2                   1      <NA>
3                   1         5
>>> sample.dtypes
column_a    Int64
column_b    Int64
dtype: object
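
The same conversion can also be written per column with `pandas.array`. This is only a sketch of an alternative, not the answer's code; the helper name `to_nullable_int` and the inline sample data are made up for illustration.

```python
import pandas as pd

def to_nullable_int(col: pd.Series) -> pd.Series:
    """Convert a column of digit strings and missing values to exact Int64."""
    # int() parses the string directly, so no float is involved.
    values = [pd.NA if pd.isna(v) else int(v) for v in col]
    return pd.Series(pd.array(values, dtype="Int64"), index=col.index, name=col.name)

# Stand-in for the frame read with dtype="object".
raw = pd.DataFrame(
    {"column_a": ["635284328055690862", None, "1", "1"],
     "column_b": ["45564", "45", None, "5"]},
    dtype="object",
)
sample = pd.DataFrame({c: to_nullable_int(raw[c]) for c in raw.columns})

print(sample)
print(sample.dtypes)
```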

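So that solves one problem. But a second problem arises: to read from a Redshift database you would normally use read_sql, which does not provide any way to specify data types.

So we'll roll our own! This is based on the code you posted and on some code from the pandas_redshift library. It uses psycopg2 directly, rather than using sqlalchemy, because I am not sure sqlalchemy provides a cursor_factory parameter that accepts a RealDictCursor. Warning: I have not tested this at all, because I'm too lazy to set up a postgres database just to test a Stack Overflow answer! I think it should work, but I'm not sure. Please let me know whether it works and/or what needs correcting.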

import psycopg2
from psycopg2.extras import RealDictCursor  # Turn rows into proper dicts.

import pandas

def row_null_to_NA(row):
    return {col: pandas.NA if pandas.isnull(val) else val
            for col, val in row.items()}

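# psycopg2 takes a libpq-style URI, not SQLAlchemy's 'redshift+psycopg2://' form.
connstr = 'postgresql://<username>:<password>@<cluster_name>/<db_name>'

# Connect outside the try block so `conn` is always defined in `finally`.
conn = psycopg2.connect(connstr, cursor_factory=RealDictCursor)
try:  # `with conn:` only closes the transaction, not the connection
    cursor = conn.cursor()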
    cursor.execute('''select * from schema.table_name''')

    # The DataFrame constructor accepts generators of dictionary rows.
    df = pandas.DataFrame(
        (row_null_to_NA(row) for row in cursor.fetchall()), 
        dtype="Int64"
    )
finally:
    conn.close()

print(df)

Note that this assumes all of your columns are integer columns. If they are not, you may need to load the data column by column.
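
One way the column-by-column idea could look is sketched below. It works on plain dict rows, so it runs without a database; the helper `rows_to_frame`, the `int_columns` argument, and the sample rows are all hypothetical and only imitate what a RealDictCursor would return.

```python
import pandas as pd

def rows_to_frame(rows, int_columns):
    """Build a DataFrame from dict rows, keeping the named columns as exact Int64."""
    columns = list(rows[0].keys()) if rows else []
    data = {}
    for col in columns:
        values = [row[col] for row in rows]
        if col in int_columns:
            # None (SQL NULL) becomes pd.NA; the ints never pass through float.
            values = pd.array(
                [pd.NA if v is None else int(v) for v in values], dtype="Int64"
            )
        data[col] = values  # other columns are left to normal pandas inference
    return pd.DataFrame(data)

# Hypothetical rows shaped like RealDictCursor output.
rows = [
    {"id": 635284328055690862, "name": "a", "score": 1.5},
    {"id": None, "name": "b", "score": None},
    {"id": 1, "name": None, "score": 3.0},
]
df = rows_to_frame(rows, int_columns={"id"})
print(df.dtypes)
```

Only the columns listed in `int_columns` are forced to Int64, so text and float columns keep their usual dtypes.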

Answer 2

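One possible fix is to avoid select * from schema.table_name. Instead, list all the columns individually and cast the specific column.

Suppose the table has 5 columns and col2 is the bigint (int8) column. You can then read it like this: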

from sqlalchemy import create_engine 
import pandas as pd  
connstr = 'redshift+psycopg2://<username>:<password>@<cluster_name>/<db_name>' 
engine = create_engine(connstr) 
with engine.connect() as conn, conn.begin():     
    df = pd.read_sql('''select col1, cast(col2 as int), col3, col4, col5... from schema.table_name''', conn)
print(df)

P.S.: I'm not sure this is the smartest solution, but logically, if Python cannot convert to int64 correctly, then we can read the already-cast value from SQL itself.
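
Note that INT in Redshift is a 32-bit type, so a value such as 635284328055690862 would not fit into the result of cast(col2 as int). A related variant, which is not what the answer above does, is to cast the column to a character type in SQL and convert the digit strings in Python. The sketch below uses an in-memory SQLite database purely as a runnable stand-in for Redshift (an assumption); on Redshift the cast target would presumably be varchar rather than text.

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for Redshift here (assumption, for runnability only).
conn = sqlite3.connect(":memory:")
conn.execute("create table table_name (col1 integer, col2 integer)")
conn.executemany(
    "insert into table_name values (?, ?)",
    [(1, 635284328055690862), (2, None), (3, 1)],
)

# Cast the bigint column to text in SQL so pandas never treats it as a number.
df = pd.read_sql(
    "select col1, cast(col2 as text) as col2 from table_name order by col1", conn
)
conn.close()

# Convert the digit strings to nullable Int64 on the Python side.
df["col2"] = pd.array(
    [pd.NA if pd.isna(v) else int(v) for v in df["col2"]], dtype="Int64"
)
print(df)
```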

Also, I would like to try casting the int columns dynamically when the value is longer than 17 digits.
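
A digit count is only an approximation of when precision is lost: 2**53 is 9007199254740992, which has 16 digits, so some 16-digit values are already affected. A more direct check is whether a value survives a float64 round trip. The helper name below is hypothetical and the snippet is only a sketch of that check.

```python
def survives_float64(value: int) -> bool:
    """Return True if the integer is unchanged by a float64 round trip."""
    return int(float(value)) == value

print(survives_float64(2**53))               # True
print(survives_float64(2**53 + 1))           # False
print(survives_float64(635284328055690862))  # False
print(survives_float64(635284328055690880))  # True
```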

Tags: python, precision, numpy, pandas, amazon-redshift