Why can't I assign a value using loc when the DataFrame has mixed data types, i.e. some columns hold strings and others hold numbers?

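I am using a pandas DataFrame in Python 3.6 to index files and their properties. My initial solution stores the file name in the first column of the DataFrame and numeric properties in the other columns.

When I loop over the files collecting the properties and try to assign the values to the corresponding columns of the DataFrame, the values are not stored correctly.

After several attempts I ended up with code that works, but I don't understand why the initial solution does not.

Can anyone offer an explanation or, better, a way to assign values to elements of a DataFrame that does not trigger a warning? (I know how to turn the warning off in this case, but I would rather not.)

The problem is illustrated in the code below. I get the same result if the DataFrame is created differently and the string-valued column sits in a different position, e.g. as the second or third column.

I have not tried other data types such as bool, but I assume the problem is general to DataFrames with mixed data types.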

#!/usr/bin/python3

# Import standard libraries
import pandas as pd
import numpy as np

# constants used as label for harmonization with the HDF5 ontology used
ROW_LENGTH = 11
COL1 = 'x1'
COL2 = 'x2'
COL3 = 'x3'

def _main():

    # Create a dataframe
    first_df = pd.DataFrame(columns=[COL1, COL2, COL3])
    first_df[COL1] = ["foo"]*ROW_LENGTH
    first_df[COL2] = [np.nan]*ROW_LENGTH
    first_df[COL3] = [np.nan]*ROW_LENGTH

    # Go around assigning data
    for row in range(ROW_LENGTH):
        first_df[COL1][row] = "{}".format(row)
        first_df[COL2][row] = row*2 # Although it gives warning, it works
        first_df.loc[row][COL3] = row*3 # And this, which should work, doesn't

    print("Although no data was not stored on the third column using: first_df.loc[row][COL3]")
    print(first_df.head())
    print("\n...I can retrieve the data like: first_df[COL2][5] = '{}'".format(first_df[COL2][3]))
    print("... or like that: first_df.loc[5][COL2] = '{}'".format(first_df.loc[3][COL2]))

    # If the first column is numeric...
    second_df = pd.DataFrame(columns=[COL1, COL2, COL3])
    second_df[COL1] = [0.0]*ROW_LENGTH
    second_df[COL2] = [0.0]*ROW_LENGTH
    second_df[COL3] = [0.0]*ROW_LENGTH

    # Go around assigning data
    for row in range(ROW_LENGTH):
        second_df[COL1][row] = row*1.0
        second_df[COL2][row] = row*2.0
        second_df.loc[row][COL3] = row*3.0

    print("\nNow if I use only numeric columns, everything works as expected:")
    print(second_df.head())

if __name__ == '__main__':
    _main()

The output is:

Although no data was not stored on the third column using: first_df.loc[row][COL3]
  x1   x2  x3
0  0  0.0 NaN
1  1  2.0 NaN
2  2  4.0 NaN
3  3  6.0 NaN
4  4  8.0 NaN

...I can retrieve the data like: first_df[COL2][5] = '6.0'
... or like that: first_df.loc[5][COL2] = '6.0'

Now if I use only numeric columns, everything works as expected:
    x1   x2    x3
0  0.0  0.0   0.0
1  1.0  2.0   3.0
2  2.0  4.0   6.0
3  3.0  6.0   9.0
4  4.0  8.0  12.0

The warning message is:

./test.py:24: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  first_df[COL2][row] = row*2 # Although it gives warning, it works

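This warning can be silenced with: pd.options.mode.chained_assignment = None

I think the code makes the expected result clear, but in short, I want to be able to access any element using the .loc method.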

Use first_df.loc[row, COL3] instead of first_df.loc[row][COL3].
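As a minimal sketch of this suggestion, the loop from the question can be written with one .loc call per cell, passing the row and column labels together. The frame below is a shortened, hypothetical stand-in for first_df (5 rows, literal column names), not the original script:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for first_df: one string column, two numeric columns.
df = pd.DataFrame({
    "x1": ["foo"] * 5,
    "x2": [np.nan] * 5,
    "x3": [np.nan] * 5,
})

for row in range(len(df)):
    # A single indexing call: the assignment targets df itself,
    # so no intermediate Series is involved.
    df.loc[row, "x1"] = str(row)
    df.loc[row, "x2"] = row * 2
    df.loc[row, "x3"] = row * 3

print(df)
```

For single scalar cells, DataFrame.at accepts the same [row, column] pair and could be used in the same way.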

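When you use first_df.loc[row][COL3], you first create a temporary Series with first_df.loc[row], then access and modify the value at COL3 in that Series, and finally the temporary Series is discarded. It is equivalent to: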

tmp = first_df.loc[row]
tmp[COL3] = row*3

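and tmp is never written back to the original DataFrame.

With mixed data types, a row cannot share memory with the DataFrame: the string and numeric values have to be gathered into a new object-dtype Series, so tmp is always a copy. When every column has the same dtype, first_df.loc[row] can be a view of the underlying data, which is most likely why the all-numeric case appeared to work.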
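The effect can be checked directly. In this small sketch (the frame name mixed and the value 99 are made up for illustration), the row is pulled out into tmp, modified, and the original frame is then inspected:

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame with one string column and one numeric column.
mixed = pd.DataFrame({"x1": ["foo", "bar"], "x3": [np.nan, np.nan]})

# Selecting a row of a mixed-dtype frame builds a new Series.
tmp = mixed.loc[0]

# Older pandas versions may emit SettingWithCopyWarning on this line.
tmp["x3"] = 99

print(tmp["x3"])           # the temporary Series holds the new value
print(mixed.loc[0, "x3"])  # the original frame still holds NaN
```

The write lands in tmp only, which is the same thing that happens silently inside first_df.loc[row][COL3] = value.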