IndexError when replacing missing values with the mode using groupby in pandas
I have a dataset that needs missing-value treatment.
Column                  Missing Values
Complaint_ID            0
Date_received           0
Transaction_Type        0
Complaint_reason        0
Company_response        22506
Date_sent_to_company    0
Complaint_Status        0
Consumer_disputes       7698
The problem is that I get an error when I try to replace the missing values with the mode of other columns using groupby.
Code:
data11["Company_response"] = data11.groupby("Complaint_reason").transform(lambda x: x.fillna(x.mode()[0]))["Company_response"]
data11["Consumer_disputes"] = data11.groupby("Transaction_Type").transform(lambda x: x.fillna(x.mode()[0]))["Consumer_disputes"]
I get the following error:
Stack trace:
Traceback (most recent call last):
File "<ipython-input-89-8de6a010a299>", line 1, in <module>
data11["Company_response"] = data11.groupby("Complaint_reason").transform(lambda x: x.fillna(x.mode()[0]))["Company_response"]
File "C:\Anaconda3\lib\site-packages\pandas\core\groupby.py", line 3741, in transform
return self._transform_general(func, *args, **kwargs)
File "C:\Anaconda3\lib\site-packages\pandas\core\groupby.py", line 3699, in _transform_general
res = path(group)
File "C:\Anaconda3\lib\site-packages\pandas\core\groupby.py", line 3783, in <lambda>
lambda x: func(x, *args, **kwargs), axis=self.axis)
File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 4360, in apply
ignore_failures=ignore_failures)
File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 4456, in _apply_standard
results[i] = func(v)
File "C:\Anaconda3\lib\site-packages\pandas\core\groupby.py", line 3783, in <lambda>
lambda x: func(x, *args, **kwargs), axis=self.axis)
File "<ipython-input-89-8de6a010a299>", line 1, in <lambda>
data11["Company_response"] = data11.groupby("Complaint_reason").transform(lambda x: x.fillna(x.mode()[0]))["Company_response"]
File "C:\Anaconda3\lib\site-packages\pandas\core\series.py", line 601, in __getitem__
result = self.index.get_value(self, key)
File "C:\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2434, in get_value
return libts.get_value_box(s, key)
File "pandas\_libs\tslib.pyx", line 923, in pandas._libs.tslib.get_value_box (pandas\_libs\tslib.c:18843)
File "pandas\_libs\tslib.pyx", line 939, in pandas._libs.tslib.get_value_box (pandas\_libs\tslib.c:18560)
IndexError: ('index out of bounds', 'occurred at index Consumer_disputes')
I checked the length of the dataframe and of all its columns, and it is the same everywhere: 43266.
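I also found a similar question, but it has no correct answer: Click here
Please help me resolve this error:
IndexError: ('index out of bounds', 'occurred at index Consumer_disputes')
In case it helps, here is a snapshot of the dataset: Dataset Snapshot
I was able to run the code below successfully. It does fill the missing values, but with the mode of the whole column rather than the per-group mode, so it does not quite serve my purpose.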
data11['Company_response'].fillna(data11['Company_response'].mode()[0], inplace=True)
data11['Consumer_disputes'].fillna(data11['Consumer_disputes'].mode()[0], inplace=True)
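Edit 1: (Adding a sample)
Given input:
Expected output:
You can see that the missing Company_response values for Tr-1 and Tr-3 are filled with the mode within their Complaint_reason group.
Similarly, for Tr-5, Consumer_disputes is filled with the mode within its Transaction_Type group.
The snippet below contains the dataframe and the code, for anyone who wants to reproduce the problem.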
import pandas as pd
import numpy as np
data11=pd.DataFrame({'Complaint_ID':['Tr-1','Tr-2','Tr-3','Tr-4','Tr-5','Tr-6'],
'Transaction_Type':['Mortgage','Credit card','Bank account or service','Debt collection','Credit card','Mortgage'],
'Complaint_reason':['Loan servicing, payments, escrow account','Incorrect information on credit report',"Cont'd attempts collect debt not owed","Cont'd attempts collect debt not owed",'Payoff process','Loan servicing, payments, escrow account'],
'Company_response':[np.nan,'Company chooses not to provide a public response',np.nan,'Company believes it acted appropriately as authorized by contract or law','Company has responded to the consumer and the CFPB and chooses not to provide a public response','Company disputes the facts presented in the complaint'],
'Consumer_disputes':['Yes','No','No','No',np.nan,'Yes']})
data11.isnull().sum()
data11["Company_response"] = data11.groupby("Complaint_reason").transform(lambda x: x.fillna(x.mode()[0]))["Company_response"]
data11["Consumer_disputes"] = data11.groupby("Transaction_Type").transform(lambda x: x.fillna(x.mode()[0]))["Consumer_disputes"]
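The error occurs because, for at least one group, one of the columns being transformed contains only np.nan values. In that case mode() returns an empty Series (for example, pd.Series([np.nan]).mode() is empty), and taking its first element with [0] raises the error.
So you can use transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else "Empty")).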
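The guard described above can be sketched as a small helper. This is a minimal illustration, not the answerer's exact code: the name `fill_with_mode` and its `default` parameter are assumptions introduced here. It uses `.iloc[0]` for a positional lookup, and it leaves an all-NaN Series untouched unless a default is supplied.

```python
import numpy as np
import pandas as pd


def fill_with_mode(s, default=None):
    """Fill NaN in `s` with its most frequent value (hypothetical helper).

    If `s` has no non-missing values, its mode is empty; in that case the
    Series is returned unchanged, or filled with `default` if one is given.
    """
    modes = s.mode()  # NaN is ignored by default, so all-NaN input gives an empty result
    if modes.empty:
        return s if default is None else s.fillna(default)
    return s.fillna(modes.iloc[0])


all_nan = pd.Series([np.nan, np.nan], dtype=object)
mixed = pd.Series(["Yes", np.nan, "Yes", "No"])

print(all_nan.mode().empty)                        # an all-NaN Series has no mode
print(fill_with_mode(mixed).tolist())              # NaN replaced by the most frequent value
print(fill_with_mode(all_nan, "Empty").tolist())   # fallback used when no mode exists
```

Passing a named function to `transform` instead of a lambda also avoids computing `x.mode()` twice per group.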
Try:
data11["Company_response"] = data11.groupby("Complaint_reason")['Company_response'].transform(lambda x: x.fillna(x.mode()[0]))
data11["Consumer_disputes"] = data11.groupby("Transaction_Type")['Consumer_disputes'].transform(lambda x: x.fillna(x.mode()[0]))
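Selecting the column before calling `transform` means the function runs only on that column within each group, rather than on every non-key column of the frame. The sketch below illustrates this on a shortened stand-in for the question's sample frame. The abbreviated labels and the helper name `fill_group_mode` are assumptions made for readability, and the helper keeps an empty-mode guard so that a group with no observed values is left as NaN instead of raising.

```python
import numpy as np
import pandas as pd

# Shortened stand-in for the question's sample frame (labels abbreviated).
df = pd.DataFrame({
    "Transaction_Type": ["Mortgage", "Credit card", "Bank account",
                         "Debt collection", "Credit card", "Mortgage"],
    "Complaint_reason": ["Loan servicing", "Incorrect info", "Debt not owed",
                         "Debt not owed", "Payoff process", "Loan servicing"],
    "Company_response": [np.nan, "No public response", np.nan,
                         "Acted appropriately", "Responded", "Disputes the facts"],
    "Consumer_disputes": ["Yes", "No", "No", "No", np.nan, "Yes"],
})


def fill_group_mode(s):
    # Hypothetical helper: fill NaN with the group's mode, if the group has one.
    modes = s.mode()
    return s.fillna(modes.iloc[0]) if not modes.empty else s


# Each transform sees only the selected column, one group at a time.
df["Company_response"] = (
    df.groupby("Complaint_reason")["Company_response"].transform(fill_group_mode)
)
df["Consumer_disputes"] = (
    df.groupby("Transaction_Type")["Consumer_disputes"].transform(fill_group_mode)
)

print(df[["Company_response", "Consumer_disputes"]])
```

On this sample, rows 0 and 2 of Company_response take the other value observed under the same Complaint_reason, and row 4 of Consumer_disputes takes the value observed for the other "Credit card" row.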
@Mikhail Berlinkov is almost certainly correct. I was able to reproduce your error and then avoid it by using dropna():
data11.groupby("Transaction_Type").transform(
    lambda x: x.fillna(x.mode()[0]))["Consumer_disputes"]
# Returns IndexError

data11.dropna().groupby("Transaction_Type").transform(
    lambda x: x.fillna(x.mode()[0]))["Consumer_disputes"]
# Works
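Note that dropna() avoids the error by discarding every row that has a missing value, so there is nothing left to fill. To confirm the diagnosis on the real data instead, one option is to list the groups in which a column is entirely missing. The sketch below assumes a small stand-in frame with abbreviated labels; only the column names come from the question.

```python
import numpy as np
import pandas as pd

# Small stand-in frame; the values are illustrative, not the real dataset.
df = pd.DataFrame({
    "Transaction_Type": ["Mortgage", "Credit card", "Bank account",
                         "Credit card", "Mortgage"],
    "Company_response": [np.nan, "No public response", np.nan,
                         "Responded", "Disputes the facts"],
})

# For each Transaction_Type, is every Company_response value missing?
all_missing = df["Company_response"].isna().groupby(df["Transaction_Type"]).all()

# Groups with no observed value have no mode, so mode()[0] fails for them.
groups_without_mode = all_missing[all_missing].index.tolist()
print(groups_without_mode)

# dropna() shrinks the frame, so its result cannot simply be assigned back.
print(len(df), len(df.dropna()))
```

Any group reported here needs a fallback value, or has to be left as NaN, because no per-group mode exists for it.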