Feature Engineering in Python with Pandas Using Multiple Rows Per Calculation
I have CSV data in the following format:
+-----------------+--------+-------------+
| reservation_num | rate | guest_name |
+-----------------+--------+-------------+
| B874576 | 169.95 | Bob Smith |
| H786234 | 258.95 | Jane Doe |
| H786234 | 258.95 | John Doe |
| F987354 | 385.95 | David Jones |
| N097897 | 449.95 | Mark Davis |
| H567349 | 482.95 | Larry Stein |
| N097897 | 449.95 | Sue Miller |
+-----------------+--------+-------------+
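I would like to add a feature (column) called 'rate_per_person' to the DataFrame. It is calculated by dividing the rate for a particular reservation number by the total number of guests who share that reservation number.
Here is my code: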
# Importing Libraries
import pandas as pd

# Importing the Dataset
ds = pd.read_csv('hotels.csv')

for index, row in ds.iterrows():
    row['rate_per_person'] = row['rate'] / ds[row['reservation_num']].count
Error message:
Traceback (most recent call last):
File "<ipython-input-3-0668a3165e76>", line 2, in <module>
row['rate_per_person'] = row['rate'] / ds[row['reservation_num']].count
File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/frame.py", line 2062, in __getitem__
return self._getitem_column(key)
File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/frame.py", line 2069, in _getitem_column
return self._get_item_cache(key)
File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/generic.py", line 1534, in _get_item_cache
values = self._data.get(item)
File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/internals.py", line 3590, in get
loc = self.items.get_loc(item)
File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2395, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5239)
File "pandas/_libs/index.pyx", line 154, in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5085)
File "pandas/_libs/hashtable_class_helper.pxi", line 1207, in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20405)
File "pandas/_libs/hashtable_class_helper.pxi", line 1215, in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20359)
KeyError: 'B874576'
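Based on the error message, the problem is clearly in the ds[row['reservation_num']].count part of the last line of code. However, I am not sure of the right way to get the number of guests per reservation so that I can create the new feature programmatically.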
Option 1: pd.Series.value_counts and map
(The examples below use df for the DataFrame that the question calls ds.)
df.rate / df.reservation_num.map(df.reservation_num.value_counts())
0 169.950
1 129.475
2 129.475
3 385.950
4 224.975
5 482.950
6 224.975
dtype: float64
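To make Option 1 reproducible without hotels.csv, here is a minimal sketch that rebuilds the question's table in memory and stores the result as a column. The variable name guest_counts is my own, introduced only for illustration.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    'reservation_num': ['B874576', 'H786234', 'H786234', 'F987354',
                        'N097897', 'H567349', 'N097897'],
    'rate': [169.95, 258.95, 258.95, 385.95, 449.95, 482.95, 449.95],
    'guest_name': ['Bob Smith', 'Jane Doe', 'John Doe', 'David Jones',
                   'Mark Davis', 'Larry Stein', 'Sue Miller'],
})

# Rows (guests) per reservation number, indexed by reservation number.
# guest_counts is a hypothetical name, not part of the original answer.
guest_counts = df['reservation_num'].value_counts()

# map looks up each row's reservation number in guest_counts,
# giving one guest count per row, aligned with df's index.
df['rate_per_person'] = df['rate'] / df['reservation_num'].map(guest_counts)

print(df)
```

Splitting the expression in two makes it easy to inspect guest_counts on its own before dividing.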
Option 2: groupby, transform, and size
df.rate / df.groupby('reservation_num').rate.transform('size')
0 169.950
1 129.475
2 129.475
3 385.950
4 224.975
5 482.950
6 224.975
dtype: float64
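The same idea as a self-contained sketch, again assuming the question's data is rebuilt in memory. The intermediate name group_size is hypothetical and only there to show what transform('size') returns.

```python
import pandas as pd

# Only the two columns the calculation needs, copied from the question.
df = pd.DataFrame({
    'reservation_num': ['B874576', 'H786234', 'H786234', 'F987354',
                        'N097897', 'H567349', 'N097897'],
    'rate': [169.95, 258.95, 258.95, 385.95, 449.95, 482.95, 449.95],
})

# transform('size') returns one value per original row: the size of the
# group that row belongs to, aligned with df's index.
group_size = df.groupby('reservation_num')['rate'].transform('size')

df['rate_per_person'] = df['rate'] / group_size

print(df)
```

Because transform returns a result with the same index as df, no merge or map step is needed before the division.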
Option 3: np.unique and np.bincount
import numpy as np

u, f = np.unique(df.reservation_num.values, return_inverse=True)
df.rate / np.bincount(f)[f]
0 169.950
1 129.475
2 129.475
3 385.950
4 224.975
5 482.950
6 224.975
dtype: float64
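It may help to see the intermediate arrays in Option 3. The sketch below uses the reservation numbers from the question; the names inverse, group_counts, and per_row_counts are mine, chosen for illustration.

```python
import numpy as np

# Reservation numbers and rates copied from the question's table.
reservation_num = np.array(['B874576', 'H786234', 'H786234', 'F987354',
                            'N097897', 'H567349', 'N097897'], dtype=object)
rate = np.array([169.95, 258.95, 258.95, 385.95, 449.95, 482.95, 449.95])

# uniques holds the sorted distinct values; inverse holds, for every row,
# the position of that row's value inside uniques.
uniques, inverse = np.unique(reservation_num, return_inverse=True)

# bincount counts how often each integer code occurs, i.e. the group sizes.
group_counts = np.bincount(inverse)

# Indexing the group sizes with the codes broadcasts them back to the rows.
per_row_counts = group_counts[inverse]

rate_per_person = rate / per_row_counts

print(inverse)         # integer code per row
print(group_counts)    # guests per distinct reservation
print(per_row_counts)  # guests on each row's reservation
```

The key step is the last indexing operation: group_counts has one entry per distinct reservation, while group_counts[inverse] has one entry per row.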
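Option 3.5
np.unique sorts, so it does not scale as well as pd.factorize. In the context in which I am using them, they do the same thing. So I use a function that picks one or the other based on an anecdotal threshold for the array length at which one becomes more efficient than the other. It is numbered 3.5 because it is essentially the same as Option 3.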
def factor(a):
    if len(a) > 10000:
        return pd.factorize(a)[0]
    else:
        return np.unique(a, return_inverse=True)[1]

def count(a):
    f = factor(a)
    return np.bincount(f)[f]

df.rate / count(df.reservation_num.values)
0 169.950
1 129.475
2 129.475
3 385.950
4 224.975
5 482.950
6 224.975
dtype: float64
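The claim that the two functions "do the same thing" here can be checked directly. In this sketch, under the assumption that only group sizes matter, pd.factorize and np.unique assign different integer codes (order of first appearance versus sorted order), yet the per-row counts come out identical. The variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Reservation numbers copied from the question's table.
a = np.array(['B874576', 'H786234', 'H786234', 'F987354',
              'N097897', 'H567349', 'N097897'], dtype=object)

# factorize numbers the values in order of first appearance (no sort).
codes_factorize = pd.factorize(a)[0]

# np.unique numbers the values by their sorted position.
codes_unique = np.unique(a, return_inverse=True)[1]

# The codes differ, but counting occurrences and broadcasting back to
# the rows yields the same per-row group size either way.
counts_factorize = np.bincount(codes_factorize)[codes_factorize]
counts_unique = np.bincount(codes_unique)[codes_unique]

print(codes_factorize, codes_unique)
print(np.array_equal(counts_factorize, counts_unique))
```

The codes are only labels, so any consistent labelling gives the same group sizes.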
Timing
%timeit df.rate / df.reservation_num.map(df.reservation_num.value_counts())
%timeit df.rate / df.groupby('reservation_num').rate.transform('size')
1000 loops, best of 3: 650 µs per loop
1000 loops, best of 3: 768 µs per loop
%%timeit
u, f = np.unique(df.reservation_num.values, return_inverse=True)
df.rate / np.bincount(f)[f]
10000 loops, best of 3: 131 µs per loop
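The %timeit lines above are IPython magics. Outside IPython, a rough equivalent can be sketched with the standard timeit module. The function names below are hypothetical, and the numbers will vary by machine, library version, and data size, so treat the results as indicative only.

```python
import timeit

import numpy as np
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    'reservation_num': ['B874576', 'H786234', 'H786234', 'F987354',
                        'N097897', 'H567349', 'N097897'],
    'rate': [169.95, 258.95, 258.95, 385.95, 449.95, 482.95, 449.95],
})

def with_value_counts():
    counts = df['reservation_num'].value_counts()
    return df['rate'] / df['reservation_num'].map(counts)

def with_transform():
    return df['rate'] / df.groupby('reservation_num')['rate'].transform('size')

def with_bincount():
    _, f = np.unique(df['reservation_num'].to_numpy(), return_inverse=True)
    return df['rate'] / np.bincount(f)[f]

loops = 100
timings = {}
for fn in (with_value_counts, with_transform, with_bincount):
    # Average seconds per call over `loops` calls.
    timings[fn.__name__] = timeit.timeit(fn, number=loops) / loops
    print(f'{fn.__name__}: {timings[fn.__name__] * 1e6:.0f} µs per loop')
```

On a seven-row frame most of the time is fixed overhead, so it is worth repeating the measurement on data of realistic size before choosing an approach.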
You can do this with groupby and transform:
df['rate_per_person'] = df.groupby('reservation_num')['rate'].transform(lambda x: x.iloc[0] / x.size)
Output:
reservation_num rate guest_name rate_per_person
0 B874576 169.95 Bob Smith 169.950
1 H786234 258.95 Jane Doe 129.475
2 H786234 258.95 John Doe 129.475
3 F987354 385.95 David Jones 385.950
4 N097897 449.95 Mark Davis 224.975
5 H567349 482.95 Larry Stein 482.950
6 N097897 449.95 Sue Miller 224.975
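For completeness, here is a short sketch of why the loop in the question fails, assuming the same data is built in memory rather than read from hotels.csv. Indexing a DataFrame with a string looks up a column label, so ds['B874576'] raises the KeyError shown in the traceback. Separately, assigning to row inside iterrows changes only a temporary Series, not ds.

```python
import pandas as pd

# Sample data copied from the question's table.
ds = pd.DataFrame({
    'reservation_num': ['B874576', 'H786234', 'H786234', 'F987354',
                        'N097897', 'H567349', 'N097897'],
    'rate': [169.95, 258.95, 258.95, 385.95, 449.95, 482.95, 449.95],
})

# ds[<string>] selects a column, and no column is named 'B874576'.
try:
    ds['B874576']
except KeyError as exc:
    print('KeyError:', exc)

# Counting the rows of one reservation needs a boolean mask instead.
n_guests = (ds['reservation_num'] == 'H786234').sum()
print(n_guests)

# Each row from iterrows is a separate Series, so adding a key to it
# does not create a column in ds.
for index, row in ds.iterrows():
    row['rate_per_person'] = row['rate']

print('rate_per_person' in ds.columns)
```

This is why the answers compute the counts for all rows at once and assign the result as a whole column.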