How do I iterate through a column of a pandas dataframe and delete rows accordingly? (Jupyter Notebook)
This is the dataframe I'm working with right now:
| | Season | Team | W | L | W/L% | Coaches |
|---|---|---|---|---|---|---|
| 1 | 2020-21 | Atlanta Hawks* | 41 | 31 | 0.569 | L. Pierce (14-20) N. McMillan (27-11) |
| 2 | 2019-20 | Atlanta Hawks | 20 | 47 | 0.299 | L. Pierce (20-47) |
| 3 | 2018-19 | Atlanta Hawks | 29 | 53 | 0.354 | L. Pierce (29-53) |
| 4 | 2017-18 | Atlanta Hawks | 24 | 58 | 0.293 | M. Budenholzer (24-58) |
| 5 | 2016-17 | Atlanta Hawks* | 43 | 39 | 0.524 | M. Budenholzer (43-39) |
| 6 | 2015-16 | Atlanta Hawks* | 48 | 34 | 0.585 | M. Budenholzer (48-34) |
| 7 | 2014-15 | Atlanta Hawks* | 60 | 22 | 0.732 | M. Budenholzer (60-22) |
| 8 | 2013-14 | Atlanta Hawks* | 38 | 44 | 0.463 | M. Budenholzer (38-44) |
| 9 | 2012-13 | Atlanta Hawks* | 44 | 38 | 0.537 | L. Drew (44-38) |
| 10 | 2011-12 | Atlanta Hawks* | 40 | 26 | 0.606 | L. Drew (40-26) |
| 11 | 2010-11 | Atlanta Hawks* | 44 | 38 | 0.537 | L. Drew (44-38) |
| 12 | 2009-10 | Atlanta Hawks* | 53 | 29 | 0.646 | M. Woodson (53-29) |
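I basically want to keep only the rows where the coach's name differs from that of a neighboring year. So, for example, I would keep row 4 because its immediate neighbor, row 3, has a different name under the 'Coaches' column, but I would delete rows 5, 6, and 7 because the name in the 'Coaches' column is the same for all three rows. However, I would want to keep row 8 because row 9 (a neighboring row) has a different name under 'Coaches'.
I got this dataframe by reading a csv file:
df = pd.read_csv("hawks.csv")
I figure I should use df.iloc, but I don't know how to iterate over each row and compare the values in the column. So far I have only managed to print the string values in the 'Coaches' column, like this:
coaches = df.iloc[0:, 7]
for name in coaches:
    print(name)
But I'd like to know how to get the value stored in the 'Coaches' column while iterating over each row (and then delete the rows that don't meet the condition I'm looking for). Thanks a lot!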
You can use the drop_duplicates method.
import re
comp = re.compile(r'[A-Z]\.\s\w+')  # matches names such as "L. Pierce"
# New column that holds only the coaches' names
df['Coaches_Names'] = df['Coaches'].apply(lambda x: ' - '.join(comp.findall(x)))
# Drop the '*' characters from the team names
df['Team'] = df['Team'].str.replace('*', '', regex=False)
df.drop_duplicates(subset=['Team', 'Coaches_Names'], inplace=True)
print(df)
Season Team W L W/L% Coaches Coaches_Names
0 2020-21 Atlanta Hawks 41 31 0.569 L. Pierce (14-20) N. McMillan (27-11) L. Pierce - N. McMillan
1 2019-20 Atlanta Hawks 20 47 0.299 L. Pierce (20-47) L. Pierce
3 2017-18 Atlanta Hawks 24 58 0.293 M. Budenholzer (24-58) M. Budenholzer
8 2012-13 Atlanta Hawks 44 38 0.537 L. Drew (44-38) L. Drew
11 2009-10 Atlanta Hawks 53 29 0.646 M. Woodson (53-29) M. Woodson
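Whether to drop the Coaches_Names column afterwards is up to you.
You may also want to use the keep parameter. It lets you choose whether the first or the last year of each group is kept. For more information, see the documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html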
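To make the effect of keep concrete, here is a minimal sketch on a small hand-made frame. The frame is only an illustrative subset of the seasons in the question, and the variable names are my own; it assumes the rows are sorted newest season first, as in the question.

```python
import pandas as pd

# Small illustrative frame (a subset of the seasons from the question).
df = pd.DataFrame({
    "Season": ["2017-18", "2016-17", "2015-16", "2012-13", "2011-12"],
    "Coaches_Names": ["M. Budenholzer", "M. Budenholzer", "M. Budenholzer",
                      "L. Drew", "L. Drew"],
})

# keep="first" (the default) retains the first row of each duplicate group,
# which is the most recent season here because the data is sorted newest first.
first = df.drop_duplicates(subset=["Coaches_Names"], keep="first")

# keep="last" retains the last row of each group (the earliest season here).
last = df.drop_duplicates(subset=["Coaches_Names"], keep="last")

# keep=False drops every row that has a duplicate anywhere in the frame.
unique_only = df.drop_duplicates(subset=["Coaches_Names"], keep=False)

print(first["Season"].tolist())
print(last["Season"].tolist())
print(unique_only["Season"].tolist())
```

Note that drop_duplicates looks at the whole column, not only at neighboring rows, so a coach who returns after a gap would still be treated as a duplicate.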
Edit
The code was edited to format the columns, @RJ Adriaansen
You can filter the column by comparing consecutive rows with shift:
import pandas as pd
data = [
    {"idx": 1, "Season": "2020-21", "Team": "Atlanta Hawks*", "W": 41, "L": 31, "Coaches": "L. Pierce (14-20) N. McMillan (27-11)"},
    {"idx": 2, "Season": "2019-20", "Team": "Atlanta Hawks", "W": 20, "L": 47, "Coaches": "L. Pierce (20-47)"},
    {"idx": 3, "Season": "2018-19", "Team": "Atlanta Hawks", "W": 29, "L": 53, "Coaches": "L. Pierce (29-53)"},
    {"idx": 4, "Season": "2017-18", "Team": "Atlanta Hawks", "W": 24, "L": 58, "Coaches": "M. Budenholzer (24-58)"},
    {"idx": 5, "Season": "2016-17", "Team": "Atlanta Hawks*", "W": 43, "L": 39, "Coaches": "M. Budenholzer (43-39)"},
    {"idx": 6, "Season": "2015-16", "Team": "Atlanta Hawks*", "W": 48, "L": 34, "Coaches": "M. Budenholzer (48-34)"},
    {"idx": 7, "Season": "2014-15", "Team": "Atlanta Hawks*", "W": 60, "L": 22, "Coaches": "M. Budenholzer (60-22)"},
    {"idx": 8, "Season": "2013-14", "Team": "Atlanta Hawks*", "W": 38, "L": 44, "Coaches": "M. Budenholzer (38-44)"},
    {"idx": 9, "Season": "2012-13", "Team": "Atlanta Hawks*", "W": 44, "L": 38, "Coaches": "L. Drew (44-38)"},
    {"idx": 10, "Season": "2011-12", "Team": "Atlanta Hawks*", "W": 40, "L": 26, "Coaches": "L. Drew (40-26)"},
    {"idx": 11, "Season": "2010-11", "Team": "Atlanta Hawks*", "W": 44, "L": 38, "Coaches": "L. Drew (44-38)"},
    {"idx": 12, "Season": "2009-10", "Team": "Atlanta Hawks*", "W": 53, "L": 29, "Coaches": "M. Woodson (53-29)"},
]
df = pd.DataFrame(data)
# Text before the first '(' in each row, e.g. "L. Pierce "
names = df['Coaches'].str.split('(').str[0]
# Keep a row if it differs from the row above or from the row below
df[(names != names.shift(1)) | (names != names.shift(-1))]
Output:
| | idx | Season | Team | W | L | Coaches |
|---|---|---|---|---|---|---|
| 0 | 1 | 2020-21 | Atlanta Hawks* | 41 | 31 | L. Pierce (14-20) N. McMillan (27-11) |
| 2 | 3 | 2018-19 | Atlanta Hawks | 29 | 53 | L. Pierce (29-53) |
| 3 | 4 | 2017-18 | Atlanta Hawks | 24 | 58 | M. Budenholzer (24-58) |
| 7 | 8 | 2013-14 | Atlanta Hawks* | 38 | 44 | M. Budenholzer (38-44) |
| 8 | 9 | 2012-13 | Atlanta Hawks* | 44 | 38 | L. Drew (44-38) |
| 10 | 11 | 2010-11 | Atlanta Hawks* | 44 | 38 | L. Drew (44-38) |
| 11 | 12 | 2009-10 | Atlanta Hawks* | 53 | 29 | M. Woodson (53-29) |
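I'm not sure whether you want to count the Pierce/McMillan row as the first mention of Pierce; my answer assumes that you do. If you would rather count it as a separate hit, just replace split('(') with rsplit('(', n=1).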
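The difference between the two splits is easiest to see on the first few rows. The following is a small sketch rather than part of the answer's code: the helper name keep_mask and the four-row Series are my own, chosen only for illustration.

```python
import pandas as pd

# First four 'Coaches' values from the question's dataframe.
coaches = pd.Series([
    "L. Pierce (14-20) N. McMillan (27-11)",
    "L. Pierce (20-47)",
    "L. Pierce (29-53)",
    "M. Budenholzer (24-58)",
])

def keep_mask(names):
    """True where a row differs from the row above or from the row below."""
    return (names != names.shift(1)) | (names != names.shift(-1))

# split("(") cuts at the first parenthesis, so row 0 reduces to "L. Pierce ".
first_name_only = coaches.str.split("(").str[0]

# rsplit("(", n=1) cuts only at the last parenthesis, so row 0 keeps
# "L. Pierce (14-20) N. McMillan " and no longer equals row 1.
all_names = coaches.str.rsplit("(", n=1).str[0]

print(keep_mask(first_name_only).tolist())
print(keep_mask(all_names).tolist())
```

With split, row 1 sits between two rows that also reduce to "L. Pierce " and is dropped; with rsplit, row 0 counts as a different value, so row 1 is kept. The first and last rows are always kept because shift fills the missing neighbor with NaN, which never compares equal.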
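To iterate over the dataframe:
it = df.iterrows()  # iterator of (index, row) pairs
for index, row in it:
    # to delete a row; cond is a placeholder for your own condition on row
    if cond:
        df.drop(index, inplace=True)  # drop by index label, not by the row itself
# reset indices if needed
df.reset_index(drop=True, inplace=True)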
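Applied to the question, the loop could look roughly like the sketch below. It assumes that a row should be dropped only when both neighbors have the same coach name, that the first and last rows are always kept, and that the name is the text before the first '('. The eight-row frame and the variable names are my own illustration. Collecting the labels first and dropping them afterwards avoids modifying the dataframe while iterating over it.

```python
import pandas as pd

# Illustrative subset of the question's data (newest season first).
df = pd.DataFrame({
    "Season": ["2019-20", "2018-19", "2017-18", "2016-17",
               "2015-16", "2014-15", "2013-14", "2012-13"],
    "Coaches": ["L. Pierce (20-47)", "L. Pierce (29-53)",
                "M. Budenholzer (24-58)", "M. Budenholzer (43-39)",
                "M. Budenholzer (48-34)", "M. Budenholzer (60-22)",
                "M. Budenholzer (38-44)", "L. Drew (44-38)"],
})

# Coach name per row: the text before the first "(", without trailing space.
names = [value.split("(")[0].strip() for value in df["Coaches"]]

to_drop = []
for pos, (index, row) in enumerate(df.iterrows()):
    current = row["Coaches"].split("(")[0].strip()
    prev_name = names[pos - 1] if pos > 0 else None
    next_name = names[pos + 1] if pos < len(names) - 1 else None
    # Drop the row only when both neighbors carry the same coach name.
    if current == prev_name and current == next_name:
        to_drop.append(index)

# Drop all collected labels at once, then renumber the index.
result = df.drop(index=to_drop).reset_index(drop=True)
print(result)
```

Row-by-row loops are usually slower than vectorized comparisons in pandas, so this form is mainly useful when the condition is hard to express column-wise.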