How do I perform ordered selection on multiple columns by value?
I have a DataFrame with month and year columns. Both contain strings, i.e. 'September' and '2013'. How do I select all rows between September 2013 and May 2008 in one line?
df1 = stats_month_census_2[(stats_month_census_2['year'] <= '2013')
& (stats_month_census_2['year'] >= '2008')]
df2 = df1[...]
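After the code above I was going to do the same thing again, but I am having a hard time coming up with clever code to simply drop the rows that are later than September 2013 ('October to December') and earlier than May 2008. I could easily hard-code this, but there must be a more pythonic way of doing it...
You can create a DatetimeIndex and then select by partial string indexing: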
stats_month_census_2 = pd.DataFrame({
'year': [2008, 2008, 2008, 2013,2013],
'month': ['April','May','June','September','October'],
'data':[1,3,4,6,5]
})
print (stats_month_census_2)
year month data
0 2008 April 1
1 2008 May 3
2 2008 June 4
3 2013 September 6
4 2013 October 5
s = stats_month_census_2.pop('year').astype(str) + stats_month_census_2.pop('month')
# if you need to keep the year and month columns, use this instead of pop:
#s = stats_month_census_2['year'].astype(str) + stats_month_census_2['month']
stats_month_census_2.index = pd.to_datetime(s, format='%Y%B')
print (stats_month_census_2)
data
2008-04-01 1
2008-05-01 3
2008-06-01 4
2013-09-01 6
2013-10-01 5
print (stats_month_census_2['2008':'2013'])
data
2008-04-01 1
2008-05-01 3
2008-06-01 4
2013-09-01 6
2013-10-01 5
print (stats_month_census_2['2008-05':'2013-09'])
data
2008-05-01 3
2008-06-01 4
2013-09-01 6
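The same idea as a self-contained sketch that keeps the year and month columns and slices through .loc. The names df, key, indexed and selected are illustrative and not from the original answer, and %B assumes English month names in the current locale.

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2008, 2008, 2008, 2013, 2013],
    'month': ['April', 'May', 'June', 'September', 'October'],
    'data': [1, 3, 4, 6, 5],
})

# Build "2008April"-style strings; parsing them gives the first day of each month.
key = df['year'].astype(str) + df['month']
indexed = df.set_index(pd.to_datetime(key, format='%Y%B')).sort_index()

# Partial string slicing on a sorted DatetimeIndex includes both end months.
selected = indexed.loc['2008-05':'2013-09']
print(selected)
```

Sorting the index first is a defensive choice: label slicing on a DatetimeIndex is only well defined when the index is monotonic.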
Or create a column and use between with boolean indexing:
s = stats_month_census_2['year'].astype(str) + stats_month_census_2['month']
stats_month_census_2['date'] = pd.to_datetime(s, format='%Y%B')
print (stats_month_census_2)
year month data date
0 2008 April 1 2008-04-01
1 2008 May 3 2008-05-01
2 2008 June 4 2008-06-01
3 2013 September 6 2013-09-01
4 2013 October 5 2013-10-01
df = stats_month_census_2[stats_month_census_2['date'].between('2008-05', '2013-09')]
print (df)
year month data date
1 2008 May 3 2008-05-01
2 2008 June 4 2008-06-01
3 2013 September 6 2013-09-01
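Unfortunately, this approach with a datetime column cannot be used to select between whole years, because a bound such as '2013' is parsed as 2013-01-01 and every later row of 2013 is excluded. In that case you need pygo's solution with the year column: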
#wrong output
df = stats_month_census_2[stats_month_census_2['date'].between('2008', '2013')]
print (df)
year month data date
0 2008 April 1 2008-04-01
1 2008 May 3 2008-05-01
2 2008 June 4 2008-06-01
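If the datetime column is all you have, one possible workaround (a sketch, not part of the original answer) is to compare the extracted year instead of the timestamp itself. The names df and whole_years are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2008-04-01', '2008-05-01', '2008-06-01',
                            '2013-09-01', '2013-10-01']),
    'data': [1, 3, 4, 6, 5],
})

# Comparing the integer year keeps every month of both end years,
# whereas the string bound '2013' would stop at 2013-01-01.
whole_years = df[df['date'].dt.year.between(2008, 2013)]
print(whole_years)
```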
You can easily convert the columns to a datetime column using pd.to_datetime:
>>df
month year
0 January 2000
1 April 2001
2 July 2002
3 February 2010
4 February 2018
5 March 2014
6 June 2012
7 June 2011
8 May 2009
9 November 2016
>>df['date'] = pd.to_datetime(df['month'].astype(str) + '-' + df['year'].astype(str), format='%B-%Y')
>>df
month year date
0 January 2000 2000-01-01
1 April 2001 2001-04-01
2 July 2002 2002-07-01
3 February 2010 2010-02-01
4 February 2018 2018-02-01
5 March 2014 2014-03-01
6 June 2012 2012-06-01
7 June 2011 2011-06-01
8 May 2009 2009-05-01
9 November 2016 2016-11-01
>>df[(df.date <= "2013-09") & (df.date >= "2008-05") ]
month year date
3 February 2010 2010-02-01
6 June 2012 2012-06-01
7 June 2011 2011-06-01
8 May 2009 2009-05-01
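Alternatively, if by "select all rows between September 2013 and May 2008" in your post you are looking for rows between the years 2008 and 2013, you can try the following, which uses pandas.Series.between.
Dataset borrowed from @jezrael.
DataFrame for demonstration purposes: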
>>> stats_month_census_2
year month data
0 2008 April 1
1 2008 May 3
2 2008 June 4
3 2013 September 6
4 2013 October 5
5 2014 November 6
6 2014 December 7
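Using pandas.Series.between() (both bounds are inclusive by default; the boolean inclusive=True is no longer accepted by recent pandas versions):
>>> stats_month_census_2[stats_month_census_2['year'].between(2008, 2013)]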
year month data
0 2008 April 1
1 2008 May 3
2 2008 June 4
3 2013 September 6
4 2013 October 5
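If it is only a matter of datetime format (that is, the year column has already been converted to datetime, as described in the note further down), you can simply try the following:
>>> stats_month_census_2[stats_month_census_2['year'].between('2008-05', '2013-09')]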
year month data
1 2008-05-01 May 3
2 2008-06-01 June 4
3 2013-09-01 September 6
Using DataFrame.query:
>>> stats_month_census_2.query('"2008-05" <= year <= "2013-09"')
year month data
1 2008-05-01 May 3
2 2008-06-01 June 4
3 2013-09-01 September 6
Using the isin method to select rows between two dates:
>>> stats_month_census_2[stats_month_census_2['year'].isin(pd.date_range('2008-05-01', '2013-09-01'))]
year month data
1 2008-05-01 May 3
2 2008-06-01 June 4
3 2013-09-01 September 6
Or you can pass the bounds as year-month strings:
>>> stats_month_census_2[stats_month_census_2['year'].isin(pd.date_range('2008-05', '2013-09'))]
year month data
1 2008-05-01 May 3
2 2008-06-01 June 4
3 2013-09-01 September 6
Using the loc method to slice between the index labels of the start and end dates:
Start = stats_month_census_2[stats_month_census_2['year'] =='2008-05'].index[0]
End = stats_month_census_2[stats_month_census_2['year']=='2013-09'].index[0]
>>> stats_month_census_2.loc[Start:End]
year month data
1 2008-05-01 May 3
2 2008-06-01 June 4
3 2013-09-01 September 6
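Note: Out of curiosity, as @jezrael asked in a comment, I am adding how I converted the year column to datetime format.
The example DataFrame below has two separate columns, year and month. The year column holds only the year, and the month column holds the month name as a literal string. So we first need to convert the month names to int form, and then combine year and month with pandas pd.to_datetime, assigning 1 as the day for every row.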
df
year month data
0 2008 April 1
1 2008 May 3
2 2008 June 4
3 2013 September 6
4 2013 October 5
5 2014 November 6
6 2014 December 7
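Above is the raw DataFrame before the datetime conversion, so I took the approach below, which I learned on SO itself.
1- First, convert the month names to int form and assign them to a new column called Month, so that we can use it later for the conversion.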
df['Month'] = pd.to_datetime(df.month, format='%B').dt.month
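2- Second, and finally, convert the year column to a proper datetime format by assigning the result directly to the year column itself, which we can call a kind of in-place conversion.
df['year'] = pd.to_datetime(df[['year', 'Month']].assign(Day=1))
Now the desired DataFrame, with the year column in datetime form: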
print(df)
year month data Month
0 2008-04-01 April 1 4
1 2008-05-01 May 3 5
2 2008-06-01 June 4 6
3 2013-09-01 September 6 9
4 2013-10-01 October 5 10
5 2014-11-01 November 6 11
6 2014-12-01 December 7 12
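A compact variant of the two steps above, as a sketch that writes the result to a separate date column so the original year values stay available. The column name date and the helper frame parts are illustrative; pd.to_datetime assembles a datetime from columns named year, month and day.

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2008, 2008, 2013, 2014],
    'month': ['April', 'May', 'September', 'December'],
    'data': [1, 3, 6, 7],
})

# Month names become month numbers; every row gets day 1.
parts = pd.DataFrame({
    'year': df['year'],
    'month': pd.to_datetime(df['month'], format='%B').dt.month,
    'day': 1,
})
df['date'] = pd.to_datetime(parts)
print(df)
```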
Another solution:
Suppose df looks like this:
series name Month Year
0 fertility rate May 2008
1 CO2 emissions June 2009
2 fertility rate September 2013
3 fertility rate October 2013
4 CO2 emissions December 2014
Create a dictionary mapping from the calendar module and save the mapped month numbers in a new column:
import calendar
d = dict((v,k) for k,v in enumerate(calendar.month_abbr))
stats_month_census_2['month_int'] = stats_month_census_2.Month.apply(lambda x: x[:3]).map(d)
>>stats_month_census_2
series name Month Year month_int
0 fertility rate May 2008 5
1 CO2 emissions June 2009 6
2 fertility rate September 2013 9
3 fertility rate October 2013 10
4 CO2 emissions December 2014 12
Filter using Series.between():
stats_month_census_2[stats_month_census_2.month_int.between(5, 9) & stats_month_census_2.Year.between(2008, 2013)]
Output:
series name Month Year month_int
0 fertility rate May 2008 5
1 CO2 emissions June 2009 6
2 fertility rate September 2013 9
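Filtering month and year separately keeps only the months May through September inside those years, so a row such as December 2010 would be dropped. If a continuous range from May 2008 to September 2013 is wanted, one option (a sketch, not from the original answer) is to combine both columns into a single sortable integer. The names month_number, key and in_range are illustrative, and calendar.month_name is assumed to return English names.

```python
import calendar
import pandas as pd

df = pd.DataFrame({
    'Month': ['May', 'December', 'September', 'October', 'April'],
    'Year': [2008, 2010, 2013, 2013, 2008],
})

# Map full month names to 1..12 (index 0 of month_name is an empty string).
month_number = {name: i for i, name in enumerate(calendar.month_name) if name}

# Year * 12 + month orders the rows chronologically as one integer.
key = df['Year'] * 12 + df['Month'].map(month_number)
in_range = df[key.between(2008 * 12 + 5, 2013 * 12 + 9)]
print(in_range)
```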