How to split a pandas column into two columns with strings and ints
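I want to split the Date range column into two columns, a start date and an end date. However, the split does not seem to work, because it does not recognize the "-". Any suggestions?
I tried using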
'''
ebola1 = pd.DataFrame(ebola['Date range'].str.split('-',1).to_list(),columns = ['start date','end date'])
'''
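However, it returns the following:
So (1) it does not recognize the "-", (2) how do I distinguish between 'Jun-Nov 1976' and 'Oct 2001-Mar 2002', and (3) how do I include the new columns in the existing table?
Thanks for your help!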
The data uses – (an en dash) instead of - (a hyphen), so split on – with Series.str.split and expand=True to get a DataFrame:
import numpy as np
import pandas as pd

data = ['Jun–Nov 1976', 'Sep–Oct 1976', 'Jun 1977', 'Jul–Oct 1979', 'Nov 1994', 'Nov 1994–Feb 1995', 'Jan–Jul 1995', 'Jan–Mar 1996', 'Jul 1996–Jan 1997', 'Oct 2000–Feb 2001', 'Oct 2001–Mar 2002', 'Oct 2001–Mar 2002', 'Oct 2001–Mar 2002', 'Oct 2001–Mar 2002', 'Oct 2001–Mar 2002', 'Dec 2002–Apr 2003', 'Dec 2002–Apr 2003', 'Dec 2002–Apr 2003', 'Oct–Dec 2003', 'Apr–Jun 2004']
ebola = pd.DataFrame(data, columns=['Date range'])
# n must be passed by keyword in pandas 2.0 and later
ebola1 = ebola['Date range'].str.split('–', n=1, expand=True)
ebola1.columns = ['start date','end date']
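If it is not obvious which dash the column contains, or if both kinds appear, one option is to inspect the code points and split on a character class that matches either. This is only a sketch: the three-row Series below is a hypothetical sample, not the original data.

```python
import pandas as pd

# Hypothetical sample mixing an ASCII hyphen and an en dash (U+2013).
s = pd.Series(['Jun-Nov 1976', 'Oct 2001–Mar 2002', 'Jun 1977'])

# Collect the code point of every character that is neither alphanumeric nor whitespace.
odd = {ch: hex(ord(ch)) for ch in ''.join(s) if not (ch.isalnum() or ch.isspace())}
print(odd)

# A multi-character pattern is treated as a regular expression,
# so the character class matches either kind of dash.
parts = s.str.split(r'[-–]', n=1, expand=True)
parts.columns = ['start date', 'end date']
print(parts)
```

Rows without a dash keep their whole value in 'start date' and get a missing value in 'end date'.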
Then use numpy.where to add the year, extracted from the end date with Series.str.extract, to the start date, but only where the start date column does not already contain one, which is tested with Series.str.contains:
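# True where the start date already contains a year (any digit)
mask = ebola1['start date'].str.contains(r'\d')
# Year taken from the end date (NaN where there is no end date)
years = ebola1['end date'].str.extract(r'(\d+)', expand=False)
ebola1['start date'] = np.where(mask,
                                ebola1['start date'],
                                ebola1['start date'] + ' ' + years)
print(ebola1)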
   start date  end date
0    Jun 1976  Nov 1976
1    Sep 1976  Oct 1976
2    Jun 1977      None
3    Jul 1979  Oct 1979
4    Nov 1994      None
5    Nov 1994  Feb 1995
6    Jan 1995  Jul 1995
7    Jan 1996  Mar 1996
8    Jul 1996  Jan 1997
9    Oct 2000  Feb 2001
10   Oct 2001  Mar 2002
11   Oct 2001  Mar 2002
12   Oct 2001  Mar 2002
13   Oct 2001  Mar 2002
14   Oct 2001  Mar 2002
15   Dec 2002  Apr 2003
16   Dec 2002  Apr 2003
17   Dec 2002  Apr 2003
18   Oct 2003  Dec 2003
19   Apr 2004  Jun 2004
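For point (3), one possible way to get the new columns into the existing table is DataFrame.join, which aligns on the index. The sketch below uses a hypothetical three-row stand-in for the original table. The names parts, has_year, start and end are made up for illustration, and the final datetime conversion is an optional extra that assumes English month abbreviations.

```python
import numpy as np
import pandas as pd

# Small hypothetical stand-in for the original table.
ebola = pd.DataFrame({'Date range': ['Jun–Nov 1976', 'Jun 1977', 'Oct 2001–Mar 2002']})

parts = ebola['Date range'].str.split('–', n=1, expand=True)
parts.columns = ['start date', 'end date']

# Copy the year over from the end date when the start date lacks one.
has_year = parts['start date'].str.contains(r'\d')
years = parts['end date'].str.extract(r'(\d+)', expand=False)
parts['start date'] = np.where(has_year,
                               parts['start date'],
                               parts['start date'] + ' ' + years)

# join aligns on the index, so the new columns line up with the original rows.
ebola = ebola.join(parts)

# Optional: parse the 'Mon YYYY' strings; a missing end date becomes NaT.
ebola['start'] = pd.to_datetime(ebola['start date'], format='%b %Y')
ebola['end'] = pd.to_datetime(ebola['end date'], format='%b %Y')
print(ebola)
```

Assigning directly, for example ebola[['start date', 'end date']] = parts, should work as well, since both frames share the same index.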