Adding missing rows using pandas
This question is related to another question, but it is a bit more complicated.
I have a table like this:
ID DEGREE TERM STATUS GRADTERM
1 Bachelors 20111 1
1 Bachelors 20116 1
2 Bachelors 20126 1
2 Bachelors 20131 1
2 Bachelors 20141 1
3 Bachelors 20106 1
3 Bachelors 20111 1 20116
3 Masters 20116 1
3 Masters 20121 1
3 Masters 20131 1 20136
I would like to turn it into this (when run for term 20151):
ID DEGREE TERM STATUS
1 Bachelors 20111 1
1 Bachelors 20116 1
1 Bachelors 20121 0
1 Bachelors 20126 0
1 Bachelors 20131 0
1 Bachelors 20136 0
1 Bachelors 20141 0
1 Bachelors 20146 0
1 Bachelors 20151 0
2 Bachelors 20126 1
2 Bachelors 20131 1
2 Bachelors 20136 0
2 Bachelors 20141 1
2 Bachelors 20146 0
2 Bachelors 20151 0
3 Bachelors 20106 1
3 Bachelors 20111 1
3 Bachelors 20116 2
3 Bachelors 20121 2
3 Bachelors 20126 2
3 Bachelors 20131 2
3 Bachelors 20136 2
3 Bachelors 20141 2
3 Bachelors 20146 2
3 Bachelors 20151 2
3 Masters 20116 1
3 Masters 20121 1
3 Masters 20126 0
3 Masters 20131 1
3 Masters 20136 2
3 Masters 20141 2
3 Masters 20146 2
3 Masters 20151 2
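In each table, STATUS is 0 - not enrolled, 1 - enrolled, and 2 - graduated. The TERM field is the year followed by 1 or 6, for spring or fall.
Missing TERM records should be added for each person between their first record and the current term (20151 in this example). Each added record gets STATUS 0, unless the last existing record has STATUS 2, in which case the 2 carries forward. In other words, a person is either enrolled (STATUS=1) or not enrolled (STATUS=0 or 2).
I am using pandas in Python, but I am new to Python. I have been trying to figure out how DataFrame indexing works, but at this point it is a complete mystery to me. Any guidance would be greatly appreciated.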
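To make the TERM encoding concrete, here is a minimal plain-Python sketch of the sequence of terms that has to exist for one person. The helper name `terms_between` is hypothetical and only for illustration; it assumes terms are integers such as 20126, with a trailing 1 for spring and 6 for fall.

```python
def terms_between(first_term, current_term):
    """Return every term code from first_term through current_term, inclusive."""
    terms = []
    term = first_term
    while term <= current_term:
        terms.append(term)
        year, season = divmod(term, 10)
        # Spring (1) is followed by fall (6) of the same year;
        # fall (6) is followed by spring (1) of the next year.
        term = year * 10 + 6 if season == 1 else (year + 1) * 10 + 1
    return terms


# Terms needed for ID 2, whose first record is 20126, when run for 20151.
print(terms_between(20126, 20151))
```

Any term in this list that has no existing record is one of the rows that needs to be added.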
You can do it like this.
import io
import pandas as pd

# Python 3.4 was used.
# Replicate the sample data here; read your own CSV file instead.
# =========================================================
csv = 'ID,DEGREE,TERM,STATUS,GRADTERM\n1,Bachelors,20111,1,\n1,Bachelors,20116,1,\n2,Bachelors,20126,1,\n2,Bachelors,20131,1,\n2,Bachelors,20141,1,\n3,Bachelors,20106,1,\n3,Bachelors,20111,1,20116.0\n3,Masters,20116,1,\n3,Masters,20121,1,\n3,Masters,20131,1,20136.0\n'
df = pd.read_csv(io.StringIO(csv)).set_index('ID')
print(df)
DEGREE TERM STATUS GRADTERM
ID
1 Bachelors 20111 1 NaN
1 Bachelors 20116 1 NaN
2 Bachelors 20126 1 NaN
2 Bachelors 20131 1 NaN
2 Bachelors 20141 1 NaN
3 Bachelors 20106 1 NaN
3 Bachelors 20111 1 20116
3 Masters 20116 1 NaN
3 Masters 20121 1 NaN
3 Masters 20131 1 20136
# two helper functions
# =========================================================
def build_year_term_range(start_term, current_term):
    # assumes start_term and current_term are strings in a format like '20151'
    start_year = int(start_term[:4])        # first four characters are the year
    start_season = int(start_term[-1])      # last character is the term (1 or 6)
    current_year = int(current_term[:4])
    current_season = int(current_term[-1])
    # build every term from spring of the start year to fall of the current year
    year_term_rng = [year * 10 + term
                     for year in range(start_year, current_year + 1)
                     for term in (1, 6)]
    # trim the ends that fall outside the requested range
    if start_season == 6:    # records start in fall: remove the first spring
        year_term_rng = year_term_rng[1:]
    if current_season == 1:  # current term is spring: remove the last fall
        year_term_rng = year_term_rng[:-1]
    return year_term_rng

def my_apply_func(group, current_year_term):
    # first term on record for this ID/DEGREE, e.g. '20111'
    # (assumes the rows of each group are already sorted by TERM)
    start_year_term = str(group['TERM'].iloc[0])
    year_term_rng = build_year_term_range(start_year_term, current_year_term)
    # index by TERM, then use reindex to populate the missing rows with NaN
    group = group.set_index('TERM').reindex(year_term_rng)
    # fill added rows with 0, not enrolled (for now); reindex turned STATUS into floats
    group['STATUS'] = group['STATUS'].fillna(0).astype(int)
    # shift GRADTERM one slot forward, because GRADTERM sits on the row before the graduation term
    group['GRADTERM'] = group['GRADTERM'].shift(1)
    # flag the graduation term, use cumsum to carry the flag forward, convert back to boolean
    # (this might look non-trivial at first :)
    group.loc[group['GRADTERM'].notnull().astype(int).cumsum().astype(bool), 'STATUS'] = 2
    # return only the relevant column; ID and DEGREE come back from the groupby keys
    return group['STATUS']

# start processing
# ============================================================
# ID was set as the index above; move it back to a normal column
df = df.reset_index()
# specify the current year term as a string
current_year_term = '20151'
# apply the function to each ID/DEGREE group, passing the current term explicitly
result = (df.groupby(['ID', 'DEGREE'])[['TERM', 'STATUS', 'GRADTERM']]
            .apply(my_apply_func, current_year_term=current_year_term)
            .reset_index())
Out[163]:
ID DEGREE TERM STATUS
0 1 Bachelors 20111 1
1 1 Bachelors 20116 1
2 1 Bachelors 20121 0
3 1 Bachelors 20126 0
4 1 Bachelors 20131 0
5 1 Bachelors 20136 0
6 1 Bachelors 20141 0
7 1 Bachelors 20146 0
8 1 Bachelors 20151 0
9 2 Bachelors 20126 1
10 2 Bachelors 20131 1
11 2 Bachelors 20136 0
12 2 Bachelors 20141 1
13 2 Bachelors 20146 0
14 2 Bachelors 20151 0
15 3 Bachelors 20106 1
16 3 Bachelors 20111 1
17 3 Bachelors 20116 2
18 3 Bachelors 20121 2
19 3 Bachelors 20126 2
20 3 Bachelors 20131 2
21 3 Bachelors 20136 2
22 3 Bachelors 20141 2
23 3 Bachelors 20146 2
24 3 Bachelors 20151 2
25 3 Masters 20116 1
26 3 Masters 20121 1
27 3 Masters 20126 0
28 3 Masters 20131 1
29 3 Masters 20136 2
30 3 Masters 20141 2
31 3 Masters 20146 2
32 3 Masters 20151 2
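Since the question mentions that DataFrame indexing is still a mystery, it may help to see the two key steps in isolation on a single group. This is only a sketch: the list `all_terms` is typed out by hand here instead of being built by a helper, and the rows are the ID 3 / Masters records from the sample data.

```python
import pandas as pd

# Existing records for one (ID, DEGREE) group, indexed by TERM.
group = pd.DataFrame(
    {"STATUS": [1, 1, 1], "GRADTERM": [float("nan"), float("nan"), 20136.0]},
    index=pd.Index([20116, 20121, 20131], name="TERM"),
)

# Every term from the first record through the current term (written out by hand).
all_terms = [20116, 20121, 20126, 20131, 20136, 20141, 20146, 20151]

# Step 1: reindex keeps rows whose TERM already exists and adds NaN rows for the rest.
group = group.reindex(all_terms)
status = group["STATUS"].fillna(0).astype(int)

# Step 2: GRADTERM sits on the row before the graduation term, so shift it down
# one row; the running count of non-null values is > 0 from that term onward.
graduated = group["GRADTERM"].shift(1).notna().cumsum() > 0
status.loc[graduated] = 2

print(status)
```

The labels passed to `reindex` become the new index, which is why TERM has to be the index before that step. The term 20126 is filled with 0, and 20136 onward is set to 2.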