Pandas - 连接多索引数据帧保持重复索引
Pandas - concating multi-indexed dataframes keeps duplicate indizes
你好,我正在尝试读取相同结构的多个 df,并将它们连接成一个,但是组合的 df 以某种方式在索引中保留了重复项....
df1 = processUSAdata(2223)
df2 = processUSAdata(2224)
print(df1)
print(df2)
test = pd.concat([df1,df2])
test.set_index(['state','year'],inplace=True)
print(test)
产出
state population violent_crime ... theft gta year
0 ALABAMA 4903185 510.81 ... 1886.06 256.51 2223
1 ALASKA 731545 867.07 ... 2066.04 357.74 2223
2 ARIZONA 7278717 455.31 ... 1796.86 249.37 2223
3 ARKANSAS 3017804 584.63 ... 2012.56 245.87 2223
4 CALIFORNIA 39512223 441.21 ... 1586.35 358.77 2223
5 COLORADO 5758736 380.95 ... 1858.26 383.99 2223
6 CONNECTICUT 3565287 183.60 ... 1078.65 167.28 2223
[7 rows x 12 columns]
state population violent_crime ... theft gta year
0 ALABAMA 4903185 510.81 ... 1886.06 256.51 2224
1 ALASKA 731545 867.07 ... 2066.04 357.74 2224
2 ARIZONA 7278717 455.31 ... 1796.86 249.37 2224
3 ARKANSAS 3017804 584.63 ... 2012.56 245.87 2224
4 CALIFORNIA 39512223 441.21 ... 1586.35 358.77 2224
5 COLORADO 5758736 380.95 ... 1858.26 383.99 2224
6 CONNECTICUT 3565287 183.60 ... 1078.65 167.28 2224
[7 rows x 12 columns]
population violent_crime murder ... burglary theft gta
state year ...
ALABAMA 2223 4903185 510.81 7.30 ... 531.88 1886.06 256.51
ALASKA 2223 731545 867.07 9.43 ... 487.05 2066.04 357.74
ARIZONA 2223 7278717 455.31 5.01 ... 394.29 1796.86 249.37
ARKANSAS 2223 3017804 584.63 8.02 ... 599.61 2012.56 245.87
CALIFORNIA 2223 39512223 441.21 4.28 ... 386.10 1586.35 358.77
COLORADO 2223 5758736 380.95 3.79 ... 348.41 1858.26 383.99
CONNECTICUT 2223 3565287 183.60 2.92 ... 180.66 1078.65 167.28
ALABAMA 2224 4903185 510.81 7.30 ... 531.88 1886.06 256.51
ALASKA 2224 731545 867.07 9.43 ... 487.05 2066.04 357.74
ARIZONA 2224 7278717 455.31 5.01 ... 394.29 1796.86 249.37
ARKANSAS 2224 3017804 584.63 8.02 ... 599.61 2012.56 245.87
CALIFORNIA 2224 39512223 441.21 4.28 ... 386.10 1586.35 358.77
COLORADO 2224 5758736 380.95 3.79 ... 348.41 1858.26 383.99
CONNECTICUT 2224 3565287 183.60 2.92 ... 180.66 1078.65 167.28
希望有人能帮助我。
提前致谢! :)
您可以通过以下方式验证何时串联重复项:
test = pd.concat([df1, df2], verify_integrity=True)
或者您可以在之后删除重复项:
test.set_index(['state', 'year'], inplace=True).drop_duplicates(inplace=True)
可能是因为你的索引没有排序:
test = pd.concat([df1, df2]).set_index(['state', 'year']).sort_index()
print(test)
# Output
population violent_crime theft gta
state year
ALABAMA 2223 4903185 510.81 1886.06 256.51
2224 4903185 510.81 1886.06 256.51
ALASKA 2223 731545 867.07 2066.04 357.74
2224 731545 867.07 2066.04 357.74
ARIZONA 2223 7278717 455.31 1796.86 249.37
2224 7278717 455.31 1796.86 249.37
ARKANSAS 2223 3017804 584.63 2012.56 245.87
2224 3017804 584.63 2012.56 245.87
CALIFORNIA 2223 39512223 441.21 1586.35 358.77
2224 39512223 441.21 1586.35 358.77
COLORADO 2223 5758736 380.95 1858.26 383.99
2224 5758736 380.95 1858.26 383.99
CONNECTICUT 2223 3565287 183.60 1078.65 167.28
2224 3565287 183.60 1078.65 167.28
你好,我正在尝试读取相同结构的多个 df,并将它们连接成一个,但是组合的 df 以某种方式在索引中保留了重复项....
df1 = processUSAdata(2223)
df2 = processUSAdata(2224)
print(df1)
print(df2)
test = pd.concat([df1,df2])
test.set_index(['state','year'],inplace=True)
print(test)
产出
state population violent_crime ... theft gta year
0 ALABAMA 4903185 510.81 ... 1886.06 256.51 2223
1 ALASKA 731545 867.07 ... 2066.04 357.74 2223
2 ARIZONA 7278717 455.31 ... 1796.86 249.37 2223
3 ARKANSAS 3017804 584.63 ... 2012.56 245.87 2223
4 CALIFORNIA 39512223 441.21 ... 1586.35 358.77 2223
5 COLORADO 5758736 380.95 ... 1858.26 383.99 2223
6 CONNECTICUT 3565287 183.60 ... 1078.65 167.28 2223
[7 rows x 12 columns]
state population violent_crime ... theft gta year
0 ALABAMA 4903185 510.81 ... 1886.06 256.51 2224
1 ALASKA 731545 867.07 ... 2066.04 357.74 2224
2 ARIZONA 7278717 455.31 ... 1796.86 249.37 2224
3 ARKANSAS 3017804 584.63 ... 2012.56 245.87 2224
4 CALIFORNIA 39512223 441.21 ... 1586.35 358.77 2224
5 COLORADO 5758736 380.95 ... 1858.26 383.99 2224
6 CONNECTICUT 3565287 183.60 ... 1078.65 167.28 2224
[7 rows x 12 columns]
population violent_crime murder ... burglary theft gta
state year ...
ALABAMA 2223 4903185 510.81 7.30 ... 531.88 1886.06 256.51
ALASKA 2223 731545 867.07 9.43 ... 487.05 2066.04 357.74
ARIZONA 2223 7278717 455.31 5.01 ... 394.29 1796.86 249.37
ARKANSAS 2223 3017804 584.63 8.02 ... 599.61 2012.56 245.87
CALIFORNIA 2223 39512223 441.21 4.28 ... 386.10 1586.35 358.77
COLORADO 2223 5758736 380.95 3.79 ... 348.41 1858.26 383.99
CONNECTICUT 2223 3565287 183.60 2.92 ... 180.66 1078.65 167.28
ALABAMA 2224 4903185 510.81 7.30 ... 531.88 1886.06 256.51
ALASKA 2224 731545 867.07 9.43 ... 487.05 2066.04 357.74
ARIZONA 2224 7278717 455.31 5.01 ... 394.29 1796.86 249.37
ARKANSAS 2224 3017804 584.63 8.02 ... 599.61 2012.56 245.87
CALIFORNIA 2224 39512223 441.21 4.28 ... 386.10 1586.35 358.77
COLORADO 2224 5758736 380.95 3.79 ... 348.41 1858.26 383.99
CONNECTICUT 2224 3565287 183.60 2.92 ... 180.66 1078.65 167.28
希望有人能帮助我。 提前致谢! :)
您可以通过以下方式验证何时串联重复项:
test = pd.concat([df1, df2], verify_integrity=True)
或者您可以在之后删除重复项:
test.set_index(['state', 'year'], inplace=True).drop_duplicates(inplace=True)
可能是因为你的索引没有排序:
test = pd.concat([df1, df2]).set_index(['state', 'year']).sort_index()
print(test)
# Output
population violent_crime theft gta
state year
ALABAMA 2223 4903185 510.81 1886.06 256.51
2224 4903185 510.81 1886.06 256.51
ALASKA 2223 731545 867.07 2066.04 357.74
2224 731545 867.07 2066.04 357.74
ARIZONA 2223 7278717 455.31 1796.86 249.37
2224 7278717 455.31 1796.86 249.37
ARKANSAS 2223 3017804 584.63 2012.56 245.87
2224 3017804 584.63 2012.56 245.87
CALIFORNIA 2223 39512223 441.21 1586.35 358.77
2224 39512223 441.21 1586.35 358.77
COLORADO 2223 5758736 380.95 1858.26 383.99
2224 5758736 380.95 1858.26 383.99
CONNECTICUT 2223 3565287 183.60 1078.65 167.28
2224 3565287 183.60 1078.65 167.28