Is there any faster alternative to col.drop_duplicates()?
I'm trying to remove the duplicate data in my dataframe (loaded from a CSV) and get a separate CSV that shows each column's unique answers. The problem is that my code has been running for a day now (22 hours, to be exact), so I'm open to other suggestions.
My data has about 20,000 rows, with headers. I previously tried building the unique lists one column at a time with df[col].unique(), and that didn't take nearly as long.
df = pd.read_csv('Surveydata.csv')

df_uni = df.apply(lambda col: col.drop_duplicates().reset_index(drop=True))

df_uni.to_csv('Surveydata_unique.csv', index=False)
What I expect is a dataframe with the same set of columns but without any duplicates within each field. E.g. if df['Rmoisture'] contains some combination of Yes, No, and NaN, the same column of the other dataframe df_uni should contain only those three values.
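For reference, here is a minimal sketch of the expected input/output on a made-up two-column frame (the column names Rmoisture and Rtype are hypothetical):

import pandas as pd
import numpy as np

# Toy survey data with repeated answers in each column.
df = pd.DataFrame({
    'Rmoisture': ['Yes', 'No', np.nan, 'Yes', 'No'],
    'Rtype': ['A', 'A', 'B', 'B', 'A'],
})

df_uni = df.apply(lambda col: col.drop_duplicates().reset_index(drop=True))
print(df_uni)
#   Rmoisture Rtype
# 0       Yes     A
# 1        No     B
# 2       NaN   NaN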
If the order of values within each column doesn't matter, convert each column to a set to remove duplicates, then back to a Series, and join the results together with concat:
df1 = pd.concat({k: pd.Series(list(set(v))) for k, v in df.to_dict('list').items()}, axis=1)
If order matters:
df1 = pd.concat({col: pd.Series(df[col].unique()) for col in df.columns}, axis=1)
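A quick sanity check of both variants on a toy frame (made up here); note that pd.concat aligns on index, so columns with fewer unique values get padded with NaN:

import pandas as pd

df = pd.DataFrame({'a': [2, 1, 2, 1], 'b': ['x', 'x', 'x', 'y']})

# Set-based: duplicates removed, order not guaranteed.
print(pd.concat({k: pd.Series(list(set(v))) for k, v in df.to_dict('list').items()}, axis=1))

# unique-based: order of first appearance preserved.
print(pd.concat({col: pd.Series(df[col].unique()) for col in df.columns}, axis=1))
#    a  b
# 0  2  x
# 1  1  y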
Performance with 1k distinct values (20 rows × 2,000 columns):
np.random.seed(2019)
# 20 rows x 2k columns
df = pd.DataFrame(np.random.randint(1000, size=(20, 2000))).astype(str)
In [151]: %timeit df.apply(lambda col: col.drop_duplicates().reset_index(drop=True))
1.07 s ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [152]: %timeit pd.concat({k: pd.Series(list(set(v))) for k, v in df.to_dict('list').items()}, axis=1)
323 ms ± 2.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [153]: %timeit pd.concat({col: pd.Series(df[col].unique()) for col in df.columns}, axis=1)
430 ms ± 4.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Performance with 100 distinct values (20 rows × 2,000 columns):
df = pd.DataFrame(np.random.randint(100, size=(20, 2000))).astype(str)
In [155]: %timeit df.apply(lambda col: col.drop_duplicates().reset_index(drop=True))
1.3 s ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [156]: %timeit pd.concat({k: pd.Series(list(set(v))) for k, v in df.to_dict('list').items()}, axis=1)
544 ms ± 3.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [157]: %timeit pd.concat({col: pd.Series(df[col].unique()) for col in df.columns}, axis=1)
654 ms ± 3.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Another approach:
# Build one single-column frame of unique values per column, then join them.
new_df = [pd.DataFrame(df[i].unique(), columns=[i]) for i in df.columns]
new_df = pd.concat(new_df, axis=1)
print(new_df)
Mass Length Material Special Mark Special Num Breaking \
0 4.0 5.500000 Wood A 20.0 Yes
1 12.0 2.600000 Steel NaN NaN No
2 1.0 3.500000 Rubber B 5.5 NaN
3 15.0 6.500000 Plastic X 6.6 NaN
4 6.0 12.000000 NaN NaN 5.6 NaN
5 14.0 2.500000 NaN NaN 6.3 NaN
6 2.0 15.000000 NaN NaN NaN NaN
7 8.0 2.000000 NaN NaN NaN NaN
8 7.0 10.000000 NaN NaN NaN NaN
9 9.0 2.200000 NaN NaN NaN NaN
10 11.0 4.333333 NaN NaN NaN NaN
11 13.0 4.666667 NaN NaN NaN NaN
12 NaN 3.750000 NaN NaN NaN NaN
13 NaN 1.666667 NaN NaN NaN NaN
Comment
0 There is no heat
1 NaN
2 Contains moisture
3 Hit the table instead
4 A sign of wind
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
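Whichever variant you pick, the final step from the question applies unchanged; writing the deduplicated frame back out is just:

new_df.to_csv('Surveydata_unique.csv', index=False)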