Classifying and counting unique elements considering various column conditions
Hello, I am using Python to classify some data:
Articles                               Filename
A New Marine Ascomycete from Brunei.   Invasive Species.csv
A new genus and four new species       Forestry.csv
A new genus and four new species       Invasive Species.csv
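I want to know how many unique "Articles" each "Filename" has, that is, articles that appear in that file and in no other.
So the output I want looks like this: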
Filename Count_Unique
Invasive Species.csv 1
Forestry.csv 0
One more thing: I would also like to get this output:
Filename1 Filename2 Count_Common articles
Forestry.csv Invasive Species.csv 1
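I concatenated the datasets and ended up counting the elements present in each "Filename".
Would anyone be willing to help? I tried unique(), drop_duplicates(), etc., but I can't seem to get the output I want.
Anyway, here are the last few lines of my code: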
concatenated = pd.concat(data, ignore_index=True)
concatenated.groupby(['Title', 'Filename']).count().reset_index()
res = {col: concatenated[col].value_counts() for col in concatenated.columns}
res['Filename']
No magic. Just some routine operations.
(1) Count the "unique" articles in each file
Edit: added (quick-and-dirty) code to include filenames with zero counts.
# drop exact duplicate rows so nothing is counted twice
df = df.drop_duplicates()
# articles to be removed (the ones that appear more than once)
dup_articles = df["Articles"].value_counts()
dup_articles = dup_articles[dup_articles > 1].index
# remove duplicate articles and count
mask_dup_articles = df["Articles"].isin(dup_articles)
df_unique = df[~mask_dup_articles]
df_unique["Filename"].value_counts()
# N.B. filenames not shown here have a count of 0;
# they are added back in the next step.
Out[68]:
Invasive Species.csv 1
Name: Filename, dtype: int64
# unique article count with zeros
df_unique_nonzero_count = df_unique["Filename"].value_counts().to_frame().reset_index()
df_unique_nonzero_count.columns = ["Filename", "count"]
df_all_filenames = pd.DataFrame(
data={"Filename": df["Filename"].unique()}
)
# join: all filenames with counted filenames
df_unique_count = df_all_filenames.merge(df_unique_nonzero_count, on="Filename", how="outer")
# postprocess
df_unique_count.fillna(0, inplace=True)
df_unique_count["count"] = df_unique_count["count"].astype(int)
# print
df_unique_count
Out[119]:
Filename count
0 Invasive Species.csv 1
1 Forestry.csv 0
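For comparison, the same count can be written more compactly. This is only a sketch under assumptions: the frame below is rebuilt from the sample rows in the question, the names pairs, n_files and count_unique are made up for illustration, and "unique" is taken to mean "appears in exactly one file".

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Articles": [
        "A New Marine Ascomycete from Brunei.",
        "A new genus and four new species",
        "A new genus and four new species",
    ],
    "Filename": ["Invasive Species.csv", "Forestry.csv", "Invasive Species.csv"],
})

# Drop exact repeats so an article listed twice in one file counts once
pairs = df.drop_duplicates(["Articles", "Filename"])

# Number of distinct files each article appears in, aligned to the rows
n_files = pairs.groupby("Articles")["Filename"].transform("nunique")

# Keep single-file articles, count them per file, and restore zero rows
count_unique = (
    pairs.loc[n_files == 1, "Filename"]
    .value_counts()
    .reindex(pairs["Filename"].unique(), fill_value=0)
    .rename_axis("Filename")
    .reset_index(name="Count_Unique")
)
print(count_unique)
```

The reindex call with fill_value=0 does the job of the merge and fillna steps above: filenames without any single-file article come back with a count of 0.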
(2) Count the common articles between files
# pick out records containing duplicate articles
df_dup = df[mask_dup_articles]
# merge on articles and then discard self- and duplicate pairs
df_merge = df_dup.merge(df_dup, on=["Articles"], suffixes=("1", "2"))
df_merge = df_merge[df_merge["Filename1"] > df_merge["Filename2"]] # alphabetical ordering
# count
df_ans2 = df_merge.groupby(["Filename1", "Filename2"]).count()
df_ans2.reset_index(inplace=True) # optional
df_ans2
Out[70]:
Filename1 Filename2 Articles
0 Invasive Species.csv Forestry.csv 1
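If the number of files is small, a set-based variant may be easier to read than the self-merge. Again a sketch, not the method above: the frame is rebuilt from the sample rows in the question, and articles_by_file, rows and common are illustrative names. It swaps the merge for itertools.combinations over per-file sets of titles.

```python
from itertools import combinations

import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Articles": [
        "A New Marine Ascomycete from Brunei.",
        "A new genus and four new species",
        "A new genus and four new species",
    ],
    "Filename": ["Invasive Species.csv", "Forestry.csv", "Invasive Species.csv"],
})

# One set of article titles per file (sets ignore repeats within a file)
articles_by_file = df.groupby("Filename")["Articles"].apply(set)

# Every unordered pair of files, in alphabetical order, with the size
# of the intersection of their article sets
rows = [
    (f1, f2, len(articles_by_file[f1] & articles_by_file[f2]))
    for f1, f2 in combinations(sorted(articles_by_file.index), 2)
]
common = pd.DataFrame(rows, columns=["Filename1", "Filename2", "Count_Common"])
print(common)
```

Unlike the merge, this lists every pair of files, including pairs with no common article (count 0), and it puts the alphabetically earlier name in Filename1, as in the desired output of the question.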