How to find rows with most common elements in a dataframe of lists?

I have a dataframe of artists, where each artist has a list of associated genres:

    Artist         Genres
0     A      ['Pop', 'Dance Pop']
1     B      ['Rock', 'Rock n Roll']
2     C      ['Electronic']
3     D      ['Pop', 'Dance Pop', 'Electro Pop']
4     E      ['Pop']
5     F      ['Dance Pop']

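I want to build an artist recommendation system: given an artist, return the other artists that are similar to them, ranked by the number of genres they have in common.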

For example, to find artists similar to A, I would like the output to be a new dataframe such as:

Similar Artist to A      Similar Genres
         D             ['Pop','Dance Pop']
         E                   ['Pop']
         F                ['Dance Pop']

Does anyone know a way to solve this?

import pandas as pd

def rank_artist_similarity(data, artist):
    # Row for the reference artist and the set of its genres
    artist_data = data[data.Artist == artist]
    artist_genres = set(*artist_data.Genres)
    # Every other artist, with Genres reduced to the genres shared with the reference
    similarity_data = data.drop(artist_data.index)
    similarity_data.Genres = similarity_data.Genres.apply(lambda genres: list(set(genres).intersection(artist_genres)))
    # Keep artists with at least one shared genre, most shared genres first
    similarity_lengths = similarity_data.Genres.str.len()
    similarity_data = similarity_data.reindex(similarity_lengths[similarity_lengths > 0].sort_values(ascending=False).index)
    # columns= is required here; without it rename() targets the index labels
    return similarity_data.rename(columns={'Artist': f'Similar Artist to {artist}', 'Genres': 'Similar Genres'})

df = pd.DataFrame({'Artist': ['A', 'B', 'C', 'D', 'E', 'F'], 'Genres': [['Pop', 'Dance Pop'], ['Rock', 'Rock n Roll'], ['Electronic'], ['Pop', 'Dance Pop', 'Electro Pop'], ['Pop'], ['Dance Pop']]})

rank_artist_similarity(df, 'A')
  Similar Artist to A    Similar Genres
3                   D  [Pop, Dance Pop]
5                   F       [Dance Pop]
4                   E             [Pop]

Try this, using explode:

search_ = "A"

# extract the genre list for the artist (.iloc[0] is positional, so it works for any row label)
genres = df.loc[df.Artist == search_, 'Genres'].iloc[0]

# transform array of values to rows using explode.
df_explode = df.explode(column="Genres")

# use .loc to select the artists that have at least one of those genres.
artist = (
    df_explode.loc[df_explode['Genres'].isin(genres), 'Artist'].unique().tolist()
)

print(df[df.Artist.isin(artist)])

  Artist                         Genres
0      A               [Pop, Dance Pop]
3      D  [Pop, Dance Pop, Electro Pop]
4      E                          [Pop]
5      F                    [Dance Pop]
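
The result above is not ranked and still includes A itself. As a minimal sketch of one way to get a ranking from the same exploded frame, the shared genres can be grouped per artist and counted. The names `shared`, `ranked` and `n_shared` are illustrative and not part of the original answer, and the sample data assumes 'Rock' and 'Rock n Roll' are two separate genres.

```python
import pandas as pd

df = pd.DataFrame({
    'Artist': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Genres': [['Pop', 'Dance Pop'], ['Rock', 'Rock n Roll'], ['Electronic'],
               ['Pop', 'Dance Pop', 'Electro Pop'], ['Pop'], ['Dance Pop']],
})

search_ = "A"
genres = df.loc[df.Artist == search_, 'Genres'].iloc[0]

# One row per (artist, genre) pair
df_explode = df.explode("Genres")

# Keep rows of the other artists whose genre is also a genre of the searched artist
shared = df_explode[df_explode.Genres.isin(genres) & (df_explode.Artist != search_)]

# Collect the shared genres per artist, then sort by how many there are
ranked = (
    shared.groupby('Artist', sort=False)['Genres']
    .agg(list)
    .reset_index(name='Similar Genres')
)
ranked['n_shared'] = ranked['Similar Genres'].str.len()
ranked = ranked.sort_values('n_shared', ascending=False, kind='stable')

print(ranked)
```

Artists with no genre in common never appear in `shared`, so they drop out without an explicit filter.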

You could also do this without pandas, simply by sorting a list:

genres = [
 ('A', {'Dance Pop', 'Pop'}),
 ('B', {'Rock', 'Rock n Roll'}),
 ('C', {'Electronic'}),
 ('D', {'Dance Pop', 'Electro Pop', 'Pop'}),
 ('E', {'Pop'}),
 ('F', {'Dance Pop'})
]

# function to compare a reference artist with the others
def compare_artists(ref):
    (_, genres_ref) = ref
    def compare_artists_with_ref(x):
        (_, genres_x) = x
        return len(genres_x.intersection(genres_ref))
    return compare_artists_with_ref

# Sort your list based on this comparison function
print(sorted(genres, key=compare_artists(genres[0]), reverse=True))

You get:

[('A', {'Dance Pop', 'Pop'}),
 ('D', {'Dance Pop', 'Electro Pop', 'Pop'}),
 ('E', {'Pop'}),
 ('F', {'Dance Pop'}),
 ('B', {'Rock', 'Rock n Roll'}),
 ('C', {'Electronic'})]
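
The sorted list above still contains A itself and the artists that share nothing with A. A small sketch of a variant that drops both and returns only the shared genres; `similar_artists` is a hypothetical helper name, not something from the original answer.

```python
genres = [
    ('A', {'Dance Pop', 'Pop'}),
    ('B', {'Rock', 'Rock n Roll'}),
    ('C', {'Electronic'}),
    ('D', {'Dance Pop', 'Electro Pop', 'Pop'}),
    ('E', {'Pop'}),
    ('F', {'Dance Pop'}),
]

def similar_artists(ref_name, artists):
    """Return (name, shared_genres) pairs for artists sharing a genre with ref_name."""
    genres_ref = dict(artists)[ref_name]
    matches = [
        (name, genres_x & genres_ref)
        for name, genres_x in artists
        if name != ref_name and genres_x & genres_ref
    ]
    # sorted() is stable, so artists with the same count keep their input order
    return sorted(matches, key=lambda m: len(m[1]), reverse=True)

result = similar_artists('A', genres)
print(result)
```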

You can use set intersection inside apply:

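# genre list of the reference artist (.iloc[0] is positional, so it works for any row label)
df1 = df.loc[df["Artist"] == "A", "Genres"].iloc[0]
# keep rows whose genres overlap with the reference list
df2 = df[df["Genres"].apply(lambda x: bool(set(x).intersection(df1)))]
# drop the reference artist itself
df2 = df2[df2["Artist"] != "A"]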

Just wrap this in a function and pass the artist ("A") as an argument.
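
A minimal sketch of what that function could look like, assuming the `Artist` and `Genres` column names from the question. The function name `similar_artists` and the variable names are illustrative only. Like the snippet above, it filters but does not rank the matches.

```python
import pandas as pd

def similar_artists(df, artist):
    """Return the rows of df for artists sharing at least one genre with `artist`."""
    # Genres of the reference artist, as a set for fast intersection
    ref_genres = set(df.loc[df["Artist"] == artist, "Genres"].iloc[0])
    # True for every row whose genre list overlaps the reference set
    has_overlap = df["Genres"].apply(lambda x: bool(ref_genres.intersection(x)))
    # Exclude the reference artist itself
    return df[has_overlap & (df["Artist"] != artist)]

df = pd.DataFrame({
    "Artist": ["A", "B", "C", "D", "E", "F"],
    "Genres": [["Pop", "Dance Pop"], ["Rock", "Rock n Roll"], ["Electronic"],
               ["Pop", "Dance Pop", "Electro Pop"], ["Pop"], ["Dance Pop"]],
})

result = similar_artists(df, "A")
print(result)
```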