How to find rows with most common elements in a dataframe of lists?
I have a dataframe of artists, where each artist has a list of associated genres:
   Artist  Genres
0  A       ['Pop', 'Dance Pop']
1  B       ['Rock', 'Rock n Roll']
2  C       ['Electronic']
3  D       ['Pop', 'Dance Pop', 'Electro Pop']
4  E       ['Pop']
5  F       ['Dance Pop']
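I want to build an artist recommendation system: given an artist, find the other artists that are similar to them, ranked by the number of genres they have in common.
For example, to find artists similar to A, I want output that returns a new dataframe such as:
Similar Artist to A  Similar Genres
D                    ['Pop', 'Dance Pop']
E                    ['Pop']
F                    ['Dance Pop']
Does anyone know a way to solve this?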
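import pandas as pd

def rank_artist_similarity(data, artist):
    # Row(s) for the reference artist and the set of that artist's genres.
    artist_data = data[data.Artist == artist]
    artist_genres = set(*artist_data.Genres)
    # Every other artist, with Genres reduced to the genres shared with the reference.
    similarity_data = data.drop(artist_data.index)
    similarity_data.Genres = similarity_data.Genres.apply(lambda genres: [g for g in genres if g in artist_genres])
    # Keep artists with at least one shared genre, most shared genres first.
    similarity_lengths = similarity_data.Genres.str.len()
    similarity_data = similarity_data.reindex(similarity_lengths[similarity_lengths > 0].sort_values(ascending=False).index)
    # rename() maps index labels by default, so the column mapping must be passed as columns=.
    similarity_data.rename(columns={'Artist': f'Similar Artist to {artist}', 'Genres': 'Similar Genres'}, inplace=True)
    return similarity_data

df = pd.DataFrame({'Artist': ['A', 'B', 'C', 'D', 'E', 'F'], 'Genres': [['Pop', 'Dance Pop'], ['Rock', 'Rock n Roll'], ['Electronic'], ['Pop', 'Dance Pop', 'Electro Pop'], ['Pop'], ['Dance Pop']]})
rank_artist_similarity(df, 'A')
  Similar Artist to A    Similar Genres
3                   D  [Pop, Dance Pop]
5                   F       [Dance Pop]
4                   E             [Pop]
E and F each share one genre with A, and the default sort is not guaranteed to be stable, so their relative order can differ between pandas versions.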
Try this, using explode:
search_ = "A"
# extract Genres for the artist (.iloc[0] is positional, so it works whatever the row's index label is)
genres = df.loc[df.Artist == search_, 'Genres'].iloc[0]
# transform each list of genres into one row per genre using explode.
df_explode = df.explode(column="Genres")
# use .loc to keep the artists that have at least one matching genre.
artist = (
    df_explode.loc[df_explode['Genres'].isin(genres), 'Artist'].unique().tolist()
)
print(df[df.Artist.isin(artist)])
Artist Genres
0 A [Pop, Dance Pop]
3 D [Pop, Dance Pop, Electro Pop]
4 E [Pop]
5 F [Dance Pop]
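The explode approach above returns the matching rows (including A itself) but does not rank them. Here is a minimal sketch of one way to add the ranking the question asks for, assuming the same df. The function name rank_by_shared_genres, the Shared column, and the alphabetical tie-break are my own choices for illustration, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "Artist": ["A", "B", "C", "D", "E", "F"],
    "Genres": [
        ["Pop", "Dance Pop"],
        ["Rock", "Rock n Roll"],
        ["Electronic"],
        ["Pop", "Dance Pop", "Electro Pop"],
        ["Pop"],
        ["Dance Pop"],
    ],
})


def rank_by_shared_genres(data, artist):
    """Hypothetical helper: rank the other artists by genres shared with `artist`."""
    reference = data.loc[data["Artist"] == artist, "Genres"].iloc[0]
    # One row per (artist, genre) pair; reset the index so it is unique again.
    exploded = data.explode("Genres").reset_index(drop=True)
    # Keep only genres the reference artist also has, and drop the artist itself.
    shared = exploded[exploded["Genres"].isin(reference) & (exploded["Artist"] != artist)]
    # Collect the shared genres back into one list per artist.
    ranked = (
        shared.groupby("Artist", sort=False)["Genres"]
        .agg(list)
        .reset_index(name="Similar Genres")
    )
    ranked["Shared"] = ranked["Similar Genres"].str.len()
    # Most shared genres first; ties are broken by artist name (my assumption).
    return ranked.sort_values(
        ["Shared", "Artist"], ascending=[False, True]
    ).reset_index(drop=True)


ranked = rank_by_shared_genres(df, "A")
print(ranked)
```

For A this should list D first (two shared genres), then E and F (one each).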
You could also do this without pandas, by just sorting a list:
genres = [
    ('A', {'Dance Pop', 'Pop'}),
    ('B', {'Rock', 'Rock n Roll'}),
    ('C', {'Electronic'}),
    ('D', {'Dance Pop', 'Electro Pop', 'Pop'}),
    ('E', {'Pop'}),
    ('F', {'Dance Pop'})
]
# build a sort key that counts the genres an artist shares with a reference artist
def compare_artists(ref):
    (_, genres_ref) = ref
    def compare_artists_with_ref(x):
        (_, genres_x) = x
        return len(genres_x.intersection(genres_ref))
    return compare_artists_with_ref
# Sort your list using this key function, most shared genres first
print(sorted(genres, key=compare_artists(genres[0]), reverse=True))
You get:
[('A', {'Dance Pop', 'Pop'}),
('D', {'Dance Pop', 'Electro Pop', 'Pop'}),
('E', {'Pop'}),
('F', {'Dance Pop'}),
('B', {'Rock', 'Rock n Roll'}),
('C', {'Electronic'})]
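The sorted list above still contains A itself and the artists with no genre in common. As a rough sketch, assuming the data is kept in a dict instead of a list of tuples, the same idea can be narrowed to the output the question asks for. The names artists and rank_similar are made up for this example.

```python
artists = {
    "A": {"Dance Pop", "Pop"},
    "B": {"Rock", "Rock n Roll"},
    "C": {"Electronic"},
    "D": {"Dance Pop", "Electro Pop", "Pop"},
    "E": {"Pop"},
    "F": {"Dance Pop"},
}


def rank_similar(artists, name):
    """Hypothetical helper: other artists sharing a genre with `name`, best match first."""
    reference = artists[name]
    # Shared genres per artist, skipping the reference artist and empty overlaps.
    shared = {
        other: sorted(genres & reference)
        for other, genres in artists.items()
        if other != name and genres & reference
    }
    # sorted() is stable even with reverse=True, so ties keep insertion order.
    return sorted(shared.items(), key=lambda item: len(item[1]), reverse=True)


result = rank_similar(artists, "A")
print(result)
# [('D', ['Dance Pop', 'Pop']), ('E', ['Pop']), ('F', ['Dance Pop'])]
```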
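You can use set intersection inside apply:
# genres of the reference artist (.iloc[0] takes the first match whatever its index label is)
df1 = df[df["Artist"] == "A"]["Genres"].iloc[0]
# keep rows whose genres overlap with the reference genres
df2 = df[df["Genres"].apply(lambda x: bool(set(x).intersection(df1)))]
# drop the reference artist itself
df2 = df2[df2["Artist"] != "A"]
Just wrap this in a function and pass the Artist ("A") as an argument.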
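Wrapping it in a function could look roughly like this. It is only a sketch, assuming the same Artist and Genres columns as in the question. The name similar_artists and the KeyError for an unknown artist are my own additions.

```python
import pandas as pd


def similar_artists(data: pd.DataFrame, artist: str) -> pd.DataFrame:
    """Hypothetical helper: rows of the other artists sharing a genre with `artist`."""
    matches = data.loc[data["Artist"] == artist, "Genres"]
    if matches.empty:
        # Assumption: an unknown artist is treated as an error.
        raise KeyError(f"unknown artist: {artist!r}")
    reference = set(matches.iloc[0])

    # True for rows whose genre list overlaps the reference genres.
    overlaps = data["Genres"].apply(lambda genres: bool(reference.intersection(genres)))
    # Exclude the reference artist from the result.
    return data[overlaps & (data["Artist"] != artist)]


df = pd.DataFrame({
    "Artist": ["A", "B", "C", "D", "E", "F"],
    "Genres": [
        ["Pop", "Dance Pop"],
        ["Rock", "Rock n Roll"],
        ["Electronic"],
        ["Pop", "Dance Pop", "Electro Pop"],
        ["Pop"],
        ["Dance Pop"],
    ],
})

result = similar_artists(df, "A")
print(result["Artist"].tolist())
# ['D', 'E', 'F']
```

Note that this only filters. It keeps the full genre lists and the original row order, so ranking by overlap size would still need a sort on top.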