Difficulties with web scraping
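I just came across an article called The 500 Greatest Songs of All Time and thought, "Oh, cool, I bet they also made a Spotify/Apple Music playlist I can follow." Well... they didn't.
In short, I'd like to know whether it is possible to 1) scrape the website to extract the songs, and 2) then bulk-upload them to Spotify to create a playlist.
The song titles and artists are structured on the website as follows: Website screenshot. I have already tried scraping the site with the importxml() formula in Google Sheets, but without success.
I understand the scraping part is easier than the rest, and since I'm new to programming I'd be happy to achieve even part of this goal. I believe this task can be done easily in Python.
I feel that explaining everything is beyond the scope here, so I tried to comment the code well enough.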
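1. Scraping the songs
I used python3 and selenium; their website doesn't block it.
If necessary, be sure to adjust your chromedriver path and the output path of the .txt file at the bottom. Once it has finished and you have the .txt file, you can close it.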
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service

# adjust this to your own chromedriver path
service = Service(r'/Users/main/Desktop/chromedriver')
driver = webdriver.Chrome(service=service)

# just setting some vars, I used XPath because that's what I know
top_500 = 'https://www.rollingstone.com/music/music-lists/best-songs-of-all-time-1224767/'
cookie_button_xpath = "// button [@id = 'onetrust-accept-btn-handler']"
div_containing_links_xpath = "// div [@id = 'pmc-gallery-list-nav-bar-render'] // child :: a"
song_names_xpath = "// article [@class = 'c-gallery-vertical-album'] / child :: h2"

links = []
songs = []

driver.get(top_500)

# accept cookies, give time to load
time.sleep(3)
cookie_btn = driver.find_element(By.XPATH, cookie_button_xpath)
cookie_btn.click()
time.sleep(1)

# extracting all the links since there are only 50 songs per page
links_to_next_pages = driver.find_elements(By.XPATH, div_containing_links_xpath)
for element in links_to_next_pages:
    links.append(element.get_attribute('href'))

# extracting the songs, then going to next page and so on until we hit 500
counter = 1  # we're starting with 1 here since links[0] is the current page we are already on
while True:
    song_elements = driver.find_elements(By.XPATH, song_names_xpath)
    for element in song_elements:
        songs.append(element.text)
    # stop at 500 songs, or when there are no more pages left to visit
    if len(songs) >= 500 or counter >= len(links):
        break
    driver.get(links[counter])
    counter += 1
    time.sleep(2)

# verify that there are no duplicates, if there were, something would be off
if len(songs) != len(set(songs)):
    print('you f***** up')
else:
    print('seems fine')

# adjust this to your own output path
with open('/Users/main/Desktop/output_songs.txt', 'w') as file:
    file.writelines(line + '\n' for line in songs)
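The duplicate check at the end of the scraper only tells you that something is off, not which entries are affected. A minimal sketch of a helper that lists them; `find_duplicates` is a hypothetical name that is not part of the script above, and the sample list is made up for illustration:

```python
from collections import Counter


def find_duplicates(entries):
    """Return the entries that occur more than once, in first-seen order."""
    counts = Counter(entries)
    # Counter keeps insertion order, so the result follows the input order.
    return [entry for entry, count in counts.items() if count > 1]


# Hypothetical sample data in the "Artist, 'Title'" shape the second script expects.
sample = ["Artist A, 'Song One'", "Artist B, 'Song Two'", "Artist A, 'Song One'"]
print(find_duplicates(sample))
```

Calling `find_duplicates(songs)` before writing the file would show whether a page was scraped twice.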
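2. Preparing Spotify
- Go to the Spotify Developer Dashboard and create an account (using your Spotify account). Then create an app; call it whatever you like.
- In your app, click Settings and whitelist http://localhost:8888/callback
- In your app, click "Users and Access" and add your Spotify account
- Leave the tab open, we'll come back to it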
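3. Preparing your environment
- You need Node.js, so make sure it is installed on your machine
- Download this GitHub repo from Spotify
- Unzip it, cd into the folder and run npm install
- Go into the authorization_code folder and open app.js in an editor
- Find var scope and append "playlist-modify-public" to the string so that your app can access your Spotify playlists, see here
- Now go back to your app in the Spotify Developer Dashboard; we need to copy the client ID and the client secret into var client_id and var client_secret respectively (in the app.js file). var redirect_uri will be http://localhost:8888/callback - don't forget to save your changes.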
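4. Running the Spotify side
- cd into the authorization_code folder and run app.js with node app.js (this is basically a server running on your PC)
- Now, if that works, leave it running and go to http://localhost:8888, then authorize your Spotify account there
- Copy the complete token, including the part that overflows; use inspect element to get it
- Adjust the user_id and auth variables as well as the path to output_songs.txt (in the open() call) in the Python script below and run it. Songs that were not found are printed at the end; search for those with Google. They are usually on Spotify too, but Google seems to have the better search algorithm (surprised Pikachu face).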
import requests
import re
import json

# this is NOT your display name, it's your user name!!
user_id = 'YOUR_USERNAME'
# paste your auth token from spotify; it can time out, then you have to get a new one, so don't panic if you get a bunch of responses in the 400s after some time
auth = {"Authorization": "Bearer YOUR_AUTH_KEY_FROM_LOCALHOST"}

playlist = []
err_log = []
base_url = 'https://api.spotify.com/v1'
search_method = '/search'

with open('/Users/main/Desktop/output_songs.txt', 'r') as file:
    songs = file.readlines()

# this queries spotify, does some magic and then appends the track's spotify uri to an array
def query_song_uris():
    for n, entry in enumerate(songs):
        entry = entry.strip()  # drop the trailing newline left by readlines()
        url = base_url + search_method
        try:
            # entries look like: Artist, 'Title'
            x = re.findall(r"'([^']*)'", entry)
            title = x[0]
            title_len = len(entry) - len(title) - 4
            artist = entry[:title_len]
            # note: Spotify documents the track:/artist: filters as part of the q string;
            # the two extra keys below are kept from the original script
            payload = {
                'q': (entry),
                'track:': (title),
                'artist:': (artist),
                'type': 'track',
                'limit': 1
            }
            r = requests.get(url, params=payload, headers=auth)
            print('\nquerying spotify; ', r)
            c = r.content.decode('UTF-8')
            dic = json.loads(c)
            track_uri = dic["tracks"]["items"][0]["uri"]
            playlist.append(track_uri)
            print(track_uri)
        except Exception:
            # anything that could not be parsed or found ends up in the error log
            err = f'\nNr. {(len(songs)-n)}: ' + f'{entry}'
            err_log.append(err)
    # the list was scraped from 500 down to 1, so flip it
    playlist.reverse()

query_song_uris()

# creates a playlist and returns playlist id
def create_playlist():
    payload = {
        "name": "Rolling Stone: Top 500 (All Time)",
        "description": "music for old men xD with occasional hip hop appearances. just kidding"
    }
    url = base_url + f'/users/{user_id}/playlists'
    r = requests.post(url, headers=auth, json=payload)
    c = r.content.decode('UTF-8')
    dic = json.loads(c)
    print(f'\n\ncreating playlist @{dic["id"]}; ', r)
    return dic["id"]

# adds the collected tracks to the new playlist, 100 at a time
def add_to_playlist():
    playlist_id = create_playlist()
    while True:
        if len(playlist) > 100:
            p = playlist[:100]
        else:
            p = playlist
        payload = {"uris": (p)}
        url = base_url + f'/playlists/{playlist_id}/tracks'
        r = requests.post(url, headers=auth, json=payload)
        print(f'\nadding {len(p)} songs to playlist; ', r)
        del playlist[:len(p)]
        if len(playlist) == 0:
            break

add_to_playlist()

print('\n\ncheck your spotify :)')
print("\n\n\nthese tracks didn't make it, check manually:\n")
for line in err_log:
    print(line)
print('\n\n')
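If the search misses too many songs, one thing that might help is putting the track: and artist: field filters inside the q string, which is where Spotify's search syntax expects them. The sketch below is not the script above; `parse_entry`, `build_search_params` and `batches` are hypothetical helper names, and the input format "Artist, 'Title'" is assumed from what the script parses. It only builds the request parameters and the 100-item batches, it does not call the API:

```python
import re


def parse_entry(entry):
    """Split a line like "Artist, 'Title'" into (artist, title)."""
    entry = entry.strip()
    # The title group is greedy so apostrophes inside the title stay in it.
    match = re.match(r"^(.*?),\s*'(.*)'$", entry)
    if match is None:
        raise ValueError(f"unexpected line format: {entry!r}")
    return match.group(1), match.group(2)


def build_search_params(artist, title):
    """Build search parameters with the field filters inside q."""
    return {"q": f"track:{title} artist:{artist}", "type": "track", "limit": 1}


def batches(items, size=100):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical sample line, not taken from the real list.
artist, title = parse_entry("Artist A, 'Song One'\n")
print(build_search_params(artist, title))
```

The parameters could be passed to requests.get(url, params=...) exactly as in the script, and `batches(playlist)` could stand in for the manual slicing in add_to_playlist(). Whether the stricter query finds more or fewer tracks than the plain text search is something you would have to try.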
Done
If you don't want to run the code yourself, here is the playlist:
https://open.spotify.com/playlist/5fdLKYNFlA4XSvhEl36KXS
If you run into problems, everything from step 2 onward is also described here in the Web API quick start, or in general in the Web API docs.
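About Apple Music
So Apple seems to be pretty closed off (surprise, haha). However, I found that you can query the iTunes Store. The response also contains a direct link to the song on Apple Music.
You may be able to take it from there.
Get ISRC code from iTunes Search API (Apple music)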
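As a rough starting point for the Apple side, here is a sketch of how a lookup against the iTunes Search API could be prepared. The function names are hypothetical, the choice of `entity=song` and `limit=1` is my assumption about what fits this use case, and the sample response is made up; only the URL building and the response handling are shown, no request is sent:

```python
from urllib.parse import urlencode

ITUNES_SEARCH_URL = "https://itunes.apple.com/search"


def build_itunes_search_url(artist, title):
    """Build a search URL for one song (hypothetical helper)."""
    params = {"term": f"{artist} {title}", "entity": "song", "limit": 1}
    return f"{ITUNES_SEARCH_URL}?{urlencode(params)}"


def extract_track_url(response_json):
    """Return the link of the first result, or None if nothing was found."""
    results = response_json.get("results", [])
    return results[0].get("trackViewUrl") if results else None


# Made-up response in the shape the API returns (a "results" list of dicts).
fake_response = {"resultCount": 1, "results": [{"trackViewUrl": "https://example.com/song"}]}
print(build_itunes_search_url("Artist A", "Song One"))
print(extract_track_url(fake_response))
```

The URL could be fetched with requests.get(...).json() and the result passed to `extract_track_url`.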
PS: Admittedly, regex is witchcraft, but you guys have my back