Automating the download of a csv file from a random url with Python

Question

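Using Pandas in Python, I want to download a csv file from this website, but the download link contains some random characters, so I would like to know how to automate it.

This is financial trading data that is updated every day. The csv file I want to download is the one in the red square in the first row. Every day a new row is added at the top, and I want to download this csv automatically.

My plan is to import the csv into pandas automatically by building the url string dynamically from the date. An example of the url looks like this: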

https://www.jpx.co.jp/markets/derivatives/participant-volume/nlsgeu000004vd5b-att/20200731_volume_by_participant_whole_day.csv

Here is my Python script.

from datetime import datetime as dt

# Today's date as YYYYMMDD
date = dt.today().strftime('%Y%m%d')
url = 'https://www.jpx.co.jp/markets/derivatives/participant-volume/nlsgeu000004vd5b-att/%s_volume_by_participant_whole_day_J-NET.csv' % date
# Followed by pandas...

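The problem is that one part of this url (nlsgeu000004vgi7-att) is always a random sequence of characters, so I cannot really build it dynamically. For example, for 7/30 this part is nlsgeu000004vd5b-att. At least, I do not know the rule that generates this string.

Is there any way to point correctly to this kind of partially random url? I have thought of some workarounds, but I do not know how to actually implement them. It would be great if you could help! Any approach is fine as long as I can download the csv automatically.

Using a regular expression
Using a scraping tool such as BeautifulSoup to get the url of whichever csv is in the first row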
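The regular-expression idea can be sketched with only the standard library: rather than guessing the random directory, search the html of the listing page for an href that ends in the known file name. This is a minimal sketch, not a tested scraper for the real site. The sample html, the helper name find_csv_path and the directory nlsgeu000004xxxx-att in the second row are made up for illustration, and the real page's markup may differ.

```python
import re

# Hypothetical excerpt of the listing page; the real markup may differ.
# The second row's directory name is invented for this example.
SAMPLE_HTML = '''
<tr>
  <td><a href="/markets/derivatives/participant-volume/nlsgeu000004vd5b-att/20200731_volume_by_participant_whole_day.pdf">PDF</a></td>
  <td><a href="/markets/derivatives/participant-volume/nlsgeu000004vd5b-att/20200731_volume_by_participant_whole_day.csv">CSV</a></td>
</tr>
<tr>
  <td><a href="/markets/derivatives/participant-volume/nlsgeu000004xxxx-att/20200730_volume_by_participant_whole_day.csv">CSV</a></td>
</tr>
'''

# Match any href whose file name is <8 digits>_volume_by_participant_whole_day.csv,
# whatever the random directory in front of it is.
CSV_HREF = re.compile(r'href="([^"]*/(\d{8})_volume_by_participant_whole_day\.csv)"')


def find_csv_path(html, date=None):
    """Return the first matching href, optionally restricted to a YYYYMMDD date."""
    for match in CSV_HREF.finditer(html):
        path, found_date = match.groups()
        if date is None or found_date == date:
            return path
    return None


print(find_csv_path(SAMPLE_HTML))              # newest row first, if the page lists it first
print(find_csv_path(SAMPLE_HTML, "20200730"))  # a specific day
```

Because the pattern only pins down the file name, it keeps working when the random part of the path changes, but it still breaks if the file naming scheme itself changes.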

Answer 1

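I would scrape the website, as you suggested. That looks easy to do here (as long as the elements are not generated dynamically with JavaScript), and it avoids the regex problems that could show up later if your assumption about the url pattern turns out to be wrong:

Pull the html from the page with a GET request (using requests)
Use BeautifulSoup to extract the url you want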
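For anyone who cannot install third-party packages, the same two steps can be sketched with the standard library alone. This swaps requests for urllib.request and BeautifulSoup for html.parser, so it is a variation on the answer rather than the approach it recommends. The names CsvLinkCollector, extract_csv_urls and fetch_html are hypothetical, and the utf-8 decoding is an assumption about the page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

PAGE_URL = "https://www.jpx.co.jp/markets/derivatives/participant-volume/archives-01.html"


class CsvLinkCollector(HTMLParser):
    """Collect the href of every <a> tag that points at a .csv file."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".csv"):
            self.links.append(href)


def fetch_html(page_url=PAGE_URL):
    """Step 1: GET the page and decode it (utf-8 is assumed here)."""
    with urlopen(page_url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_csv_urls(html, page_url=PAGE_URL):
    """Step 2: parse the html and return absolute csv urls in page order."""
    collector = CsvLinkCollector()
    collector.feed(html)
    return [urljoin(page_url, href) for href in collector.links]


# Usage (needs network access):
#   urls = extract_csv_urls(fetch_html())
#   newest = urls[0] if urls else None
```

Keeping the fetch and the parse in separate functions makes the parsing step easy to check against a saved copy of the page.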

Answer 2

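Yes, if you do not know how the url is generated, you will need to scrape the page to find it. Here is a quick example that uses BeautifulSoup with a regex filter to find the first link on that page whose url contains volume_by_participant_whole_day.csv: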

import re
import requests
from bs4 import BeautifulSoup

base_url = "https://www.jpx.co.jp"
data = requests.get(f"{base_url}/markets/derivatives/participant-volume/archives-01.html")
parsed = BeautifulSoup(data.text, "html.parser")
link = parsed.find("a", href=re.compile(r"volume_by_participant_whole_day\.csv"))
path = link["href"]
print(f"{base_url}{path}")
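The snippet above stops at printing the url. To get from there to a DataFrame, which is what the question asks for, something along these lines might work. It is a sketch under assumptions: the helper names absolute_csv_url and load_csv are hypothetical, and the file's encoding is not known from this thread, so load_csv simply passes any extra keyword arguments through to pandas.read_csv.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.jpx.co.jp"


def absolute_csv_url(href, base_url=BASE_URL):
    """Turn the href found on the page into a full url; fail loudly if nothing was found."""
    if not href:
        raise ValueError("no csv link found on the page; has the layout changed?")
    return urljoin(base_url, href)


def load_csv(url, **read_csv_kwargs):
    """Read the csv straight from the url into a DataFrame."""
    # Imported here so the url helper can be used without pandas installed.
    import pandas as pd

    return pd.read_csv(url, **read_csv_kwargs)


# Usage (needs network access and pandas):
#   df = load_csv(absolute_csv_url(path))
#   df = load_csv(absolute_csv_url(path), encoding="...")  # if the default encoding fails
```

Checking for a missing link before building the url matters because find returns None when nothing matches, and indexing None with "href" would otherwise raise a less helpful TypeError.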

Answer 3

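I wrote some code that fetches the link to that particular file directly. I did not use any regular expression; my answer relies on the position of that element on the page, so you can get the link just by running it.

Before running the code, make sure you have the requests and BeautifulSoup modules installed.

If not, here are the installation instructions: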

# for requests module
pip install requests

# for beautifulsoup module
pip install beautifulsoup4

BeautifulSoup script

# Imports
import requests
from bs4 import BeautifulSoup as bs

# Requesting and extracting html code
html_source = requests.get('https://www.jpx.co.jp/markets/derivatives/participant-volume/archives-01.html').text

# converting html to bs4 object
soup = bs(html_source, 'html.parser')

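# finding all the table rows
trs = soup.find_all('tr')

# selecting 3rd row; iterating over a tag yields all of its children,
# including any whitespace text nodes between the cells
x = list(trs[2])

# selecting 4th cell (presumably index 7 because the whitespace nodes are counted too)
# and then 2nd item (1st item is the pdf one)
y = list(x[7])[2]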

excel_file_link = y.get('href')

print(excel_file_link)

