从网络中删除字符 link 然后下载
remove characters from web link then download
我目前有大约 1000 个 Web link 到 excel 文件要下载。文档名称中没有任何模式,所以我刚刚抓取了所有网络 link,其中一些如下所示。
VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0011/172775/Market_Information_System_Control_daily_trading_day_190130.xlsx
VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0004/172732/Market_Information_System_Control_daily_trading_day_190129.xlsx
VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0010/172675/Market_Information_System_Control_daily_trading_day_190128.xlsx
VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0009/172674/Market_Information_System_Control_daily_trading_day_190127.xlsx
VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0008/172673/Market_Information_System_Control_daily_trading_day_190126.xlsx
VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0007/172672/Market_Information_System_Control_daily_trading_day_190125.xlsx
VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0011/172595/Market_Information_System_Control_daily_trading_day_190124.xlsx
其中一个主要问题是每个 link 的开头都有 VM300:1
,这不是 link 的一部分。我怎样才能从每个 link 开始删除这个 'VM300:1',大约有 1000link 秒,所以手动操作是不可行的。
修复该错误后,我下载文件的代码仍然不起作用?
这是我当前的代码:
import urllib2
urlfiles = ['https://www.powerwater.com.au__data/assets/excel_doc/0011/172775/Market_Information_System_Control_daily_trading_day_190130.xlsx',
'https://www.powerwater.com.au__data/assets/excel_doc/0004/172732/Market_Information_System_Control_daily_trading_day_190129.xlsx',
'https://www.powerwater.com.au__data/assets/excel_doc/0010/172675/Market_Information_System_Control_daily_trading_day_190128.xlsx']
urllib2.urlopen(urlfiles)
如有任何帮助,我们将不胜感激。
您可以根据 space </code> 拆分 url,如下所示:</p>
<pre><code>>>> urls = [
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0011/172775/Market_Information_System_Control_daily_trading_day_190130.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0004/172732/Market_Information_System_Control_daily_trading_day_190129.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0010/172675/Market_Information_System_Control_daily_trading_day_190128.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0009/172674/Market_Information_System_Control_daily_trading_day_190127.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0008/172673/Market_Information_System_Control_daily_trading_day_190126.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0007/172672/Market_Information_System_Control_daily_trading_day_190125.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0011/172595/Market_Information_System_Control_daily_trading_day_190124.xlsx"
]
>>>
>>> urlfiles = [url.split()[1] for url in urls ]
>>> print(urlfiles)
['https://www.powerwater.com.au__data/assets/excel_doc/0011/172775/Market_Information_System_Control_daily_trading_day_190130.xlsx',
'https://www.powerwater.com.au__data/assets/excel_doc/0004/172732/Market_Information_System_Control_daily_trading_day_190129.xlsx',
'https://www.powerwater.com.au__data/assets/excel_doc/0010/172675/Market_Information_System_Control_daily_trading_day_190128.xlsx',
'https://www.powerwater.com.au__data/assets/excel_doc/0009/172674/Market_Information_System_Control_daily_trading_day_190127.xlsx',
'https://www.powerwater.com.au__data/assets/excel_doc/0008/172673/Market_Information_System_Control_daily_trading_day_190126.xlsx',
'https://www.powerwater.com.au__data/assets/excel_doc/0007/172672/Market_Information_System_Control_daily_trading_day_190125.xlsx',
'https://www.powerwater.com.au__data/assets/excel_doc/0011/172595/Market_Information_System_Control_daily_trading_day_190124.xlsx']
除此之外,您需要遍历 urlfiles
中的每个 url 才能像这样打开它:
>>> import urllib2
>>>
>>> for url in urlfiles:
... urllib2.urlopen(url)
如果您的链接字面上都以 'VM300:1 '
开头,则需要将其删除,您最好忽略前 8 个字符,甚至不用正则表达式。
至于下载所有这些文件,假设没有基于 cookie、会话等的限制并使用 Python 3:
import urllib.request
from pathlib import Path
urls = [
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0011/172775/Market_Information_System_Control_daily_trading_day_190130.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0004/172732/Market_Information_System_Control_daily_trading_day_190129.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0010/172675/Market_Information_System_Control_daily_trading_day_190128.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0009/172674/Market_Information_System_Control_daily_trading_day_190127.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0008/172673/Market_Information_System_Control_daily_trading_day_190126.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0007/172672/Market_Information_System_Control_daily_trading_day_190125.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0011/172595/Market_Information_System_Control_daily_trading_day_190124.xlsx"
]
for url in urls:
urllib.request.urlretrieve(url=url[8:], filename=Path(url).name)
我听说了 python 但没有提到 requests
?
from pathlib import Path
import requests
urls = [
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0011/172775/Market_Information_System_Control_daily_trading_day_190130.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0004/172732/Market_Information_System_Control_daily_trading_day_190129.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0010/172675/Market_Information_System_Control_daily_trading_day_190128.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0009/172674/Market_Information_System_Control_daily_trading_day_190127.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0008/172673/Market_Information_System_Control_daily_trading_day_190126.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0007/172672/Market_Information_System_Control_daily_trading_day_190125.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0011/172595/Market_Information_System_Control_daily_trading_day_190124.xlsx"
]
for url in urls:
link = url.split()[1]
r = requests.get(link)
with open(Path(link).name, 'wb') as f:
f.write(r.content)
我目前有大约 1000 个 Web link 到 excel 文件要下载。文档名称中没有任何模式,所以我刚刚抓取了所有网络 link,其中一些如下所示。
VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0011/172775/Market_Information_System_Control_daily_trading_day_190130.xlsx
VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0004/172732/Market_Information_System_Control_daily_trading_day_190129.xlsx
VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0010/172675/Market_Information_System_Control_daily_trading_day_190128.xlsx
VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0009/172674/Market_Information_System_Control_daily_trading_day_190127.xlsx
VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0008/172673/Market_Information_System_Control_daily_trading_day_190126.xlsx
VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0007/172672/Market_Information_System_Control_daily_trading_day_190125.xlsx
VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0011/172595/Market_Information_System_Control_daily_trading_day_190124.xlsx
其中一个主要问题是每个 link 的开头都有 VM300:1
,这不是 link 的一部分。我怎样才能从每个 link 开始删除这个 'VM300:1',大约有 1000link 秒,所以手动操作是不可行的。
修复该错误后,我下载文件的代码仍然不起作用?
这是我当前的代码:
import urllib2
urlfiles = ['https://www.powerwater.com.au__data/assets/excel_doc/0011/172775/Market_Information_System_Control_daily_trading_day_190130.xlsx',
'https://www.powerwater.com.au__data/assets/excel_doc/0004/172732/Market_Information_System_Control_daily_trading_day_190129.xlsx',
'https://www.powerwater.com.au__data/assets/excel_doc/0010/172675/Market_Information_System_Control_daily_trading_day_190128.xlsx']
urllib2.urlopen(urlfiles)
如有任何帮助,我们将不胜感激。
您可以根据 space </code> 拆分 url,如下所示:</p>
<pre><code>>>> urls = [
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0011/172775/Market_Information_System_Control_daily_trading_day_190130.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0004/172732/Market_Information_System_Control_daily_trading_day_190129.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0010/172675/Market_Information_System_Control_daily_trading_day_190128.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0009/172674/Market_Information_System_Control_daily_trading_day_190127.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0008/172673/Market_Information_System_Control_daily_trading_day_190126.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0007/172672/Market_Information_System_Control_daily_trading_day_190125.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0011/172595/Market_Information_System_Control_daily_trading_day_190124.xlsx"
]
>>>
>>> urlfiles = [url.split()[1] for url in urls ]
>>> print(urlfiles)
['https://www.powerwater.com.au__data/assets/excel_doc/0011/172775/Market_Information_System_Control_daily_trading_day_190130.xlsx',
'https://www.powerwater.com.au__data/assets/excel_doc/0004/172732/Market_Information_System_Control_daily_trading_day_190129.xlsx',
'https://www.powerwater.com.au__data/assets/excel_doc/0010/172675/Market_Information_System_Control_daily_trading_day_190128.xlsx',
'https://www.powerwater.com.au__data/assets/excel_doc/0009/172674/Market_Information_System_Control_daily_trading_day_190127.xlsx',
'https://www.powerwater.com.au__data/assets/excel_doc/0008/172673/Market_Information_System_Control_daily_trading_day_190126.xlsx',
'https://www.powerwater.com.au__data/assets/excel_doc/0007/172672/Market_Information_System_Control_daily_trading_day_190125.xlsx',
'https://www.powerwater.com.au__data/assets/excel_doc/0011/172595/Market_Information_System_Control_daily_trading_day_190124.xlsx']
除此之外,您需要遍历 urlfiles
中的每个 url 才能像这样打开它:
>>> import urllib2
>>>
>>> for url in urlfiles:
... urllib2.urlopen(url)
如果您的链接字面上都以 'VM300:1 '
开头,则需要将其删除,您最好忽略前 8 个字符,甚至不用正则表达式。
至于下载所有这些文件,假设没有基于 cookie、会话等的限制并使用 Python 3:
import urllib.request
from pathlib import Path
urls = [
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0011/172775/Market_Information_System_Control_daily_trading_day_190130.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0004/172732/Market_Information_System_Control_daily_trading_day_190129.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0010/172675/Market_Information_System_Control_daily_trading_day_190128.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0009/172674/Market_Information_System_Control_daily_trading_day_190127.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0008/172673/Market_Information_System_Control_daily_trading_day_190126.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0007/172672/Market_Information_System_Control_daily_trading_day_190125.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0011/172595/Market_Information_System_Control_daily_trading_day_190124.xlsx"
]
for url in urls:
urllib.request.urlretrieve(url=url[8:], filename=Path(url).name)
我听说了 python 但没有提到 requests
?
from pathlib import Path
import requests
urls = [
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0011/172775/Market_Information_System_Control_daily_trading_day_190130.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0004/172732/Market_Information_System_Control_daily_trading_day_190129.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0010/172675/Market_Information_System_Control_daily_trading_day_190128.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0009/172674/Market_Information_System_Control_daily_trading_day_190127.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0008/172673/Market_Information_System_Control_daily_trading_day_190126.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0007/172672/Market_Information_System_Control_daily_trading_day_190125.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0011/172595/Market_Information_System_Control_daily_trading_day_190124.xlsx"
]
for url in urls:
link = url.split()[1]
r = requests.get(link)
with open(Path(link).name, 'wb') as f:
f.write(r.content)