How to HTML parse a URL list using Python
I have a list of 5 URLs in a .txt file named URLlist.txt:
https://www.w3schools.com/php/php_syntax.asp
https://www.w3schools.com/php/php_comments.asp
https://www.w3schools.com/php/php_variables.asp
https://www.w3schools.com/php/php_echo_print.asp
https://www.w3schools.com/php/php_datatypes.asp
I need to parse the HTML content of all 5 URLs one by one for further processing.
My current code for parsing a single URL:
import requests
from bs4 import BeautifulSoup as bs  # HTML parsing using BeautifulSoup
r = requests.get("https://www.w3schools.com/whatis/whatis_jquery.asp")
soup = bs(r.content, "html.parser")
print(soup.prettify())
Your problem can be solved by reading the file line by line and then putting each line into your request.
Sample:
import requests
from bs4 import BeautifulSoup as bs  # HTML parsing using BeautifulSoup

f = open("URLlist.txt", "r")
for line in f:
    print(line.strip())  # CURRENT LINE
    r = requests.get(line.strip())  # strip the trailing newline before requesting
    soup = bs(r.content, "html.parser")
    print(soup.prettify())
Create a list of links:
with open('test.txt', 'r') as f:
    urls = [line.strip() for line in f]
Then you can loop over it for your parsing:
for url in urls:
    r = requests.get(url)
    ...
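For example, a minimal sketch of a complete loop, assuming you want the prettified HTML as in the question's single-URL code (the pages list is just an illustrative way to keep the parsed documents for further processing):

import requests
from bs4 import BeautifulSoup as bs

with open('URLlist.txt', 'r') as f:
    urls = [line.strip() for line in f]

pages = []  # illustrative: collect the parsed documents for further processing
for url in urls:
    r = requests.get(url)
    soup = bs(r.content, 'html.parser')  # 'html.parser' is Python's built-in parser
    pages.append(soup)
    print(soup.prettify())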
How you implement this depends on whether you need to process the URLs one at a time or whether it is better to collect everything first for later processing. Here is my suggestion: build a dictionary where each key is a URL and the associated value is the text (HTML) returned from that page. Use multithreading for efficiency.
import requests
from concurrent.futures import ThreadPoolExecutor

data = dict()

def readurl(url):
    try:
        # Fetch the page and raise on HTTP error statuses (4xx/5xx)
        (r := requests.get(url)).raise_for_status()
        data[url] = r.text  # key: the URL, value: the page's HTML
    except Exception:
        pass  # skip URLs that fail to download

def main():
    with open('urls.txt') as infile:
        with ThreadPoolExecutor() as executor:
            # Download all URLs concurrently, stripping the newline from each line
            executor.map(readurl, map(str.strip, infile.readlines()))
    print(data)

if __name__ == '__main__':
    main()
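Once data is populated, the actual HTML parsing can run as a separate pass over the dictionary; a minimal sketch, assuming a BeautifulSoup object per page is what you want (the parse_page helper is hypothetical):

from bs4 import BeautifulSoup as bs

def parse_page(url, html):
    # Hypothetical post-processing step: parse the stored HTML for one URL
    soup = bs(html, 'html.parser')
    print(url, '->', soup.title.string if soup.title else '(no title)')

for url, html in data.items():
    parse_page(url, html)

As a design note, readurl could instead return a (url, text) pair so the dictionary is built from the values executor.map yields, which avoids mutating shared state from worker threads.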