How to HTML parse a URL list using python

I have a list of 5 URLs in a .txt file named URLlist.txt:
https://www.w3schools.com/php/php_syntax.asp
https://www.w3schools.com/php/php_comments.asp
https://www.w3schools.com/php/php_variables.asp
https://www.w3schools.com/php/php_echo_print.asp
https://www.w3schools.com/php/php_datatypes.asp

I need to parse all the HTML content within these 5 URLs one by one for further processing.

My current code for parsing an individual URL -

import requests
from bs4 import BeautifulSoup as bs   # HTML parsing using BeautifulSoup

r = requests.get("https://www.w3schools.com/whatis/whatis_jquery.asp")
soup = bs(r.content, "html.parser")
print(soup.prettify())

Your problem can be solved by reading your file line by line and passing each line into your request. Sample:

import requests
from bs4 import BeautifulSoup as bs   # HTML parsing using BeautifulSoup

f = open("URLlist.txt", "r")
for line in f:
    url = line.strip()  # strip the trailing newline, which is not part of the URL
    print(url) # CURRENT LINE
    r = requests.get(url)
    soup = bs(r.content, "html.parser")
    print(soup.prettify())

Create a list of your links:

with open('URLlist.txt', 'r') as f:
    urls = [line.strip() for line in f]

Then you can loop over your parsing:

for url in urls:
    r = requests.get(url)
    ...
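Putting the two pieces together, a complete version of this answer might look like the sketch below. The `html.parser` backend and the `load_urls`/`parse_all` helper names are my own choices, not part of the original answer:

```python
import requests
from bs4 import BeautifulSoup as bs  # HTML parsing using BeautifulSoup

def load_urls(path):
    """Read one URL per line, stripping newlines and skipping blank lines."""
    with open(path, 'r') as f:
        return [line.strip() for line in f if line.strip()]

def parse_all(urls):
    """Fetch each URL in turn and yield (url, parsed soup) pairs."""
    for url in urls:
        r = requests.get(url)
        yield url, bs(r.content, 'html.parser')
```

Usage would then be `for url, soup in parse_all(load_urls('URLlist.txt')): print(soup.prettify())`.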

How you implement it depends on whether you need to process the URLs iteratively or whether it is better to collect everything for subsequent processing. Here is my suggestion: build a dictionary where each key is a URL and the associated value is the text (HTML) returned from the page. Use multithreading for efficiency.

import requests
from concurrent.futures import ThreadPoolExecutor

data = dict()

def readurl(url):
    # Fetch one page and store its HTML; failed requests are silently skipped
    try:
        (r := requests.get(url)).raise_for_status()
        data[url] = r.text
    except Exception:
        pass

def main():
    with open('urls.txt') as infile:
        with ThreadPoolExecutor() as executor:
            executor.map(readurl, map(str.strip, infile.readlines()))
    print(data)

if __name__ == '__main__':
    main()
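As a design variant, the worker can return its result instead of mutating a shared dictionary; `executor.map` then hands the `(url, text)` pairs back to the caller, so no state is shared between threads. A minimal sketch, assuming the same `urls.txt` file and a 10-second timeout of my own choosing:

```python
import requests
from concurrent.futures import ThreadPoolExecutor

def readurl(url):
    # Fetch one URL; return (url, html) on success, (url, None) on any request error
    try:
        r = requests.get(url, timeout=10)
        r.raise_for_status()
        return url, r.text
    except requests.RequestException:
        return url, None

def fetch_all(urls):
    # Fetch all URLs concurrently; keep only the successful responses
    with ThreadPoolExecutor() as executor:
        return {url: text
                for url, text in executor.map(readurl, urls)
                if text is not None}
```

Usage would be `fetch_all(line.strip() for line in open('urls.txt'))`, mirroring `main()` above.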