Finding specific URLs from a list of URLs using Python
I want to find out, by crawling, whether specific links exist in a list of URLs. I have written the program below and it works well, but I am stuck in two places:
- How can I read the links from a text file instead of hard-coding them in an array?
- The crawler takes almost 4 minutes to crawl 100 web pages. Is there any way to make it faster?
from bs4 import BeautifulSoup, SoupStrainer
import urllib2
import re
import threading
import time

start = time.time()

#Links I want to find
url = ["example.com/one", "example.com/two", "example.com/three"]

#Links I want to find the above links in...
url_list = ["example.com/1000", "example.com/1001", "example.com/1002",
            "example.com/1003", "example.com/1004"]

print_lock = threading.Lock()

#with open("links.txt") as f:
#    url_list1 = [url.strip() for url in f.readlines()]

def fetch_url(url):
    for line1 in url_list:
        print "Crawled" " " + line1
        try:
            html_page = urllib2.urlopen(line1)
            soup = BeautifulSoup(html_page)
            link = soup.findAll(href=True)
        except urllib2.HTTPError:
            pass
        for link1 in link:
            url1 = link1.get("href")
            for url_input in url:
                if url_input in url1:
                    with print_lock:
                        print 'Found' " " + url_input + " " 'in' + " " + line1

threads = [threading.Thread(target=fetch_url, args=(url,)) for url in url_list]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print('Entire job took:', time.time() - start)
If you want to read the links from a text file, use the code you already have commented out.
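A minimal sketch of that, assuming the links sit one per line in links.txt:

with open("links.txt") as f:
    # one URL per line; strip whitespace and skip blank lines
    url_list = [line.strip() for line in f if line.strip()]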
As for the "performance" question: your code blocks on the read call, urlopen, until the site's content comes back. Ideally you would run those requests in parallel, so you need a parallelized solution, for example one using threads.
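One way to do that with the standard library is a thread pool. The sketch below is only an illustration and assumes fetch_url is rewritten to take a single page URL and crawl just that page, rather than looping over the whole url_list itself:

from multiprocessing.dummy import Pool as ThreadPool  # thread-backed Pool from the stdlib

pool = ThreadPool(10)              # 10 worker threads fetch pages concurrently
pool.map(fetch_url, url_list)      # each worker handles one page URL at a time
pool.close()
pool.join()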
Here's an example of a different approach, using gevent (non-standard):
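This is only a rough sketch of what that could look like; the pool size, the helper name check_page, and reading the pages from links.txt are my own assumptions, not tested against your sites:

from gevent import monkey
monkey.patch_all()                     # patch sockets so urllib2 calls cooperate with gevent

from gevent.pool import Pool
import urllib2
from bs4 import BeautifulSoup

# Links I want to find (the `url` list from the question)
targets = ["example.com/one", "example.com/two", "example.com/three"]

def check_page(page_url):
    # Fetch a single page and report every target link found in it.
    try:
        html_page = urllib2.urlopen(page_url)
    except urllib2.HTTPError:
        return
    soup = BeautifulSoup(html_page)
    for tag in soup.findAll(href=True):
        href = tag.get("href")
        for target in targets:
            if target in href:
                print 'Found' + " " + target + " " + 'in' + " " + page_url

with open("links.txt") as f:           # pages to crawl, one URL per line
    url_list = [line.strip() for line in f if line.strip()]

pool = Pool(20)                        # at most 20 requests in flight at once
pool.map(check_page, url_list)

Because gevent schedules the requests cooperatively on one thread, no print lock is needed here; the trade-off is that gevent is a third-party dependency and must monkey-patch the socket module before the requests are made.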