Any way to parse with BeautifulSoup while threading?
How can I multithread the parsing of the links I am scraping?
Basically I collect the links first and then parse them one by one.
Right now it is doing this:
for link in links:
    scrape_for_info(link)
links contains:
https://www.xtip.co.uk/en/?r=bets/xtra&group=476641&game=312053910
https://www.xtip.co.uk/en/?r=bets/xtra&group=476381&game=312057618
...
https://www.xtip.co.uk/en/bets/xtra.html?group=477374&game=312057263
scrape_for_info(url) looks like this:
def scrape_for_info(url):
    scrape = CP_GetOdds(url)
    for x in range(scrape.GameRange()):
        sql_str = "INSERT INTO Scraped_Odds VALUES ('"
        sql_str += str(scrape.Time()) + "', '"
        sql_str += str(scrape.Text(x)) + "', '"
        sql_str += str(scrape.HomeTeam()) + "', '"
        sql_str += str(scrape.Odds1(x)) + "', '"
        sql_str += str(scrape.Odds2(x)) + "', '"
        sql_str += str(scrape.AwayTeam()) + "')"
        cursor.execute(sql_str)
        conn.commit()
I have seen threading used when scraping websites, but mainly for fetching the pages, not for parsing them.
I hope someone can show me how to parse faster than I do now. Since I am following live odds, I have to update as quickly as possible.
Using Python.
There is a good example of this in Automate the Boring Stuff:
https://automatetheboringstuff.com/chapter15/
Basically, you use the threading module to create a separate thread for each of your urls, and then wait for all of them to finish.
import threading

def scrape_for_info(url):
    scrape = CP_GetOdds(url)
    for x in range(scrape.GameRange()):
        sql_str = "INSERT INTO Scraped_Odds VALUES ('"
        sql_str += str(scrape.Time()) + "', '"
        sql_str += str(scrape.Text(x)) + "', '"
        sql_str += str(scrape.HomeTeam()) + "', '"
        sql_str += str(scrape.Odds1(x)) + "', '"
        sql_str += str(scrape.Odds2(x)) + "', '"
        sql_str += str(scrape.AwayTeam()) + "')"
        cursor.execute(sql_str)
        conn.commit()

# Create and start the Thread objects.
threads = []
for link in links:
    # Note the trailing comma: args must be a tuple, not a bare string.
    thread = threading.Thread(target=scrape_for_info, args=(link,))
    threads.append(thread)
    thread.start()

# Wait for all threads to end.
for thread in threads:
    thread.join()
print('Done.')
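If the list of links gets long, starting one thread per url can overwhelm the site and your database connection. A minimal sketch of the same idea with a bounded pool, using the standard library's concurrent.futures and assuming the same scrape_for_info and links as above:

from concurrent.futures import ThreadPoolExecutor

# Run at most 10 scrapes at a time; raise or lower max_workers to taste.
with ThreadPoolExecutor(max_workers=10) as executor:
    executor.map(scrape_for_info, links)

Depending on the database driver, the shared conn/cursor may not be thread-safe, so you may need one connection per thread.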
With multiprocessing you could consider using a Queue.
Usually you create two kinds of jobs: one that produces urls and one that consumes them. Let's call them creator and consumer.
I assume here that there is some shared flag acting as a semaphore, called closing_condition (for example a Value), and that the methods you use to produce urls and to store them are called create_url_method and store_url respectively.
from multiprocessing import Queue, Value, Process
import queue

def creator(urls, closing_condition):
    """Parse pages and put urls in the given Queue."""
    while not closing_condition.value:
        created_urls = create_url_method()
        [urls.put(url) for url in created_urls]

def consumer(urls, closing_condition):
    """Consume urls from the given Queue."""
    while not closing_condition.value:
        try:
            store_url(urls.get(timeout=1))
        except queue.Empty:
            pass

urls = Queue()
closing_condition = Value('d', 0)
creators_number = 2
consumers_number = 2

creators = [
    Process(target=creator, args=(urls, closing_condition))
    for i in range(creators_number)
]
consumers = [
    Process(target=consumer, args=(urls, closing_condition))
    for i in range(consumers_number)
]

[p.start() for p in creators + consumers]
[p.join() for p in creators + consumers]
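One detail the snippet leaves implicit: the join calls only return once both while-loops exit, so something has to raise closing_condition. A minimal sketch, assuming the main process simply stops the workers after a fixed interval:

import time

[p.start() for p in creators + consumers]

time.sleep(60)               # let the creators and consumers run for a while
closing_condition.value = 1  # both while-loops see the flag and exit

[p.join() for p in creators + consumers]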
Thank you all for the answers!
The following did the trick:
from multiprocessing import Pool

with Pool(10) as p:
    p.map(scrape_for_info, links)
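For completeness, a slightly fuller sketch of that approach. Two details that are assumptions about a typical setup rather than part of the answer above: on platforms that spawn processes instead of forking (Windows, recent macOS) the pool has to be created under an if __name__ == '__main__': guard, and each worker process needs its own database connection, because the parent's cursor is not shared with the children.

from multiprocessing import Pool

def main():
    links = collect_links()  # hypothetical helper that builds the list of urls
    with Pool(10) as p:      # 10 worker processes, as in the snippet above
        p.map(scrape_for_info, links)

if __name__ == '__main__':
    main()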