Can't use ThreadPoolExecutor in the right way when a function produces a single link
I've created a script using the concurrent.futures library to apply multithreading so the script executes faster. The current implementation would work if the first function in the script, get_content_url(), yielded multiple links. However, since that function produces only a single link per id, I can't figure out how to make use of concurrent.futures in this scenario.
To give you an idea of what the first function does: when I feed get_content_url() an id from the csv file, it generates a single link using the token collected from the JSON response.
How can I apply concurrent.futures within the script in the right way to make the execution faster?
I've tried:
import csv
import requests
import concurrent.futures
from bs4 import BeautifulSoup

base_link = "https://www.some_website.com/{}"
target_link = "https://www.some_website.com/{}"

def get_content_url(item_id):
    # Yields exactly one url per id: the token from the first
    # response is substituted into the target link.
    r = requests.get(base_link.format(item_id['id']))
    token = r.json()['token']
    content_url = target_link.format(token)
    yield content_url

def get_content(target_link):
    r = requests.get(target_link)
    soup = BeautifulSoup(r.text, "html.parser")
    try:
        title = soup.select_one("h1#maintitle").get_text(strip=True)
    except Exception:
        title = ""
    print(title)

if __name__ == '__main__':
    with open("IDS.csv", "r") as f:
        reader = csv.DictReader(f)
        with concurrent.futures.ThreadPoolExecutor(max_workers=6) as executor:
            for _id in reader:
                # get_content_url(_id) only ever yields one link, so each
                # pass through the loop submits a single future.
                future_to_url = {executor.submit(get_content, item): item for item in get_content_url(_id)}
                concurrent.futures.as_completed(future_to_url)
This might be a bit hard to reproduce, since I don't know what's inside IDS.csv and a working url case is missing from your question, but here's something to play with. The underlying problem with your attempt is that it still runs sequentially: each pass through the csv builds a brand-new dict holding a single future, and the as_completed() call only creates an iterator that is never consumed. The fix is to parallelize across ids rather than across links: fold the token request into the worker function, so one submitted task handles both requests for one id, and submit one task per row of the csv:
import csv
import random
import requests
import concurrent.futures
from bs4 import BeautifulSoup

base_link = "https://www.some_website.com/{}"
target_link = "https://www.some_website.com/{}"

def get_content_url(item_id):
    # First request: trade the id for a token and build the content url.
    url = base_link.format(item_id)
    print(f"Requesting {url}...")
    token = requests.get(url).json()['token']
    return target_link.format(token)

def get_content(item_id):
    # One task per id: get the url first, then scrape it. Chaining both
    # requests here is what lets the pool parallelize over ids.
    url = get_content_url(item_id)
    print(f"Fetching {url}...")
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    try:
        title = soup.select_one("h1#maintitle").get_text(strip=True)
        return title
    except Exception as exc:
        return exc

def write_fake_ids():
    # Writes a throwaway IDS.csv with 1000 random ids, for testing only.
    fake_ids = [
        {"item": "sample_item", "item_id": _} for _ in
        random.sample(range(1000, 10001), 1000)
    ]
    with open("IDS.csv", "w") as output:
        w = csv.DictWriter(output, fieldnames=fake_ids[0].keys())
        w.writeheader()
        w.writerows(fake_ids)

def get_ids():
    # Streams rows from the csv lazily.
    with open("IDS.csv") as csv_file:
        ids = csv.DictReader(csv_file)
        yield from (id_ for id_ in ids)

if __name__ == '__main__':
    with concurrent.futures.ThreadPoolExecutor(max_workers=6) as executor:
        # One future per id; the dict maps each future back to its row.
        future_to_url = {
            executor.submit(get_content, id_['item_id']): id_ for id_ in get_ids()
        }
        for future in concurrent.futures.as_completed(future_to_url):
            print(future.result())
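As an aside, since each task takes a single argument and all you do with a finished future is print its result, executor.map would work here too. A minimal sketch under the same assumptions (note that map yields results in submission order rather than completion order):

if __name__ == '__main__':
    with concurrent.futures.ThreadPoolExecutor(max_workers=6) as executor:
        # map() submits one task per id and yields the return values in order.
        for title in executor.map(get_content, (id_['item_id'] for id_ in get_ids())):
            print(title)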
I'm simulating the .csv file with write_fake_ids(). You can ignore it or delete it; it isn't called anywhere in the code.
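If you do want to run the script end to end without a real IDS.csv, one option is to call it once up front (a hypothetical one-off, not part of the answer's flow):

# One-off: generate a fake IDS.csv so get_ids() has rows to stream.
write_fake_ids()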