How would one scrape <figure> tags in bs4?
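I'm trying to scrape images from https://nytimes.com, but the images for most of the main headlines on the site are stored inside <figure> tags rather than in <img> tags with a specific src attribute.
How can I scrape the URLs of the images inside those <figure> tags so that I can aggregate them on my own website?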
Because the image URLs are loaded dynamically, you can use selenium together with BeautifulSoup.
To get all the image URLs for the main headlines:
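from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
from webdriver_manager.chrome import ChromeDriverManager

data = []
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
url = 'https://www.nytimes.com/'
driver.get(url)
driver.maximize_window()
# Parse the HTML after the browser has rendered it
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
# '.css-cov0u6' is a generated class name and may change when the site is updated
for im in soup.select('.css-cov0u6 img'):
    img = im.get('src')
    if img:
        data.append(img)
        print(img)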
Output:
https://static01.nyt.com/images/2022/04/14/multimedia/14musk-twitter/14musk-twitter-threeByTwoMediumAt2X-v2.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/14/nyregion/14nyshooting/merlin_205419441_07391422-eea0-4436-97e3-c253e755010a-threeByTwoMediumAt2X.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/13/climate/00virus-case-counts1/00virus-case-counts1-threeByTwoMediumAt2X.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/01/world/00africa-france-4/merlin_188413827_06ae2d07-ecd5-4090-ba71-815f5faee66b-threeByTwoMediumAt2X.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/14/opinion/14spiers-image/14spiers-image-square320.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/14/opinion/14reinhart-main/14reinhart-main-square320.png?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/17/realestate/14HUNT-WINTHUR1/14HUNT-WINTHUR1-threeByTwoMediumAt2X.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/14/world/14japan-toddlers1/14japan-toddlers1-threeByTwoMediumAt2X-v3.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/17/magazine/17mag-studies_01/17mag-studies_01-threeByTwoMediumAt2X.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/13/opinion/13coy-image/13coy-image-square320.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/12/climate/12cli-newsletter-cup-still/12cli-newsletter-cup-still-square320.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/12/opinion/12krugman_newsletter_1/12krugman_newsletter_1-square320.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/12/opinion/12McWhorter-image/12McWhorter-image-square320.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/14/climate/14cli-cactus1/14cli-cactus1-threeByTwoMediumAt2X.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/12/well/00well-mental-apps/00well-mental-apps-threeByTwoSmallAt2X.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/12/well/12WELL-USPSTF-SCREENING2/merlin_181619943_befc32d6-5803-4885-9c6c-4369f50d80ae-threeByTwoSmallAt2X.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2021/08/10/well/06ASKWELL-ADHD1/06ASKWELL-ADHD1-threeByTwoSmallAt2X.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2019/05/07/parenting/07-parenting-postpartumdep/07-parenting-postpartumdep-threeByTwoSmallAt2X.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2017/04/11/science/physed-breathing/physed-breathing-videoSixteenByNine1050.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/17/arts/17skarsgard1/17skarsgard1-threeByTwoMediumAt2X.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/03/01/world/00Israel-Art01/merlin_202656597_dc718c26-d9ff-45c5-a300-90800c78ac10-threeByTwoSmallAt2X.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/14/fashion/14ASHLEY1/14ASHLEY1-threeByTwoSmallAt2X.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/14/arts/13Fidelio-deaf-9/17INTIMACY-BALLET-9-threeByTwoSmallAt2X.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/13/dining/08Appe1/merlin_204759306_c259077b-a1ec-47ac-bb51-4113304a3282-threeByTwoSmallAt2X.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2019/04/18/homepage/spelling-bee-logo-bulletin/spelling-bee-logo-bulletin-square320-v5.png?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/03/02/crosswords/alpha-wordle-icon-new/alpha-wordle-icon-new-square320-v2.png?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2020/03/23/crosswords/crossword-logo-nytgames-hires/crossword-logo-nytgames-hires-square320-v3.png?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2021/08/03/crosswords/nyt-games-homepage-playmodule-subscribe/nyt-games-homepage-playmodule-subscribe-square320.png?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2021/05/27/multimedia/alpha-letterboxed-promo-1622145789727/alpha-letterboxed-promo-1622145789727-square320.png?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2020/03/23/crosswords/tiles-logo-nytgames-hi-res/tiles-logo-nytgames-hi-res-square320-v4.png?format=pjpg&quality=75&auto=webp&disable=upscale
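Since the question asks specifically about <figure> tags, here is a minimal sketch that selects images by their enclosing <figure> instead of by a generated class name. The markup in SAMPLE_HTML and the helper name figure_image_urls are hypothetical and only for illustration; on the real site you would pass driver.page_source in place of the sample string. It also assumes that when an image has no src, the first entry of a srcset attribute is an acceptable fallback, and it splits srcset on commas, which is a simplification.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a rendered page (not copied from nytimes.com)
SAMPLE_HTML = """
<figure><img src="https://example.com/a.jpg" alt="a"></figure>
<figure>
  <picture>
    <source srcset="https://example.com/b-small.jpg 320w, https://example.com/b-large.jpg 1024w">
    <img alt="b">
  </picture>
</figure>
<img src="https://example.com/outside.jpg" alt="not inside a figure">
"""


def figure_image_urls(html):
    """Return image URLs found inside <figure> tags, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for figure in soup.find_all("figure"):
        # Look at both <img> and <source>, since either may carry the URL
        for tag in figure.find_all(["img", "source"]):
            src = tag.get("src")
            if not src and tag.get("srcset"):
                # srcset holds comma-separated "URL descriptor" pairs;
                # take the URL of the first pair as a fallback
                src = tag["srcset"].split(",")[0].split()[0]
            if src and src not in urls:
                urls.append(src)
    return urls


if __name__ == "__main__":
    for image_url in figure_image_urls(SAMPLE_HTML):
        print(image_url)
```

The <img> outside any <figure> is skipped, so only images attached to a figure are collected.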