Python – Need help storing <img> src's in CSV, download images from CSV list
I need some help.
This code currently grabs the src attribute of every <img> on the desired page, stores the URLs in a CSV file (it's messy: https://i.imgur.com/w1slgf6.png), and downloads only the first image from the first URL.
That all works, but I want to download all of the photos, not just the first one. (And I'd like the code to produce a cleaner CSV file.)
Side note: I know I don't need a CSV to download the images. My goal is to store all the img URLs in a CSV, then download the images from the URLs in that CSV.
Any help is appreciated!
from bs4 import BeautifulSoup
from time import sleep
import urllib.request
import pandas as pd
import requests
import urllib
import base64
import csv
import time

# Get site
headers = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Methods': 'GET',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Max-Age': '3600',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
}

page = driver.page_source  # driver: a Selenium WebDriver instance created elsewhere
soup = BeautifulSoup(page)

# Get the src of every <img> on the site
srcs = [img['src'] for img in soup.findAll('img')]

# BELOW code = writer writes all URLs WITH a comma after them
print('Downloading URLs to file')
sleep(1)
with open('output.csv', 'w', newline='\n', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(srcs)

# Below is the code that only downloads the image from the first URL.
# I intend for the code to download all images from all URLs.
print('Downloading images to folder')
sleep(1)
filename = "output"
with open("{0}.csv".format(filename), 'r') as csvfile:
    # iterate over all lines
    i = 0
    for line in csvfile:
        splitted_line = line.split(',')
        # check if we have an image URL
        if splitted_line[1] != '' and splitted_line[1] != "\n":
            urllib.request.urlretrieve(splitted_line[1], "img_" + str(i) + ".png")
            print("Image saved for {0}".format(splitted_line[0]))
            i += 1
        else:
            print("No result for {0}".format(splitted_line[0]))
Here is a CSV-free solution:
import os
import requests
import urllib.request
from bs4 import BeautifulSoup

page = requests.get('https://igromania.ru').text
soup = BeautifulSoup(page)
tags = soup.findAll('img')

for tag in tags:
    url = tag['src']
    try:
        urllib.request.urlretrieve(url, os.path.basename(url))
        print(f'Image downloaded: {url}')
    except ValueError:
        print(f'Error downloading: {url}')
Sample output:
Error downloading: //cdn.igromania.ru/-Engine-/SiteTemplates/igromania/images/logo_mania.png
Image downloaded: https://cdn.igromania.ru/mnt/mainpage_promo/b/8/b/2904/preview/3d0a4043f5dfd3e9443ce0b27d2a8329_400x225.jpg
Image downloaded: https://cdn.igromania.ru/mnt/mainpage_promo/7/c/7/3124/preview/8df8f4505157e4928187b5450c03e82b_400x225.jpg
Image downloaded: https://cdn.igromania.ru/mnt/mainpage_promo/c/6/8/2912/preview/4a70f416181b77f6b543053ea8e5d300_400x225.jpg
Image downloaded: https://cdn.igromania.ru/mnt/mainpage_promo/2/e/0/3123/preview/0eb2f280f1b9e089d5a12bc0df1120bc_400x225.jpg
Image downloaded: https://cdn.igromania.ru/mnt/mainpage_promo/c/9/2/3130/preview/29e962c5444f67fa95b3714c7ae7683f_400x225.jpg
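The one failure above happens because that src is protocol-relative (it starts with // and has no scheme), so urlretrieve raises ValueError. As a minimal sketch (not part of the original answer), resolving each src against the page URL with urllib.parse.urljoin before downloading should handle that case:

import os
import urllib.error
import urllib.request
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base = 'https://igromania.ru'  # the page being scraped, used as the base URL
soup = BeautifulSoup(requests.get(base).text, 'html.parser')

for tag in soup.findAll('img'):
    # urljoin resolves protocol-relative ('//cdn...') and relative ('/img.png')
    # srcs against the page URL; absolute URLs pass through unchanged
    url = urljoin(base, tag['src'])
    try:
        urllib.request.urlretrieve(url, os.path.basename(url))
        print(f'Image downloaded: {url}')
    except (ValueError, urllib.error.URLError):
        print(f'Error downloading: {url}')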
Here is another solution that keeps the CSV:
from bs4 import BeautifulSoup
from time import sleep
import urllib.request
import pandas as pd
import requests
import urllib
import base64
import csv
import time

# Get site
headers = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Methods': 'GET',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Max-Age': '3600',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
}

#page = driver.page_source
page = "https://unsplash.com/"
r = requests.get(page)
soup = BeautifulSoup(r.text, "html.parser")

# Get the src of every <img> on the site
srcs = [img['src'] for img in soup.findAll('img')]

# Write one numbered URL per line so the CSV stays clean
print('Downloading URLs to file')
sleep(1)
with open('output.csv', 'w', newline='\n', encoding='utf-8') as csvfile:
    # writer = csv.writer(csvfile)
    for i, s in enumerate(srcs):  # each image number and URL
        csvfile.write(str(i) + ',' + s + '\n')

# Download every image listed in the CSV, not just the first one
print('Downloading images to folder')
sleep(1)
filename = "output"
with open("{0}.csv".format(filename), 'r') as csvfile:
    # iterate over all lines
    i = 0
    for line in csvfile:
        # strip the trailing newline so the URL is clean before splitting
        splitted_line = line.strip().split(',')
        # check if we have an image URL
        if splitted_line[1] != '':
            urllib.request.urlretrieve(splitted_line[1], "img_" + str(i) + ".png")
            print("Image saved for {0}".format(splitted_line[0]))
            i += 1
        else:
            print("No result for {0}".format(splitted_line[0]))
Output (output.csv):
0,https://sb.scorecardresearch.com/p?c1=2&c2=32343279&cv=2.0&cj=1
1,https://images.unsplash.com/photo-1597523565663-916cf059f524?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format%2Ccompress&fit=crop&w=1000&h=1000
2,https://images.unsplash.com/profile-1574526450714-e5d331168827image?auto=format&fit=crop&w=32&h=32&q=60&crop=faces&bg=fff
3,https://images.unsplash.com/photo-1599687350404-88b32c067289?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
4,https://images.unsplash.com/profile-1583427783052-3da8ceab5579image?auto=format&fit=crop&w=32&h=32&q=60&crop=faces&bg=fff
5,https://images.unsplash.com/photo-1600181957705-92f267a2740e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
6,https://images.unsplash.com/profile-1545567671893-842f479b15e2?auto=format&fit=crop&w=32&h=32&q=60&crop=faces&bg=fff
7,https://images.unsplash.com/photo-1600187723541-04457a98cc47?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
8,https://images.unsplash.com/photo-1599687350404-88b32c067289?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
9,https://images.unsplash.com/photo-1600181957705-92f267a2740e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
10,https://images.unsplash.com/photo-1600187723541-04457a98cc47?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
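As a side note, the csv module can handle both the writing and the reading, which avoids the manual split(',') and the stray newline on each URL, and quotes any field that happens to contain a comma. A minimal sketch of that round trip, under the same assumptions as the answer above (scrape the srcs from the page first):

import csv
import urllib.request

import requests
from bs4 import BeautifulSoup

page = "https://unsplash.com/"  # same target page as above
soup = BeautifulSoup(requests.get(page).text, "html.parser")
srcs = [img['src'] for img in soup.findAll('img')]

# Write one (index, url) row per image; csv.writer quotes fields as needed
with open('output.csv', 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows(enumerate(srcs))

# Read the rows back with csv.reader: no manual split, no trailing '\n'
with open('output.csv', 'r', newline='', encoding='utf-8') as f:
    for idx, url in csv.reader(f):
        if not url:
            print(f'No result for {idx}')
            continue
        try:
            urllib.request.urlretrieve(url, f'img_{idx}.png')
            print(f'Image saved for {idx}')
        except (ValueError, OSError) as e:
            print(f'Error downloading {url}: {e}')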