Can't write more than one line to CSV
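I built a web scraper that extracts all the images on a web page. My code is supposed to print each img URL to standard output and write a CSV file containing all of them, but right now it writes only the last image found, along with that result's number, to the CSV file.
Here is the code I'm currently using: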
# This program prints a list of all images contained in a web page
#imports library for url/html recognition
from urllib.request import urlopen
from HW_6_CSV import writeListToCSVFile
#imports library for regular expressions
import re
#imports for later csv writing
import csv
#gets user input
address = input("Input a url for a page to get your list of image urls ex. https://www.python.org/: ")
#opens Web Page for processing
webPage = urlopen(address)
#defines encoding
encoding = "utf-8"
#defines resultList variable
resultList=[]
#sets i for later printing
i=0
#defines logic flow
for line in webPage :
    line = str(line, encoding)
    #defines imgTag
    imgTag = '<img '
    #goes to next piece of logical flow
    if imgTag in line :
        i = i+1
        srcAttribute = 'src="'
        if srcAttribute in line:
            #parses the html retrieved from user input
            m = re.search('src="(.+?)"', line)
            if m:
                reline = m.group(1)
                #prints results
                print("[ ",[i], reline , " ]")
                data = [[i, reline]]
output_file = open('examp_output.csv', 'w')
datawriter = csv.writer(output_file)
datawriter.writerows(data)
output_file.close()
webPage.close()
How can I get this program to write all of the images it finds to the CSV file?
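You only see the last result in the CSV because `data` is never properly updated within the scope of the for loop: each iteration replaces it with a new one-row list, and the file is written only once, after the loop exits. To collect all the relevant parts of the HTML in your list `data`, create the list before the loop and, inside the loop, use the list's `append` or `extend` method instead of reassigning it.
So, if you rewrite your loop as: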
img_nbr = 0 # try to avoid using `i` as the name of an index. It'll save you so much time if you ever find you need to replace this identifier with another one if you chose a better name
data = []
imgTag = '<img ' # no need to redefine this variable each time in the loop
srcAttribute = 'src="' # same comment applies here
for line in webPage:
    line = str(line, encoding)
    if imgTag in line :
        img_nbr += 1 # += saves you typing a few keystrokes and a possible future find-replace.
        #if srcAttribute in line: # this check and the next do nearly the same: get rid of one
        m = re.search('src="(.+?)"', line)
        if m:
            reline = m.group(1)
            print("[{}: {}]".format(img_nbr, reline)) # `format` is the suggested way to build strings. It's been around since Python 2.6.
            data.append((img_nbr, reline)) # This is what you really missed.
you will get better results. I added some comments with suggestions on your coding style, and removed your own comments so the new ones stand out.
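The loop above fills `data`, but the snippet stops before the file is written. As a minimal sketch of the whole flow, assuming the lines have already been decoded to `str`: the helper names `collect_image_rows`, `write_rows` and `save_rows`, and the sample HTML lines, are made up for illustration and are not part of the original code.

```python
import csv
import io
import re

IMG_TAG = '<img '
SRC_PATTERN = re.compile(r'src="(.+?)"')


def collect_image_rows(lines):
    """Return one (number, url) tuple per line that contains an <img tag with a src."""
    rows = []
    img_nbr = 0
    for line in lines:
        if IMG_TAG in line:
            img_nbr += 1
            m = SRC_PATTERN.search(line)
            if m:
                # Append inside the loop so earlier results are kept.
                rows.append((img_nbr, m.group(1)))
    return rows


def write_rows(rows, output_file):
    """Write every collected row in one call, after the loop has finished."""
    csv.writer(output_file).writerows(rows)


def save_rows(rows, path):
    """Open the file once; newline='' is what the csv docs recommend for writers."""
    with open(path, 'w', newline='', encoding='utf-8') as output_file:
        write_rows(rows, output_file)


# Hypothetical input standing in for the decoded lines of a web page.
sample_lines = [
    '<p>no image here</p>',
    '<img src="/static/logo.png" alt="logo">',
    '<img src="/static/banner.jpg">',
]
rows = collect_image_rows(sample_lines)

# Write to an in-memory buffer here so the sketch leaves no file behind.
buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())
```

In the real program, `save_rows(data, 'examp_output.csv')` would go after the `for` loop, at the same indentation level as the `for` statement, so the file is opened and written exactly once with every row.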
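However, your code still has some problems: HTML should not be parsed with regular expressions unless the source is extremely well structured (and even then...). Since you are asking for user input, they can supply any URL, and web pages are often badly structured. If you want to build a more robust web scraper, I suggest you take a look at BeautifulSoup.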
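To illustrate what a real parser buys you, here is a rough sketch that uses the standard library's `html.parser` rather than BeautifulSoup, only so that it runs without installing anything. The class name `ImgSrcCollector` and the sample HTML are assumptions for the example. It picks up tags that span several lines, use single quotes, or are written in upper case, all of which the line-by-line regular expression would miss.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser lowercases
        # tag and attribute names before calling this method.
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)


# Hypothetical page content: a tag split across lines, single quotes,
# an upper-case tag, and an image without a src attribute.
sample_html = """
<div><img alt="two tags"
     src='/a.png'><IMG SRC="/b.png"></div>
<img alt="no source">
"""

collector = ImgSrcCollector()
collector.feed(sample_html)
collector.close()

# Number the results the same way the loop above does, ready for csv.writer.
rows = list(enumerate(collector.sources, start=1))
print(rows)
```

With BeautifulSoup installed, the same idea would likely be expressed by calling `find_all('img')` on the parsed document and reading each tag's `src`, which also copes better with malformed markup.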