Python writing only 1 line for CSV file
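I apologize for asking this again, but it still hasn't been resolved.
It isn't a very complicated problem, and I'm sure the fix is fairly simple, but I just can't see what is wrong.
My code for parsing the XML file opens the file and reads it in the format I want; the print statement in the final for loop confirms this.
For example, it outputs: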
Pivoting support handle D0584129 20090106 US
Hinge D0584130 20090106 US
Deadbolt turnpiece D0584131 20090106 US
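This is exactly how I want the data written to the CSV file. However, when I try to actually write these as rows to the CSV itself, only one of the last rows from the XML file ends up in it, like this: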
Flashlight package,D0584138,20090106,US
Here is my entire code, since it may help in understanding the whole process. The area of interest starts at the line for xml_string in separated_xml(infile):
from bs4 import BeautifulSoup
import csv
import unicodecsv as csv
infile = "C:\Users\Grisha\Documents\Inventor\2009_Data\Jan\ipg090106.xml"
# The first line of code defines a function "separated_xml" that will allow us to separate, read, and then finally parse the data of interest with
def separated_xml(infile): # Defining the data reading function for each xml section - This breaks apart the xml from the start (root element <?xml...) to the next iteration of the root element
    file = open(infile, "r") # Used to open the xml file
    buffer = [file.readline()] # Used to read each line and placing inside vector
    # The first for-loop is used to slice every section of the USPTO XML file to be read and parsed individually
    # It is necessary because Python wishes to read only one instance of a root element but this element is found many times in each file which causes reading errors
    for line in file: # Running for-loop for the opened file and searches for root elements
        if line.startswith("<?xml "):
            yield "".join(buffer) # 1) Using "yield" allows to generate one instance per run of a root element and 2) .join takes the list (vector) "buffer" and connects an empty string to it
            buffer = [] # Creates a blank list to store the beginning of a new 'set' of data in beginning with the root element
        buffer.append(line) # Passes lines into list
    yield "".join(buffer) # Outputs
    file.close()
# The second nested set of for-loops are used to parse the newly reformatted data into a new list
for xml_string in separated_xml(infile): # Calls the output of the separated and read file to parse the data
    soup = BeautifulSoup(xml_string, "lxml") # BeautifulSoup parses the data strings where the XML is converted to Unicode
    pub_ref = soup.findAll("publication-reference") # Beginning parsing at every instance of a publication
    lst = [] # Creating empty list to append into
    with open('./output.csv', 'wb') as f:
        writer = csv.writer(f, dialect = 'excel')
        for info in pub_ref: # Looping over all instances of publication
            # The final loop finds every instance of invention name, patent number, date, and country to print and append into
            for inv_name, pat_num, date_num, country in zip(soup.findAll("invention-title"), soup.findAll("doc-number"), soup.findAll("date"), soup.findAll("country")):
                print(inv_name.text, pat_num.text, date_num.text, country.text)
                lst.append((inv_name.text, pat_num.text, date_num.text, country.text))
                writer.writerow([inv_name.text, pat_num.text, date_num.text, country.text])
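I have also tried moving open and writer outside the for loops to find where the problem is, but to no avail. I know the file is writing just 1 row at a time and overwriting the same row over and over (which is why only 1 row remains in the CSV file); I just can't see where it happens.
Thank you very much in advance for your help.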
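I believe (as a first working theory, anyway) that the root of your problem is that your with open statement sits inside your for loop and uses the "wb" mode, which overwrites the file if it already exists. This means that every time your for loop runs, it overwrites everything that was there before, and you are left with only a single line of output once it finishes.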
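To make the overwrite concrete, here is a minimal sketch of the same pattern, assuming Python 3 (where the csv module expects a text-mode file opened with newline="" rather than the "wb" mode used in the question). The helper names and the temporary output path are made up for illustration; the rows are the ones printed in the question.

```python
import csv
import os
import tempfile

rows = [
    ("Pivoting support handle", "D0584129", "20090106", "US"),
    ("Hinge", "D0584130", "20090106", "US"),
    ("Deadbolt turnpiece", "D0584131", "20090106", "US"),
]

def write_reopening_each_time(path, rows):
    """Mimic the bug: reopen the file in write mode for every row."""
    for row in rows:
        # "w" truncates the file each time it is opened,
        # so anything written on an earlier pass is discarded.
        with open(path, "w", newline="") as f:
            csv.writer(f).writerow(row)

def read_rows(path):
    """Read the CSV back as a list of rows."""
    with open(path, newline="") as f:
        return list(csv.reader(f))

path = os.path.join(tempfile.mkdtemp(), "output.csv")
write_reopening_each_time(path, rows)
print(read_rows(path))
# [['Deadbolt turnpiece', 'D0584131', '20090106', 'US']]
```

Three rows go in, but only the last one survives, which matches the symptom described in the question.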
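I see two ways you could handle this. The more correct way is to move the file open statement outside the outermost for loop. I know you mentioned that you already tried this, but the devil is in the details. Your updated code would then look something like this: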
with open('./output.csv', 'wb') as f:
    writer = csv.writer(f, dialect='excel')
    for xml_string in separated_xml(infile):
        soup = BeautifulSoup(xml_string, "lxml")
        pub_ref = soup.findAll("publication-reference")
        lst = []
        for info in pub_ref:
            for inv_name, pat_num, date_num, country in zip(soup.findAll("invention-title"), soup.findAll("doc-number"), soup.findAll("date"), soup.findAll("country")):
                print(inv_name.text, pat_num.text, date_num.text, country.text)
                lst.append((inv_name.text, pat_num.text, date_num.text, country.text))
                writer.writerow([inv_name.text, pat_num.text, date_num.text, country.text])
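If you want to check the open-once pattern in isolation, without the XML file, something like the following sketch may help. It assumes Python 3 (text mode with newline="" instead of "wb"), and the batches list is a hypothetical stand-in for the rows you would pull out of each parsed document; write_all_rows is likewise just an illustrative name.

```python
import csv
import os
import tempfile

# Hypothetical stand-in for the parsed XML: one list of rows per document.
batches = [
    [("Pivoting support handle", "D0584129", "20090106", "US")],
    [("Hinge", "D0584130", "20090106", "US")],
    [("Deadbolt turnpiece", "D0584131", "20090106", "US")],
]

def write_all_rows(out_path, batches):
    """Open the CSV once, then write every row from every batch."""
    count = 0
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f, dialect="excel")
        for rows in batches:      # outer loop: one batch per document
            for row in rows:      # inner loop: rows found in that document
                writer.writerow(row)
                count += 1
    return count

out_path = os.path.join(tempfile.mkdtemp(), "output.csv")
written = write_all_rows(out_path, batches)

# Read the file back to confirm that nothing was overwritten.
with open(out_path, newline="") as f:
    saved = list(csv.reader(f))
print(written, len(saved))
# 3 3
```

Because the file is opened a single time, the truncation that "w" performs happens only once, before the first row is written.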
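The hacky but quicker and easier way is to simply change the mode in your open call to "ab" (append, binary) rather than "wb" (write, binary, which overwrites any existing data). This is far less efficient, since you are still reopening the file on every pass through the for loop, but it will probably work.
Hope that helps!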
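One caveat with append mode is worth seeing in code: rows also pile up across separate runs of the script. The sketch below assumes Python 3 (so "a" with newline="" instead of "ab"), and the function names and temporary path are illustrative only.

```python
import csv
import os
import tempfile

rows = [
    ("Hinge", "D0584130", "20090106", "US"),
    ("Deadbolt turnpiece", "D0584131", "20090106", "US"),
]

def append_rows(csv_path, rows):
    """Reopen the file in append mode for every row, as in the quick fix."""
    for row in rows:
        with open(csv_path, "a", newline="") as f:
            csv.writer(f).writerow(row)

def count_rows(csv_path):
    """Count the rows currently stored in the CSV."""
    with open(csv_path, newline="") as f:
        return sum(1 for _ in csv.reader(f))

csv_path = os.path.join(tempfile.mkdtemp(), "output.csv")

append_rows(csv_path, rows)
first_run = count_rows(csv_path)

append_rows(csv_path, rows)   # simulates running the script a second time
second_run = count_rows(csv_path)

print(first_run, second_run)
# 2 4
```

So if you go the append route, you would probably want to delete or truncate output.csv before each run, otherwise old rows stay in the file alongside the new ones.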
with open('./output.csv', 'wb') as f:
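Just change 'wb' -> 'ab' so that it doesn't overwrite.
That didn't work the first time, but moving the open call to before the last 2 loops solved the problem. Thanks to everyone who helped.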