CSV creating new lines when it shouldn't
I'm currently working on a personal project that involves scraping this specific website.
This is what my code looks like at the moment:
    # ureq, soup and f are defined earlier (not shown); presumably urlopen,
    # BeautifulSoup and an already opened output file.
    for i in range(0,4):
        my_url = 'https://www.kickante.com.br/campanhas-crowdfunding?page='+str(i)
        uclient = ureq(my_url)
        page_html = uclient.read()
        uclient.close()
        page_soup = soup(page_html, 'html.parser')
        containers = page_soup.find_all("div", {"class":"campaign-card-wrapper views-row"})
        for container in containers:
            # Find the campaign title
            titleCampaignBruto = container.div.div.a.img["title"].replace('Crowdfunding para: ', '')
            titleCampaignParsed = titleCampaignBruto.strip().replace(",", ";")
            # Find the amount raised
            arrecadadoFind = container.div.find_all("div",{"class":"funding-raised"})
            arrecadado = arrecadadoFind[0].text.strip().replace(",", ".")
            # Number of donors
            doadoresBruto = container.div.find_all('span', {"class":"contributors-value"})
            doadoresParsed = doadoresBruto[0].text.strip().replace(",",";")
            # Campaign target
            fundingGoal = container.div.find_all('div', {"class":"funding-progress"})
            quantoArrecadado = fundingGoal[0].text.strip().replace(",",";")
            # Campaign description
            descricaoBruta = container.div.find_all('div', {"class":"field field-name-field-short-description field-type-text-long field-label-hidden"})
            descricaoParsed = descricaoBruta[0].text.strip().replace(",",";")
            # Campaign link
            linkCampanha = container.div.find_all('href')
            print("Título da campanha: " + titleCampaignParsed)
            print("Valor da campanha: " +arrecadado)
            print("Doadores: "+ doadoresParsed)
            print("target: " + quantoArrecadado)
            print("descricao: " + descricaoParsed)
            f.write(titleCampaignParsed + "," + arrecadado + "," + doadoresParsed + "," + quantoArrecadado+ "," + descricaoParsed.replace("," ,";") + "\n")
        i = i+1
    f.close()
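When I open the CSV file it generates, I see that some lines are broken where they shouldn't be (for example: see line 31 on the csv file). That line should be part of the previous line (line 30), as the body of the description.
Does anyone know what is causing this? Thanks in advance.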
Some of the text you are writing to the CSV probably contains newline characters. You can remove them like this:
    csv_line_entries = [
        titleCampaignParsed, arrecadado, doadoresParsed,
        quantoArrecadado, descricaoParsed.replace("," ,";")
    ]
    csv_line = ','.join([
        entry.replace('\n', ' ') for entry in csv_line_entries
    ])
    f.write(csv_line + '\n')
Cause of the error
The strip() method only removes leading and trailing newlines/whitespace.
    >>> import bs4
    >>> soup = bs4.BeautifulSoup('<p>Whatever\nelse\n</p>', 'html.parser')
    >>> soup.find('p').text.strip()
    'Whatever\nelse'
Note that the inner \n was not removed.
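As an alternative to joining the fields by hand, Python's standard csv module can quote fields that contain commas or newlines, so they no longer need to be replaced. The sketch below is only an illustration: the row values are hypothetical stand-ins for the scraped variables, and an in-memory buffer stands in for the output file f. Note that the quoted description still spans two physical lines in the file; a CSV parser reads it back as a single record, but a plain text editor will still show the break.

```python
import csv
import io

# Hypothetical values standing in for the scraped fields in the question.
row = ["Campaign, with comma", "1.000", "12", "50%", "First line\nsecond line"]

buffer = io.StringIO()  # stands in for the open output file `f`
csv.writer(buffer).writerow(row)  # fields with "," or "\n" get quoted

# Read the data back: the embedded newline stays inside one record.
buffer.seek(0)
rows = list(csv.reader(buffer))

print(len(rows))       # number of records parsed
print(rows[0] == row)  # whether the fields survived the round trip
```

When writing to a real file with csv.writer, the csv documentation recommends opening it with newline='' so that line endings are not translated twice.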
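There are newline characters in the middle of the text. strip() only removes whitespace at the beginning and end of the string, so you need to use replace('\n','') instead. This replaces every newline \n with the empty string ''.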
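One thing to watch for when removing newlines this way: replacing \n with an empty string glues the neighbouring words together, and it leaves any \r characters in place. The following sketch, using a made-up sample string, compares three options; which one fits depends on how the description should read in the CSV.

```python
# Made-up sample text with a Unix newline, a Windows newline and extra spaces.
text = "First line\nsecond line\r\nthird   line"

glued = text.replace("\n", "")      # removes "\n" only; words run together
spaced = text.replace("\n", " ")    # keeps words apart; "\r" is still there
collapsed = " ".join(text.split())  # collapses every run of whitespace

print(repr(glued))
print(repr(spaced))
print(repr(collapsed))
```

The last form relies on str.split() with no arguments, which splits on any whitespace, including \r, \n and tabs.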