Remove whitespace and newlines - beautifulsoup python
Using Beautifulsoup, I am scraping the following web resource:
<div>
<p class="introduction"> Manchester City's Fabian Delph limped off in the first minute of England Euro 2016 qualifier against Switzerland with a suspected hamstring injury. </p>
<p> The 25-year-old midfielder, who signed for City from Aston Villa in the summer, pulled up suddenly during Tuesday's game at Wembley. </p>
<p> Delph was picked in Roy Hodgson's first XI having been left out of the starting line-up against San Marino on Saturday.</p>
<p> Delph was making his eighth appearance for England.</p>
</div>
I am using the following code:
for item in soup.find_all('div'):
    print(item.find('p').text.replace('\n',''))
This works, but the result looks like this (more like four separate values):
Manchester City's Fabian Delph limped off in the first minute of England's Euro 2016 qualifier against Switzerland with a suspected hamstring injury.
The 25-year-old midfielder, who signed for City from Aston Villa in the summer, pulled up suddenly during Tuesday's game at Wembley.
Delph was picked in Roy Hodgson's first XI having been left out of the starting line-up against San Marino on Saturday.
Delph was making his eighth appearance for England.
How can I get the output in the following format (more like a single value):
Manchester City's Fabian Delph limped off in the first minute of England's Euro 2016 qualifier against Switzerland with a suspected hamstring injury. The 25-year-old midfielder, who signed for City from Aston Villa in the summer, pulled up suddenly during Tuesday's game at Wembley. Delph was picked in Roy Hodgson's first XI having been left out of the starting line-up against San Marino on Saturday. Delph was making his eighth appearance for England.
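Ultimately, I want to save this data in a csv file. The text above should be treated as a single value in the csv file (not four values).
What you are doing is calling print on every pass through the loop. print just writes the string to the console and then writes a newline. Instead, you can build one big string like this: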
big_string = ""
for item in soup.find_all('div'):
    big_string += item.find('p').text.replace('\n','')
You can try:
divs = soup.find_all('div')
result = ''.join([div.find('p').text.replace('\n','') for div in divs])
print(result)
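The second line collects the paragraph text from each div in the list and joins the pieces end to end. See the documentation for str.join.
This approach is faster than adding the strings together one at a time (which is also a valid, correct and good-enough approach), because it does not create extra intermediate strings along the way.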
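As a minimal sketch of the join approach, the snippet below works on plain strings rather than on a soup object, so it runs without BeautifulSoup installed. The `paragraphs` list and the `join_paragraphs` helper are hypothetical names made up for illustration; the list stands in for the `.text` values of the `<p>` tags. It also collapses stray whitespace, which `replace('\n','')` alone does not do. Written for Python 3.

```python
# Hypothetical paragraph texts, standing in for the .text of each <p> tag.
paragraphs = [
    " Delph was picked in Roy Hodgson's first XI.\n",
    " Delph was making his eighth appearance for England.",
]


def join_paragraphs(texts):
    """Collapse whitespace inside each text, then join them with single spaces."""
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and ignores leading/trailing whitespace.
    cleaned = [" ".join(text.split()) for text in texts]
    # Skip texts that are empty after cleaning so no double spaces appear.
    return " ".join(part for part in cleaned if part)


result = join_paragraphs(paragraphs)
print(result)
```

Joining with `" "` instead of `''` keeps a space between sentences, which matches the single-value output asked for in the question.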
You call the print statement four times, which is why the output appears on four lines.
Try this modification:
single_string_answer = ''
for item in soup.find_all('div'):
    # str.replace returns a new string, so append its result directly
    single_string_answer += item.find('p').text.replace('\n','')
print(single_string_answer)
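For the csv part of the question, here is a hedged sketch of how the combined string could be written as one field using the standard library `csv` module. The `to_csv_row` helper and the shortened sample `text` are assumptions for illustration; the sketch writes to an in-memory buffer, and a real script would pass an open file object to `csv.writer` instead. Written for Python 3.

```python
import csv
import io


def to_csv_row(value):
    """Return CSV text that holds `value` as a single field in a single row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    # A one-element list produces exactly one field; the writer adds
    # quotes when the value contains commas.
    writer.writerow([value])
    return buffer.getvalue()


# Shortened, hypothetical sample of the combined paragraph text.
text = ("The 25-year-old midfielder, who signed for City, pulled up suddenly. "
        "Delph was making his eighth appearance for England.")

csv_text = to_csv_row(text)

# Reading the CSV text back shows one row containing one value.
rows = list(csv.reader(io.StringIO(csv_text)))
print(len(rows), len(rows[0]))
```

Because the writer quotes the field, the commas inside the sentences do not split it into several columns.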