Remove items in string paragraph if they belong to a list of strings?
import urllib2,sys
from bs4 import BeautifulSoup,NavigableString
obama_4427_url = 'http://www.millercenter.org/president/obama/speeches/speech-4427'
obama_4427_html = urllib2.urlopen(obama_4427_url).read()
obama_4427_soup = BeautifulSoup(obama_4427_html)
# find the speech itself within the HTML
obama_4427_div = obama_4427_soup.find('div',{'id': 'transcript'},{'class': 'displaytext'})
# convert soup to string for easier processing
obama_4427_str = str(obama_4427_div)
# list of characters to be removed from obama_4427_str
remove_char = ['<br/>','</p>','</div>','<div class="indent" id="transcript">','<h2>','</h2>','<p>']
remove_char
for char in obama_4427_str:
    if char in obama_4427_str:
        obama_4427_replace = obama_4427_str.replace(remove_char,'')
obama_4427_replace = obama_4427_str.replace(remove_char,'')
print(obama_4427_replace)
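Using BeautifulSoup, I have scraped one of Obama's speeches from the website above. Now I need to replace some residual HTML in an efficient manner. I have stored the list of elements I want to eliminate in remove_char. I am trying to write a simple for statement, but I get the error: TypeError: expected a character buffer object. I know this is a beginner question, but how do I fix it?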
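Since you are already using BeautifulSoup, you can use obama_4427_div.text directly instead of str(obama_4427_div) to get the properly formatted text. The text you get back will not contain any residual HTML elements.
Example: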
>>> obama_4427_div = obama_4427_soup.find('div',{'id': 'transcript'},{'class': 'displaytext'})
>>> obama_4427_str = obama_4427_div.text
>>> print(obama_4427_str)
Transcript
To Chairman Dean and my great friend Dick Durbin; and to all my fellow citizens of this great nation;
With profound gratitude and great humility, I accept your nomination for the presidency of the United States.
Let me express my thanks to the historic slate of candidates who accompanied me on this ...
...
...
...
Thank you, God Bless you, and God Bless the United States of America.
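To see the same idea without fetching the page, here is a minimal sketch that runs on a small inline snippet. The sample_html string is made up for illustration and only imitates the markup implied by remove_char; it is not the real page. The sketch also assumes Python 3 and the built-in "html.parser" backend. get_text() is the method form of .text, and its separator and strip arguments control how the pieces are joined.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (not the real transcript).
sample_html = (
    '<div class="indent" id="transcript">'
    '<h2>Transcript</h2>'
    '<p>First paragraph.<br/>Second line.</p>'
    '</div>'
)

soup = BeautifulSoup(sample_html, "html.parser")
div = soup.find("div", id="transcript")

# Join the text nodes with newlines and trim whitespace around each one.
plain = div.get_text(separator="\n", strip=True)
print(plain)
```

Passing a separator keeps text from neighbouring tags from running together, which plain .text does not do on its own.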
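For completeness: the TypeError occurs because str.replace() expects a single string as its first argument, but the code passes the whole remove_char list. To remove the elements from the string, keep a list of the elements to remove (like the remove_char list you created) and call str.replace() on the string once for each element in the list. Example: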
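obama_4427_str = str(obama_4427_div)
remove_char = ['<br/>','</p>','</div>','<div class="indent" id="transcript">','<h2>','</h2>','<p>']
# iterate over the list of fragments, not over the characters of the string
for char in remove_char:
    # replace() returns a new string, so reassign the result each time
    obama_4427_str = obama_4427_str.replace(char,'')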
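The loop can also be wrapped in a small helper, which makes it easy to try on a short string. This is only a sketch: strip_fragments and the sample string are hypothetical names and data introduced here, and the code assumes Python 3, where the TypeError message is worded differently from the Python 2 one in the question.

```python
def strip_fragments(text, fragments):
    """Return text with every string in fragments removed."""
    for fragment in fragments:
        # str.replace() takes one substring at a time, so loop over the list.
        text = text.replace(fragment, '')
    return text


# Made-up sample standing in for str(obama_4427_div).
sample = ('<div class="indent" id="transcript"><h2>Transcript</h2>'
          '<p>Thank you.<br/>God bless.</p></div>')
remove_char = ['<br/>', '</p>', '</div>',
               '<div class="indent" id="transcript">', '<h2>', '</h2>', '<p>']

cleaned = strip_fragments(sample, remove_char)
print(cleaned)

# Passing the list itself reproduces the kind of error from the question.
try:
    sample.replace(remove_char, '')
except TypeError as exc:
    print('TypeError:', exc)
```

Note that removing the tags leaves nothing between the pieces of text, so words from adjacent elements end up glued together. That is one reason the .text approach is usually the more convenient one.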