Python: how to extract the text after <br>?
I am using Python 2.7.8 and am a bit surprised: I get all of the text except the text that follows the last <br>. My HTML page looks like this:
<html>
<body>
<div class="entry-content" >
<p>Here is a listing of C interview questions on “Variable Names” along with answers, explanations and/or solutions:
</p>
<p>Which of the following is not a valid C variable name?<br>
a) int number;<br>
b) float rate;<br>
c) int variable_count;<br>
d) int $main;</p> <!--not getting-->
<p> more </p>
<p>Which of the following is true for variable names in C?<br>
a) They can contain alphanumeric characters as well as special characters<br>
b) It is not an error to declare a variable to be one of the keywords(like goto, static)<br>
c) Variable names cannot start with a digit<br>
d) Variable can be of any length</p> <!--not getting -->!
</div>
</body>
</html>
And my code:
import urllib2
from urllib2 import Request
from bs4 import BeautifulSoup, NavigableString, Tag

url = "http://www.sanfoundry.com/c-programming-questions-answers-variable-names-1/"
#url="http://www.sanfoundry.com/c-programming-questions-answers-variable-names-2/"
req = Request(url)
resp = urllib2.urlopen(req)
htmls = resp.read()
soup = BeautifulSoup(htmls)
for br in soup.findAll('br'):
    # text node that directly follows this <br>, if any
    next = br.nextSibling
    if not (next and isinstance(next, NavigableString)):
        continue
    # only print the text when it is itself followed by another <br>
    next2 = next.nextSibling
    if next2 and isinstance(next2, Tag) and next2.name == 'br':
        text = str(next).strip()
        if text:
            print "Found:", next.encode('utf-8')
Output:
Found:
a) int number;
Found:
b) float rate;
Found:
c) int variable_count;
Found:
a) They can contain alphanumeric characters as well as special characters
Found:
b) It is not an error to declare a variable to be one of the keywords(like goto, static)
Found:
c) Variable names cannot start with a digit
But I am not getting the last piece of text after the <br>, for example:
d) int $main
and
d) Variable can be of any length
The output I am trying to get is:
Found:
a) int number;
Found:
b) float rate;
Found:
c) int variable_count;
Found:
d) int $main
Found:
a) They can contain alphanumeric characters as well as special characters
Found:
b) It is not an error to declare a variable to be one of the keywords(like goto, static)
Found:
c) Variable names cannot start with a digit
d) Variable can be of any length
This happens because BeautifulSoup forces the markup into valid XML by closing the <br> tags just before </p>. The prettified version shows this clearly:
<p>
Which of the following is not a valid C variable name?
<br>
a) int number;
<br>
b) float rate;
<br>
c) int variable_count;
<br>
d) int $main;
</br>
</br>
</br>
</br>
</p>
So the text d) int $main; is not a sibling of the last <br> tag; it is the content of that tag.
The code could then be:
...
soup = BeautifulSoup(htmls)
for br in soup.findAll('br'):
    if len(br.contents) > 0:  # avoid errors if a tag is correctly closed as <br/>
        print 'Found', br.contents[0]
It gives the expected result:
Found
a) int number;
Found
b) float rate;
Found
c) int variable_count;
Found
d) int $main;
Found
a) They can contain alphanumeric characters as well as special characters
Found
b) It is not an error to declare a variable to be one of the keywords(like goto, static)
Found
c) Variable names cannot start with a digit
Found
d) Variable can be of any length
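Whether the text ends up inside a <br> tag or next to it can vary with the parser BeautifulSoup is using, so it may be safer not to rely on either layout. The following is only a minimal sketch of that idea, assuming Python 3, bs4 and the built-in "html.parser"; SAMPLE_HTML and extract_options are illustrative names that do not appear in the question, and the sample markup is a shortened copy of the page above. It walks every string inside each <p> and keeps the ones that look like an answer option.

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (assumed sample, no network needed).
SAMPLE_HTML = """
<div class="entry-content">
<p>Which of the following is not a valid C variable name?<br>
a) int number;<br>
b) float rate;<br>
c) int variable_count;<br>
d) int $main;</p>
<p> more </p>
</div>
"""

def extract_options(markup):
    """Return every 'a) ...' to 'd) ...' line found inside <p> elements."""
    soup = BeautifulSoup(markup, "html.parser")
    options = []
    for p in soup.find_all("p"):
        # stripped_strings yields each text node under <p>, whitespace-trimmed,
        # whether the node is a sibling of <br> or nested inside it.
        for text in p.stripped_strings:
            if re.match(r"[a-d]\)", text):
                options.append(text)
    return options

options = extract_options(SAMPLE_HTML)
for option in options:
    print("Found:", option)
```

Because the loop only looks at strings under each <p>, the last option is picked up even though no <br> follows it.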
You can use Requests instead of urllib2 and extract the content with lxml's html module.
from lxml import html
import requests
#request page
page=requests.get("http://www.sanfoundry.com/c-programming-questions-answers-variable-names-1/")
#get content in html format
page_content=html.fromstring(page.content)
#recover all text from <p> elements
items=page_content.xpath('//p/text()')
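The code above returns a list containing all the text found in the document's <p> elements.
With that, you can simply index into the list to print whatever you want.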
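As a rough illustration of the indexing step, here is a sketch that runs the same XPath on an inline sample instead of the live page, so it needs no network access. SAMPLE_HTML and items are illustrative names, and stripping out whitespace-only entries is an added assumption about what is wanted, not part of the answer above.

```python
from lxml import html

# Shortened copy of the markup from the question (assumed sample).
SAMPLE_HTML = """<div class="entry-content">
<p>Which of the following is not a valid C variable name?<br>
a) int number;<br>
b) float rate;<br>
c) int variable_count;<br>
d) int $main;</p>
<p> more </p>
</div>"""

tree = html.fromstring(SAMPLE_HTML)

# text() selects every text node that is a direct child of a <p>,
# including the text that follows each <br>.
raw_items = tree.xpath('//p/text()')

# Trim surrounding whitespace and drop entries that are empty after trimming.
items = [text.strip() for text in raw_items if text.strip()]

# Index into the list to pick out a single line, here the last option.
print(items[4])
```

On a real page the positions depend on how many <p> elements precede the one of interest, so the index would need to be adjusted accordingly.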