How to extract text before "br"?
I have a small problem. I am using Python 2.7.8 and I am trying to extract the text that comes before each <br>. My HTML looks like this:
<html>
<body>
<div class="entry-content" >
<p>Here is a listing of C interview questions on “Variable Names” along with answers, explanations and/or solutions:
</p>
<p>1. C99 standard guarantees uniqueness of ____ characters for internal names.<br>
a) 31<br>
b) 63<br>
c) 12<br>
d) 14</p>
<p> more </p>
<p>2. C99 standard guarantess uniqueness of _____ characters for external names.<br>
a) 31<br>
b) 6<br>
c) 12<br>
d) 14</p>
</div>
</body>
</html>
The code I have tried so far picks up the text after <br> rather than before it. Here is the code:
from BeautifulSoup import BeautifulSoup, NavigableString, Tag
soup2 = BeautifulSoup(htmls)
for br2 in soup2.findAll('br'):
    next = br2.previousSibling
    if not (next and isinstance(next,NavigableString)):
        continue
    next2 = next.previousSibling
    if next2 and isinstance(next2,Tag) and next2.name == 'br':
        text = str(next).strip()
        if text:
            print "Found:", next.encode('utf-8')
The output is:
Found:
a) 31
Found:
b) 63
Found:
c) 12
Found:
d) 14
a) 31
Found:
b) 6
Found:
c) 12
Found:
d) 14
Found:
Any idea what I am doing wrong?
First of all, I would switch to BeautifulSoup version 4. BeautifulSoup 3 is quite old and is no longer maintained:
Beautiful Soup 3 has been replaced by Beautiful Soup 4.
Beautiful Soup 3 only works on Python 2.x, but Beautiful Soup 4 also
works on Python 3.x. Beautiful Soup 4 is faster, has more features,
and works with third-party parsers like lxml and html5lib. Once the
beta period is over, you should use Beautiful Soup 4 for all new
projects.
Run:
pip install beautifulsoup4
And change your import statement from:
from BeautifulSoup import BeautifulSoup
to:
from bs4 import BeautifulSoup
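Now, what I would do is find the question text and get the br siblings that follow it. For each of those siblings, next_sibling is the text of one answer option. Working code (data is the HTML string):
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, "html5lib")  # using the "html5lib" parser here
# each question is a text node that starts with a number and a dot, e.g. "1."
for question in soup.find_all(text=re.compile(r"^\d+\.")):
    # the text node right after each following <br> is one answer option
    answers = [br.next_sibling.strip() for br in question.find_next_siblings("br")]
    print(question)
    print(answers)
    print("------")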
For the sample HTML provided in the question, it prints:
1. C99 standard guarantees uniqueness of ____ characters for internal names.
[u'a) 31', u'b) 63', u'c) 12', u'd) 14']
------
2. C99 standard guarantess uniqueness of _____ characters for external names.
[u'a) 31', u'b) 6', u'c) 12', u'd) 14']
------
Note that you may need to install the html5lib library:
pip install html5lib
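If installing bs4 and html5lib is not an option, the same "split the paragraph at each br" idea can be sketched with only the standard library's html.parser (Python 3). This is a rough sketch and not a replacement for the BeautifulSoup code above: BrSplitter and split_on_br are names made up for this example, the sample string is a shortened stand-in for the question's HTML, and it assumes every <p> is closed explicitly.

```python
from html.parser import HTMLParser


class BrSplitter(HTMLParser):
    """Collect the text of each <p>, split into pieces at every <br>."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # one list of text pieces per non-empty <p>
        self._pieces = None   # pieces of the <p> currently open, else None
        self._buffer = []     # text seen since the last <br> or <p>

    def _flush(self):
        # Close the current piece; whitespace-only pieces are skipped.
        text = "".join(self._buffer).strip()
        self._buffer = []
        if text and self._pieces is not None:
            self._pieces.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._pieces = []
            self._buffer = []
        elif tag == "br":
            self._flush()

    def handle_endtag(self, tag):
        if tag == "p" and self._pieces is not None:
            self._flush()
            if self._pieces:
                self.paragraphs.append(self._pieces)
            self._pieces = None

    def handle_data(self, data):
        # Only keep text that sits inside a <p>.
        if self._pieces is not None:
            self._buffer.append(data)


def split_on_br(html):
    """Return a list of paragraphs, each a list of <br>-separated strings."""
    parser = BrSplitter()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Shortened stand-in for the HTML in the question (an assumption).
sample = (
    "<p>1. First question.<br>\n"
    "a) 31<br>\n"
    "b) 63<br>\n"
    "c) 12<br>\n"
    "d) 14</p>"
)

for question, *answers in split_on_br(sample):
    print(question)
    print(answers)
```

Each inner list holds the question first and the options after it, including the last option, which has no <br> after it. Paragraphs without any <br>, such as the introduction, come back as single-item lists, so you would still filter on the leading number if you only want the questions.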