Removing <wbr> tags and grabbing the info between

Question

I am scraping data from a web page and have targeted a specific section that contains a <br> tag.

<div class="scrollWrapper">
    <h3>Smiles</h3>
    CC=O<br>
    <button type="button" id="downloadSmiles">Download</button>
</div>

I solved this by running the following script, which outputs CC=O.

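import requests
from lxml import html

# substance holds the chemical name to look up; it is defined elsewhere.
page = requests.get('http://chem.sis.nlm.nih.gov/chemidplus/name/' + substance)
tree = html.fromstring(page.text)
if "Smiles" in page.text:
    # Take the text node immediately before the first <br> in the Smiles section.
    smiles = tree.xpath('normalize-space(//*[text()="Smiles"]/..//br[1]/preceding-sibling::text()[1])')
else:
    smiles = ""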

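However, as I went through the pages for other chemicals, I ran into some that contain <wbr> tags. I don't know how to get rid of them while still grabbing the information between them. An example is shown below; the output I want is c1(c2ccccc2)ccc(N)cc1.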

<div class="scrollWrapper">
   <h3>Smiles</h3>
   c1(c2ccccc2)<wbr>ccc(N)<wbr>cc1<br>
   <button type="button" id="downloadSmiles">Download</button>
</div>

Answer 1

<wbr>

The <wbr> (Word Break Opportunity) tag specifies where in a text it would be ok to add a line-break. Tip: When a word is too long, or you are afraid that the browser will break your lines at the wrong place, you can use the <wbr> element to add word break opportunities.

I used BeautifulSoup to parse this data.

from bs4 import BeautifulSoup as bs

html = """
<div class="scrollWrapper">
   <h3>Smiles</h3>
   c1(c2ccccc2)<wbr>ccc(N)<wbr>cc1<br>
   <button type="button" id="downloadSmiles">Download</button>
</div>
"""

soup = bs(html, "html.parser")
rows = soup.get_text().split()
print(rows[1])

Output:

   c1(c2ccccc2)ccc(N)cc1
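
Indexing into rows works for this small snippet, but on a full page the position of the SMILES string in the split text is not fixed. As a rough sketch of one way to scope the extraction to the Smiles section, the following uses the standard library's html.parser in place of BeautifulSoup (a swapped-in technique, chosen so it runs without third-party packages). The names SmilesParser, extract_smiles, and SAMPLE are hypothetical, and the sketch assumes the page layout shown in the question: an <h3>Smiles</h3> heading followed by the string and a <br>.

```python
from html.parser import HTMLParser

SAMPLE = """
<div class="scrollWrapper">
   <h3>Smiles</h3>
   c1(c2ccccc2)<wbr>ccc(N)<wbr>cc1<br>
   <button type="button" id="downloadSmiles">Download</button>
</div>
"""


class SmilesParser(HTMLParser):
    """Collect the text between <h3>Smiles</h3> and the next <br>."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False       # currently inside an <h3> element
        self.capturing = False   # past the "Smiles" heading, before <br>
        self.done = False        # set once the first match is complete
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3 = True
        elif tag == "br" and self.capturing:
            # The SMILES string ends at the first <br> after the heading.
            self.capturing = False
            self.done = True
        # <wbr> needs no handling: it only splits the text into chunks.

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_h3 = False

    def handle_data(self, data):
        if self.done:
            return
        if self.in_h3:
            if data.strip() == "Smiles":
                self.capturing = True
        elif self.capturing:
            # Assumes the SMILES string itself contains no whitespace.
            self.parts.append(data.strip())


def extract_smiles(markup):
    parser = SmilesParser()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


print(extract_smiles(SAMPLE))
```

Because the chunks on either side of each <wbr> are simply joined, the same function should also handle pages without <wbr>, such as the CC=O example in the question.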

Answer 2

Just to point out: you can remove a specific string by doing the following:

str.replace(old, "")

For example:

"c1(c2ccccc2)<wbr>ccc(N)<wbr>cc1<br>".replace("<wbr>", "").replace("<br>", "")

However, the other answers come closer to the desired result.
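
A plain str.replace only matches the exact spelling <wbr>. If the markup might also contain variants such as <wbr/> or <WBR> (an assumption; the pages in the question only show <wbr>), a small regular expression is one possible way to cover them. The helper name strip_wbr is hypothetical.

```python
import re

# Matches <wbr>, <wbr/>, <wbr /> and their upper-case variants.
WBR_RE = re.compile(r"<wbr\s*/?>", re.IGNORECASE)


def strip_wbr(fragment):
    """Return the fragment with every <wbr> tag removed."""
    return WBR_RE.sub("", fragment)


raw = "c1(c2ccccc2)<wbr>ccc(N)<WBR/>cc1"
print(strip_wbr(raw))  # c1(c2ccccc2)ccc(N)cc1
```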

Answer 3

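The simplest approach is to replace every <wbr> string in page.text with an empty string before parsing it as HTML. Since the string is wrapped in < and >, I doubt that any useful information you are looking for would contain it.

Example -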

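import requests
from lxml import html

# substance holds the chemical name to look up; it is defined elsewhere.
page = requests.get('http://chem.sis.nlm.nih.gov/chemidplus/name/' + substance)
# Drop the <wbr> tags before parsing so the SMILES string stays in one text node.
tree = html.fromstring(page.text.replace('<wbr>', ''))
if "Smiles" in page.text:
    smiles = tree.xpath('normalize-space(//*[text()="Smiles"]/..//br[1]/preceding-sibling::text()[1])')
else:
    smiles = ""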

Otherwise you can use @Bun's BeautifulSoup solution, or write more complex xpaths.

Also, a simpler xpath would be -

'normalize-space(//*[text()="Smiles"]/following-sibling::text()[1])'

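This replaces finding the Smiles element, taking its parent, finding the first br element among the parent's descendants, taking that element's preceding sibling, and then taking its text.

Instead, you take the following sibling of the Smiles element directly and read its text.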
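
To compare the two expressions without hitting the network, here is a minimal sketch that runs both against the HTML snippet from the question after stripping <wbr>. The names SAMPLE, ORIGINAL_XPATH, SIMPLER_XPATH, and smiles_from_markup are hypothetical, and the sketch assumes the real page has the same structure as the snippet.

```python
from lxml import html

SAMPLE = """
<div class="scrollWrapper">
   <h3>Smiles</h3>
   c1(c2ccccc2)<wbr>ccc(N)<wbr>cc1<br>
   <button type="button" id="downloadSmiles">Download</button>
</div>
"""

# The expression from the question: go up to the parent, find the first <br>,
# then step back to the text node just before it.
ORIGINAL_XPATH = 'normalize-space(//*[text()="Smiles"]/..//br[1]/preceding-sibling::text()[1])'

# The shorter expression: the first text node following the Smiles heading.
SIMPLER_XPATH = 'normalize-space(//*[text()="Smiles"]/following-sibling::text()[1])'


def smiles_from_markup(markup, xpath=SIMPLER_XPATH):
    # Remove <wbr> before parsing so the SMILES string is a single text node.
    tree = html.fromstring(markup.replace("<wbr>", ""))
    return tree.xpath(xpath)


print(smiles_from_markup(SAMPLE, ORIGINAL_XPATH))
print(smiles_from_markup(SAMPLE, SIMPLER_XPATH))
```

Once the <wbr> tags are gone, both expressions should select the same text node, so the shorter one is enough for this layout.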

Tags: python, lxml, wbr