Extracting text inside tags from an HTML document

Question

I have an HTML document like this: https://dropmefiles.com/wezmb and I need to extract the text inside a tag.

from bs4 import BeautifulSoup

with open("10_01.htm") as fp:
    soup = BeautifulSoup(fp,features="html.parser")
    for a in soup.find_all('span'):
        print(a.string)

But this prints the contents of every 'span' tag. How can I extract the text inside just the one tag I want?

Answer 1

What you need is the .contents attribute (see the Beautiful Soup documentation).

Usage

To find the span <span id = "1"> ... </span>:

for x in soup.find(id="1").contents:
    print(x)

or

x = soup.find(id="1").contents[0]  # find() returns the first match; contents[0] is its first child node
print(x)
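To make the snippets above easy to try without the original file, here is a minimal self-contained sketch. The HTML string is a hypothetical stand-in for the asker's document (which is not reproduced here), and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the asker's file; the real document may differ.
html = '<div><span id="1">\n10\n</span><span id="2">\n20\n</span></div>'

soup = BeautifulSoup(html, features="html.parser")

span = soup.find(id="1")   # first element whose id attribute is "1"
children = span.contents   # list of the span's direct child nodes
first = children[0]        # a NavigableString, which behaves like a str

print(repr(first))         # repr() makes the surrounding newlines visible
```

Unlike find_all('span'), which returns every span, find(id="1") narrows the search to the one element of interest.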

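This will give you an empty line, followed by 10, followed by another empty line. That is because the string in the HTML really is like that: as you can see in the HTML itself, 10 sits on its own line.
The string is, correctly, '\n10\n'.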

If you only want x = '10' out of x = '\n10\n', you can do x = x[1:-1], since '\n' is a single character. Hope this helps.
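The slice works when there is exactly one character of padding on each side. As a hedged alternative for cases where the amount of surrounding whitespace is unknown, the sketch below compares the slice with str.strip() and with Beautiful Soup's get_text(strip=True). The HTML string is again a hypothetical stand-in for the real file.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the asker's file; the real document may differ.
html = '<span id="1">\n10\n</span>'

soup = BeautifulSoup(html, features="html.parser")
span = soup.find(id="1")

raw = span.contents[0]            # the raw child string, '\n10\n'

sliced = raw[1:-1]                # drops exactly one character from each end
stripped = raw.strip()            # drops all leading and trailing whitespace
text = span.get_text(strip=True)  # strips whitespace around the tag's text

print(sliced, stripped, text)
```

For this input all three give the same result; strip() and get_text(strip=True) simply do not depend on the padding being exactly one character wide.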
