Find all tags of a certain class only after a tag with certain text

Question

I have one big, long table in HTML, so the tags aren't nested within each other. It looks like this:

<tr>
    <td>A</td>
</tr>
<tr>
    <td class="x">...</td>
    <td class="x">...</td>
    <td class="x">...</td>
    <td class="x">...</td>
</tr>
<tr>
    <td class ="y">...</td>
    <td class ="y">...</td>
    <td class ="y">...</td>
    <td class ="y">...</td>
</tr>
<tr>
    <td>B</td>
</tr>
<tr>
    <td class="x">...</td>
    <td class="x">...</td>
    <td class="x">...</td>
    <td class="x">...</td>
</tr>
<tr>
    <td class ="y">I want this</td>
    <td class ="y">and this</td>
    <td class ="y">and this</td>
    <td class ="y">and this</td>
</tr>

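So first I want to search the tree to find "B". Then I want to grab the text of every td tag with class y that comes after B but before the next row of the table that starts with "C".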

I've tried this:

results = soup.find_all('td')
for result in results:
    if result.string == "B":
        print(result.string)

This gets me the string B that I want. But now I'm trying to find everything after it, and I'm not getting what I want.

for results in soup.find_all('td'):
    if results.string == 'B':
        a = results.find_next('td',class_='y')

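This gives me the next td after 'B', which is what I want, but I can only seem to get that first td tag. I want to grab all the tags that have class y after 'B' but before 'C' (C isn't shown in the HTML, but it follows the same pattern), and I want to put them in a list.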

My resulting list would be:

[['I want this'],['and this'],['and this'],['and this']]

Answer 1

Basically, you need to locate the element containing the text B. This is your starting point.

Then, check every tr sibling of this element using find_next_siblings():

# "string" is the current name of the older "text" argument (Beautiful Soup 4.4.0+)
start = soup.find("td", string="B").parent
for tr in start.find_next_siblings("tr"):
    # stop once the row containing "C" is reached
    if tr.find("td", string="C"):
        break

    # get all tds with the desired class
    tds = tr.find_all("td", class_="y")
    for td in tds:
        print(td.get_text())

Tested on your sample data, it prints:

I want this
and this
and this
and this
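
Since the question asks for a list rather than printed output, the same sibling walk can be wrapped in a function that collects the text. The following is only a minimal sketch, not a definitive implementation: the helper name `texts_between`, its parameters, and the shortened sample table (including the "C" row, which the question says exists but does not show) are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the table in the question; the "C" row is assumed.
html = """
<table>
<tr><td>A</td></tr>
<tr><td class="x">1</td><td class="x">2</td></tr>
<tr><td class="y">skip</td><td class="y">skip</td></tr>
<tr><td>B</td></tr>
<tr><td class="x">3</td><td class="x">4</td></tr>
<tr><td class="y">I want this</td><td class="y">and this</td></tr>
<tr><td>C</td></tr>
<tr><td class="y">not this</td></tr>
</table>
"""


def texts_between(soup, start_text, stop_text, css_class):
    """Hypothetical helper: collect the text of every td with css_class
    in the rows after the start_text row and before the stop_text row."""
    start_cell = soup.find("td", string=start_text)
    if start_cell is None:
        return []  # start marker not found, nothing to collect

    collected = []
    # Walk the rows that follow the row holding the start marker.
    for tr in start_cell.parent.find_next_siblings("tr"):
        # Stop as soon as the row holding the stop marker is reached.
        if tr.find("td", string=stop_text):
            break
        for td in tr.find_all("td", class_=css_class):
            collected.append(td.get_text(strip=True))
    return collected


soup = BeautifulSoup(html, "html.parser")
result = texts_between(soup, "B", "C", "y")
print(result)
```

If the stop marker never appears, the loop simply runs to the end of the table. To get the nested shape shown in the question, wrap each item, for example `[[text] for text in result]`.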
