Get all <div> siblings until the next <div> with specific text, using lxml.html and XPath

I have an HTML file with content like this:

<div>File: NameFile1</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>
<div>File: NameFile2</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>

So I need an XPath expression that gets, for each file, all of its Text divs.

Here is what I wrote:

from lxml import html
h = '''
<div>File: NameFile1</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>
<div>File: NameFile2</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>'''
tree = html.fromstring(h)
files_div = tree.xpath("//div[contains(text(), 'File:')]")
files = dict()
for file_div in files_div:
    files[file_div] = file_div.xpath(
        "./following-sibling::div[not(contains(text(), 'File')) and contains(text(), 'Text')]")

However, with the XPath expression above I get the Text divs of every file, while I only want the ones belonging to the matching file. What should the XPath expression be?

Thanks

You can use

/*/div[contains(text(), 'File:')][1]/following-sibling::div[contains(text(), 'Text')  and count(preceding-sibling::div[contains(text(), 'File:')])=1]

This XPath selects all div elements that contain the word Text and follow the first element containing File:.

For the second file, use

/*/div[contains(text(), 'File:')][2]/following-sibling::div[contains(text(), 'Text')  and count(preceding-sibling::div[contains(text(), 'File:')])=2]

And so on: you iterate up to the number of elements that contain File:.
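The count() trick above can be driven by a loop instead of writing one expression per file. A minimal sketch with lxml (it uses //div rather than /*/div so it does not depend on how lxml roots the fragment; variable names are my own):

```python
from lxml import html

h = '''
<div>File: NameFile1</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>
<div>File: NameFile2</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>'''

tree = html.fromstring(h)
files = {}
# One pass per "File:" div: the i-th file's Text divs are exactly those
# with i preceding "File:" siblings.
for i, file_div in enumerate(tree.xpath("//div[contains(text(), 'File:')]"), start=1):
    texts = tree.xpath(
        f"//div[contains(text(), 'Text') and "
        f"count(preceding-sibling::div[contains(text(), 'File:')]) = {i}]")
    files[file_div.text] = [t.text for t in texts]

print(files)
```

This keeps the absolute XPath from the answer but parameterizes the count with the file's position.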

For a problem like this, I suggest using BeautifulSoup.

The solution is:

h = '''
<div>File: NameFile1</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>
<div>File: NameFile2</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(h, 'html.parser')

files = {}
x = soup.find('div')
current_file = ''
while True:
    if 'File:' in x.text:
        current_file = x.text
        files[current_file] = []
    else:
        files[current_file].append(x.text)

    x = x.find_next_sibling('div')
    if x is None:
        break


You can use BeautifulSoup together with str.split:

from bs4 import BeautifulSoup as soup
r = [b for _, b in map(lambda x: x.text.split(': '), soup(h, 'html.parser').find_all('div'))]

Output:

['NameFile1', 'some text', 'another text', 'another text', 'NameFile2', 'some text', 'another text', 'another text']
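If you need the values grouped per file rather than flattened, you can keep the label half of the split and group in plain Python. A sketch (here `pairs` stands in for the `(label, value)` tuples that splitting each div's text produces):

```python
# Stand-in for the (label, value) tuples produced by splitting each div's text.
pairs = [
    ('File', 'NameFile1'), ('Text1', 'some text'),
    ('Text2', 'another text'), ('Text3', 'another text'),
    ('File', 'NameFile2'), ('Text1', 'some text'),
    ('Text2', 'another text'), ('Text3', 'another text'),
]

files = {}
for label, value in pairs:
    if label == 'File':
        current = files.setdefault(value, [])  # a "File" entry starts a new group
    else:
        current.append(value)

print(files)
# {'NameFile1': ['some text', 'another text', 'another text'],
#  'NameFile2': ['some text', 'another text', 'another text']}
```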

With bs4 4.7.1, filtering with :contains is quite simple.

If you want the whole tags:

from bs4 import BeautifulSoup as bs

html = '''<div>File: NameFile1</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>
<div>File: NameFile2</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>'''

soup = bs(html, 'lxml')
search_term = 'File: '
files_div = [i.text.replace(search_term,'') for i in soup.select(f'div:contains("{search_term}")')]
files = dict()

for number, file_div in enumerate(files_div):
    if file_div != files_div[-1]:
        next_file = files_div[number + 1]
        files[file_div] = soup.select(
            f'div:contains("{file_div}"), '
            f'div:contains("{file_div}") ~ div:not(div:contains("{next_file}"), '
            f'div:contains("{next_file}") ~ div)')
    else:
        files[file_div] = soup.select(
            f'div:contains("{file_div}"), div:contains("{file_div}") ~ div')

print(files) 

If you only want the .text of each tag:

for number, file_div in enumerate(files_div):
    if file_div != files_div[-1]:
        next_file = files_div[number + 1]
        files[file_div] = [i.text for i in soup.select(
            f'div:contains("{file_div}"), '
            f'div:contains("{file_div}") ~ div:not(div:contains("{next_file}"), '
            f'div:contains("{next_file}") ~ div)')]
    else:
        files[file_div] = [i.text for i in soup.select(
            f'div:contains("{file_div}"), div:contains("{file_div}") ~ div')]
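For completeness, the same grouping can be done without CSS selectors at all, using find_next_siblings and stopping at the next file header (a sketch with my own variable names; this also avoids the :contains pseudo-class, which newer soupsieve releases deprecate in favor of :-soup-contains):

```python
from bs4 import BeautifulSoup

h = '''<div>File: NameFile1</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>
<div>File: NameFile2</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>'''

soup = BeautifulSoup(h, 'html.parser')
files = {}
for file_div in soup.find_all('div', string=lambda s: s and 'File:' in s):
    texts = []
    for sib in file_div.find_next_siblings('div'):
        if 'File:' in sib.text:
            break  # reached the next file header, stop collecting
        texts.append(sib.text)
    files[file_div.text.replace('File: ', '')] = texts

print(files)
```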