Get all <div> siblings until the next <div> with a specific text, with lxml.html and XPath
I have an HTML file whose content looks like this:
<div>File: NameFile1</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>
<div>File: NameFile2</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>
So I need an XPath expression to get all the text divs belonging to each file.
Here is what I wrote:
from lxml import html
h = '''
<div>File: NameFile1</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>
<div>File: NameFile2</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>'''
tree = html.fromstring(h)
files_div = tree.xpath("//div[contains(text(),'File:')]")
files = dict()
for file_div in files_div:
    files[file_div] = file_div.xpath("./following-sibling::div[not(contains(text(),'File')) and contains(text(),'Text')]")
But with the previous XPath expression it fetches the texts of all files, while I only want the texts of the matching file. What should the XPath expression look like?
Thanks
You can use
/*/div[contains(text(), 'File:')][1]/following-sibling::div[contains(text(), 'Text') and count(preceding-sibling::div[contains(text(), 'File:')])=1]
This XPath selects all DIV elements that contain the word Text and follow the first element that contains File:.
For the second file, use
/*/div[contains(text(), 'File:')][2]/following-sibling::div[contains(text(), 'Text') and count(preceding-sibling::div[contains(text(), 'File:')])=2]
And so on. So you iterate over the number of elements that contain File:.
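Putting that iteration together in Python might look like the sketch below. It counts the File: divs first and then evaluates the answer's indexed XPath once per file; the variable names (n_files, files) are my own, not from the answer.

```python
from lxml import html

h = '''<div>File: NameFile1</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>
<div>File: NameFile2</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>'''

tree = html.fromstring(h)

# how many 'File:' headers are there?
n_files = int(tree.xpath("count(//div[contains(text(), 'File:')])"))

files = {}
for i in range(1, n_files + 1):  # XPath positions are 1-based
    name = tree.xpath(f"(//div[contains(text(), 'File:')])[{i}]/text()")[0]
    # siblings after the i-th 'File:' div that still have exactly i
    # 'File:' divs before them, i.e. this file's own text lines
    files[name] = tree.xpath(
        f"(//div[contains(text(), 'File:')])[{i}]"
        "/following-sibling::div[contains(text(), 'Text') and "
        f"count(preceding-sibling::div[contains(text(), 'File:')])={i}]"
        "/text()"
    )
```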
For a problem like this I recommend using BeautifulSoup.
The solution is:
h = '''
<div>File: NameFile1</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>
<div>File: NameFile2</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(h, 'html.parser')
files = {}
x = soup.find('div')
current_file = ''
while True:
    if 'File:' in x.text:
        current_file = x.text
        files[current_file] = []
    else:
        files[current_file].append(x.text)
    x = x.find_next_sibling('div')
    if x is None:
        break
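The same sibling walk can also be written as a plain for loop over find_next_siblings, which avoids the manual break. A minimal sketch, assuming the html.parser backend; the keys end up being the full 'File:' lines:

```python
from bs4 import BeautifulSoup

h = '''<div>File: NameFile1</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>
<div>File: NameFile2</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>'''

soup = BeautifulSoup(h, 'html.parser')
files = {}
first = soup.find('div')
# the first div plus every later div sibling, in document order
for div in [first] + first.find_next_siblings('div'):
    if 'File:' in div.text:
        current = div.text       # start a new group at each 'File:' div
        files[current] = []
    else:
        files[current].append(div.text)
```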
You can use BeautifulSoup with str.split:
from bs4 import BeautifulSoup as soup
r = [b for _, b in map(lambda x: x.text.split(': '), soup(h, 'html.parser').find_all('div'))]
Output:
['NameFile1', 'some text', 'another text', 'another text', 'NameFile2', 'some text', 'another text', 'another text']
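The flat list loses the grouping by file. If you keep the label half of the split instead of discarding it, you can rebuild a per-file dict; a sketch under the same assumptions, with the sample markup inlined so it stands alone:

```python
from bs4 import BeautifulSoup

h = '''<div>File: NameFile1</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>
<div>File: NameFile2</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>'''

# keep (label, value) pairs instead of just the values
pairs = [div.text.split(': ', 1)
         for div in BeautifulSoup(h, 'html.parser').find_all('div')]

files = {}
for label, value in pairs:
    if label == 'File':
        current = value      # a 'File' label starts a new group
        files[current] = []
    else:
        files[current].append(value)
```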
With bs4 4.7.1, filtering with :contains is very simple.
If you want the whole tags:
from bs4 import BeautifulSoup as bs
html = '''<div>File: NameFile1</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>
<div>File: NameFile2</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>'''
soup = bs(html, 'lxml')
search_term = 'File: '
files_div = [i.text.replace(search_term,'') for i in soup.select(f'div:contains("{search_term}")')]
files = dict()
for number, file_div in enumerate(files_div):
    if file_div != files_div[-1]:
        files[file_div] = soup.select(f'div:contains("{file_div}"), div:contains("{file_div}") ~ div:not(div:contains("' + files_div[number+1] + '"), div:contains("' + files_div[number+1] + '") ~ div)')
    else:
        files[file_div] = soup.select(f'div:contains("{file_div}"), div:contains("{file_div}") ~ div')
print(files)
If you only want the .text of each tag:
for number, file_div in enumerate(files_div):
    if file_div != files_div[-1]:
        files[file_div] = [i.text for i in soup.select(f'div:contains("{file_div}"), div:contains("{file_div}") ~ div:not(div:contains("' + files_div[number+1] + '"), div:contains("' + files_div[number+1] + '") ~ div)')]
    else:
        files[file_div] = [i.text for i in soup.select(f'div:contains("{file_div}"), div:contains("{file_div}") ~ div')]