Automatically generate nested table of contents based on heading tags using python
I am trying to create a nested table of contents based on the heading tags of an HTML document.
My HTML file:
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<h1>
My report Name
</h1>
<h1 id="2">First Chapter </h1>
<h2 id="3"> First Sub-chapter of the first chapter</h2>
<ul>
<h1 id="text1">Useless h1</h1>
<p>
some text
</p>
</ul>
<h2 id="4">Second Sub-chapter of the first chapter </h2>
<ul>
<h1 id="text2">Useless h1</h1>
<p>
some text
</p>
</ul>
<h1 id="5">Second Chapter </h1>
<h2 id="6">First Sub-chapter of the Second chapter </h2>
<ul>
<h1 id="text6">Useless h1</h1>
<p>
some text
</p>
</ul>
<h2 id="7">Second Sub-chapter of the Second chapter </h2>
<ul>
<h1 id="text6">Useless h1</h1>
<p>
some text
</p>
</ul>
</body>
</html>
My Python code:
from lxml import html
from bs4 import BeautifulSoup as soup
import re
import codecs

# Access the local HTML file
f = codecs.open(r"C:\x\test.html", 'r')
page = f.read()
f.close()

# html parsing
page_soup = soup(page, "html.parser")
tree = html.fromstring(page)

# extract report name (the only h1 without an id attribute)
ref = page_soup.find("h1", {"id": False}).text.strip()
print("the name of the report is : " + ref + " \n")

chapters = page_soup.findAll('h1', attrs={'id': re.compile("^[0-9]*$")})
print("We have " + str(len(chapters)) + " chapter(s)")
for index, chapter in enumerate(chapters):
    print(str(index+1) + "-" + str(chapter.text.strip()) + "\n")

sub_chapters = page_soup.findAll('h2', attrs={'id': re.compile("^[0-9]*$")})
print("We have " + str(len(sub_chapters)) + " sub_chapter(s)")
for index, sub_chapter in enumerate(sub_chapters):
    print(str(index+1) + "-" + str(sub_chapter.text.strip()) + "\n")
With this code I can get all the chapters and all the sub-chapters, but that is not my goal.
My goal is to get the following as my table of contents:
1-First Chapter
	1-First sub-chapter of the first chapter
	2-Second sub-chapter of the first chapter
2-Second Chapter
	1-First sub-chapter of the Second chapter
	2-Second sub-chapter of the Second chapter
Any suggestions or ideas on how to achieve the table of contents format I want?
You can use itertools.groupby after finding all the data associated with each chapter:
from itertools import groupby, count
import re
from bs4 import BeautifulSoup as soup

# content holds the HTML document shown above, e.g. content = open('test.html').read()
data = [[i.name, re.sub(r'\s+$', '', i.text)] for i in soup(content, 'html.parser').find_all(re.compile('h1|h2'), {'id': re.compile(r'^\d+$')})]
grouped, _count = [[a, list(b)] for a, b in groupby(data, key=lambda x: x[0] == 'h1')], count(1)
new_grouped = [[grouped[i][-1][0][-1], [c for _, c in grouped[i+1][-1]]] for i in range(0, len(grouped), 2)]
final_string = '\n'.join(f'{next(_count)}-{a}\n' + '\n'.join(f'\t{i}-{c}' for i, c in enumerate(b, 1)) for a, b in new_grouped)
print(final_string)
Output:
1-First Chapter
	1- First Sub-chapter of the first chapter
	2-Second Sub-chapter of the first chapter
2-Second Chapter
	1-First Sub-chapter of the Second chapter
	2-Second Sub-chapter of the Second chapter
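The dense comprehensions above hinge on itertools.groupby splitting the flat heading list each time the tag flips between h1 and h2, producing alternating chapter/sub-chapter runs. A minimal standalone illustration of that splitting behavior, with made-up heading data:

```python
from itertools import groupby

# flat (tag, text) pairs in document order, as find_all would yield them
headings = [('h1', 'First Chapter'), ('h2', 'sub A'), ('h2', 'sub B'),
            ('h1', 'Second Chapter'), ('h2', 'sub C')]

# groupby starts a new group every time the key (is this an h1?) changes
groups = [(is_chapter, [text for _, text in items])
          for is_chapter, items in groupby(headings, key=lambda h: h[0] == 'h1')]

print(groups)
# [(True, ['First Chapter']), (False, ['sub A', 'sub B']),
#  (True, ['Second Chapter']), (False, ['sub C'])]
```

The answer's code then pairs each True group (a chapter) with the False group that follows it, which is why it iterates over `grouped` in steps of two.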
If you are willing to change your HTML layout to something like this:
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<article>
<h1>
My report Name
</h1>
<section>
<h2 id="chapter-one">First Chapter</h2>
<section>
<h3 id="one-one"> First Sub-chapter of the first chapter</h3>
<ul>
<h4 id="text1">Useless h4</h4>
<p>
some text
</p>
</ul>
</section>
<section>
<h3 id="one-two">Second Sub-chapter of the first chapter</h3>
<ul>
<h4 id="text2">Useless h4</h4>
<p>
some text
</p>
</ul>
</section>
</section>
<section>
<h2 id="chapter-two">Second Chapter </h2>
<section>
<h3 id="two-one">First Sub-chapter of the Second chapter</h3>
<ul>
<h4 id="text6">Useless h4</h4>
<p>
some text
</p>
</ul>
</section>
<section>
<h3 id="two-two">Second Sub-chapter of the Second chapter</h3>
<ul>
<h4 id="text6">Useless h4</h4>
<p>
some text
</p>
</ul>
</section>
</section>
</article>
</body>
</html>
then your Python code becomes much simpler:
from bs4 import BeautifulSoup as soup
import codecs

# Access the local HTML file
with codecs.open("index.html", 'r') as f:
    page = f.read()

# html parsing
page_soup = soup(page, "html.parser")

# extract report name
ref = page_soup.find("h1").text.strip()
print("the name of the report is : " + ref + " \n")

chapters = page_soup.findAll('h2')
for index, chapter in enumerate(chapters):
    print(str(index+1) + "-" + str(chapter.text.strip()))
    sub_chapters = chapter.find_parent().find_all("h3")
    for index2, sub_chapter in enumerate(sub_chapters):
        print("\t" + str(index2+1) + "-" + str(sub_chapter.text.strip()))
I updated the page-reading code slightly and tried to use more idiomatic Python in the updated script.
Also, note that in:
sub_chapters = chapter.find_parent().find_all("h3")
the find_all is relative to the chapter's parent section, not the whole document.
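As a minimal illustration of that scoping behavior (using a made-up two-section snippet, not the report HTML above), compare a parent-scoped find_all with a document-wide one:

```python
from bs4 import BeautifulSoup

html_doc = """
<body>
  <section><h2>A</h2><h3>A.1</h3></section>
  <section><h2>B</h2><h3>B.1</h3><h3>B.2</h3></section>
</body>
"""
doc = BeautifulSoup(html_doc, "html.parser")

first_h2 = doc.find("h2")
# scoped: only the h3 tags inside the first h2's parent <section>
scoped = [h.text for h in first_h2.find_parent().find_all("h3")]
# document-wide: every h3 anywhere in the document
everywhere = [h.text for h in doc.find_all("h3")]

print(scoped)       # ['A.1']
print(everywhere)   # ['A.1', 'B.1', 'B.2']
```

This is exactly why wrapping each chapter in its own &lt;section&gt; makes the nesting trivial: each chapter heading's parent contains only that chapter's sub-headings.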