Creating an HTML table from a list in Python BeautifulSoup

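I am using bs4 in Python, and I want to take the contents of a Python list and insert them into HTML with bs4, so that the resulting HTML table can be posted to a website link with the requests.put() method. The HTML looks like this, where each row is wrapped in the tag: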

<tr></tr>

Each cell, i.e. one data element in each column of a row, is represented by the tag:

<td></td>

So each data element goes inside a td tag, wrapped in a p tag, for example:

<tr><td><p>data 1 in cell 1</p></td><td><p>data 2 in cell 2</p></td></tr>

The data that should go into the HTML table is stored in a list, as follows:

rows = ["1" + "````" + "Mon, 22 Feb 2021 13:44:27 -0800" + "````" + "Jam" + "````" + "IAP-5998" + "````" + "10004" + "````" + "Model Observing a ModelIPCException" + "````" + "1ba4416fdd7", "2" + "````" + "Mon, 30 Feb 2021 13:44:27 -0800" + "````" + "Rizwan" + "````" + "IAP-6998" + "````" + "10014" + "````" + "Model Observing." + "````" + "3ba4416fdd7", "3" + "````" + "Fri, 20 Mar 2021 13:44:27 -0800" + "````" + "John" + "````" + "ATL-5998" + "````" + "10456" + "````" + "Exception during JumpToROM function call." + "````" + "8ca4416fdd7", "4" + "````" + "Mon, 14 Feb 2021 13:44:27 -0800" + "````" + "Brock Lesnar" + "````" + "IAP-6005" + "````" + "10009" + "````" + "RAM flushing JumpToROM function call." + "````" + "1ba4416fd10"]

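So each element of the list corresponds to one row, and the cells within it are separated by "````": 1 goes into the first cell and Jam into the third cell of the first row. The HTML table string should start with a table header and end with a table footer, as shown below: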

html_table_header = "<p><br /></p><table><colgroup><col style=\"width: 115.0px;\" /><col style=\"width: 95.0px;\" /><col style=\"width: 58.0px;\" /><col style=\"width: 105.0px;\" /><col style=\"width: 110.0px;\" /><col style=\"width: 215.0px;\" /><col style=\"width: 215.0px;\" /></colgroup><tbody><tr><th><p>No.</p></th><th><p>Date and Time</p></th><th><p>Author</p></th><th><p>Jira</p></th><th><p>PR</p></th><th><p>Title</p></th><th><p>Commit ID</p></th></tr>"

html_table_footer = "</tbody></table><p class=\"auto-cursor-target\"><br /></p>"

So the overall HTML that makes up the table should look like this:

<p><br /></p><table><colgroup><col style=\"width: 115.0px;\" /><col style=\"width: 95.0px;\" /><col style=\"width: 58.0px;\" /><col style=\"width: 105.0px;\" /><col style=\"width: 110.0px;\" /><col style=\"width: 215.0px;\" /><col style=\"width: 215.0px;\" /></colgroup><tbody><tr><th><p>No.</p></th><th><p>Date and Time</p></th><th><p>Author</p></th><th><p>Jira</p></th><th><p>PR</p></th><th><p>Title</p></th><th><p>Commit ID</p></th></tr><tr><td><p>1</p></td><td><p>Mon, 22 Feb 2021 13:44:27 -0800</p></td><td><p>Jam</p></td><td><p>IAP-5998</p></td><td><p>10004</p></td><td><p>Model Observing a ModelIPCException</p></td><td><p>1ba4416fdd7</p></td></tr><tr><td><p>2</p></td><td><p>Mon, 30 Feb 2021 13:44:27 -0800</p></td><td><p>Rizwan</p></td><td><p>IAP-6998</p></td><td><p>10014</p></td><td><p>Model Observing</p></td><td><p>1ba4416fdd7</p></td></tr>....................................Other elements in list according to rows go here.............</tbody></table><p class=\"auto-cursor-target\"><br /></p>
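
The row format described above can be checked in isolation. The following is only a small sketch: it rebuilds the first list entry with the four-backtick separator and splits it again, which should yield one field per table column. The names SEP, row and cells are illustrative and not part of the code further down.

```python
# Separator used between the fields of one row (four backticks).
SEP = "````"

# First entry of the list from the question, joined with the separator.
row = SEP.join([
    "1",
    "Mon, 22 Feb 2021 13:44:27 -0800",
    "Jam",
    "IAP-5998",
    "10004",
    "Model Observing a ModelIPCException",
    "1ba4416fdd7",
])

# Splitting on the separator gives one string per table cell.
cells = row.split(SEP)
print(len(cells))          # 7, one per column in the header
print(cells[0], cells[2])  # 1 Jam
```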

Here is the code I am using:

import re
import sys
import requests
import json
from requests.auth import HTTPBasicAuth
from bs4 import BeautifulSoup

html_table_header = "<p><br /></p><table><colgroup><col style=\"width: 115.0px;\" /><col style=\"width: 95.0px;\" /><col style=\"width: 58.0px;\" /><col style=\"width: 105.0px;\" /><col style=\"width: 110.0px;\" /><col style=\"width: 215.0px;\" /><col style=\"width: 215.0px;\" /></colgroup><tbody><tr><th><p>No.</p></th><th><p>Date and Time</p></th><th><p>Author</p></th><th><p>Jira</p></th><th><p>PR</p></th><th><p>Title</p></th><th><p>Commit ID</p></th></tr>"

html_table_footer = "</tbody></table><p class=\"auto-cursor-target\"><br /></p>"

rows = ["1" + "````" + "Mon, 22 Feb 2021 13:44:27 -0800" + "````" + "Jam" + "````" + "IAP-5998" + "````" + "10004" + "````" + "Model Observing a ModelIPCException" + "````" + "1ba4416fdd7", "2" + "````" + "Mon, 30 Feb 2021 13:44:27 -0800" + "````" + "Rizwan" + "````" + "IAP-6998" + "````" + "10014" + "````" + "Model Observing." + "````" + "3ba4416fdd7", "3" + "````" + "Fri, 20 Mar 2021 13:44:27 -0800" + "````" + "John" + "````" + "ATL-5998" + "````" + "10456" + "````" + "Exception during JumpToROM function call." + "````" + "8ca4416fdd7", "4" + "````" + "Mon, 14 Feb 2021 13:44:27 -0800" + "````" + "Brock Lesnar" + "````" + "IAP-6005" + "````" + "10009" + "````" + "RAM flushing JumpToROM function call." + "````" + "1ba4416fd10"]

row_string = ""
for idx in range(0, len(rows)):
    soup = BeautifulSoup("<tr></tr>", 'html.parser')
    for cell_id in range(0, 7):
        original_tag = soup.tr
        new_tag = soup.new_tag("td")
        original_tag.append(new_tag)
        p_tag = soup.new_tag("p")
        original_tag.td.next_sibling.append(p_tag)
        original_tag.p.string = rows[idx].split("````")[cell_id]
        row_string += str(original_tag)

pass_str = html_table_header + row_string + html_table_footer
pass_string = str(pass_str).replace('\"', '\"')

headers = {
    'Content-Type': 'application/json',
}

data = '{"id":"534756378","type":"page", "title":"GL_Engine Output","space":{"key":"CSSAI"},"body":{"storage":{"value":"' + pass_string + '","representation":"storage"}}, "version":{"number":2}}'

response = requests.put('https://confluence.ai.com/rest/api/content/534756378', headers=headers, data=data,
                        auth=HTTPBasicAuth('svc-Automation@ai.com', 'AIengineering1@ai'))

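But with my code, only the first field of each list element (the numbers 1, 2, 3, and so on) ends up in the correct cell; the other fields are still inserted into the first column. When the table is posted to the website it looks wrong: only the table header is correct, while all the other fields are lumped together in the first column. I looked at the HTML that the rest/api call posted to my site, and it does not look right.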

I think you can use pandas to view the table, and build the table HTML in a loop over the rows with a list comprehension and split:

import pandas as pd
from pandas import read_html as rh

# Print wide DataFrames on one line instead of wrapping the columns
pd.set_option('display.expand_frame_repr', False)

html_table_header = "<p><br /></p><table><colgroup><col style=\"width: 115.0px;\" /><col style=\"width: 95.0px;\" /><col style=\"width: 58.0px;\" /><col style=\"width: 105.0px;\" /><col style=\"width: 110.0px;\" /><col style=\"width: 215.0px;\" /><col style=\"width: 215.0px;\" /></colgroup><tbody><tr><th><p>No.</p></th><th><p>Date and Time</p></th><th><p>Author</p></th><th><p>Jira</p></th><th><p>PR</p></th><th><p>Title</p></th><th><p>Commit ID</p></th></tr>"

html_table_footer = "</tbody></table><p class=\"auto-cursor-target\"><br /></p>"

rows = ["1" + "````" + "Mon, 22 Feb 2021 13:44:27 -0800" + "````" + "Jam" + "````" + "IAP-5998" + "````" + "10004" + "````" + "Model Observing a ModelIPCException" + "````" + "1ba4416fdd7", "2" + "````" + "Mon, 30 Feb 2021 13:44:27 -0800" + "````" + "Rizwan" + "````" + "IAP-6998" + "````" + "10014" + "````" + "Model Observing." + "````" + "3ba4416fdd7", "3" + "````" + "Fri, 20 Mar 2021 13:44:27 -0800" + "````" + "John" + "````" + "ATL-5998" + "````" + "10456" + "````" + "Exception during JumpToROM function call." + "````" + "8ca4416fdd7", "4" + "````" + "Mon, 14 Feb 2021 13:44:27 -0800" + "````" + "Brock Lesnar" + "````" + "IAP-6005" + "````" + "10009" + "````" + "RAM flushing JumpToROM function call." + "````" + "1ba4416fd10"]
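body = ''

# Build one <tr> per list entry, wrapping every field in <td><p>...</p></td>
for row in rows:
    body += '<tr>' + ''.join([f'<td><p>{i}</p></td>' for i in row.split('````')]) + '</tr>'

html = html_table_header + body + html_table_footer
# Parse the assembled HTML back into a DataFrame to check the layout.
# Newer pandas releases expect a file-like object; if a warning appears, pass io.StringIO(html) instead.
print(rh(html)[0])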
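
Once the HTML string is assembled, it still has to go into the JSON body of the PUT request. The question builds that JSON by concatenating strings, which is fragile because the HTML contains double quotes. A possible alternative, shown here only as a sketch, is to let json.dumps do the escaping. The helper name build_payload is hypothetical; the field values are the ones from the question, and no request is actually sent.

```python
import json


def build_payload(page_id, title, space_key, html, version):
    """Return the request body as a JSON string (hypothetical helper)."""
    payload = {
        "id": page_id,
        "type": "page",
        "title": title,
        "space": {"key": space_key},
        "body": {"storage": {"value": html, "representation": "storage"}},
        "version": {"number": version},
    }
    # json.dumps escapes the double quotes inside the HTML value.
    return json.dumps(payload)


# Short stand-in for the full table HTML.
sample_html = '<p class="auto-cursor-target"><br /></p>'
data = build_payload("534756378", "GL_Engine Output", "CSSAI", sample_html, 2)
print(data)
```

The resulting string can be passed as data=data to requests.put() exactly as in the question. Alternatively, requests accepts the dictionary itself through its json= argument and serializes it for you.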


Including bs4 (which seems a bit redundant):

from bs4 import BeautifulSoup as bs

soup = bs(html, 'lxml')
print(html)
print(rh(str(soup))[0])
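
If the rows really do need to be built with bs4 itself, the following is one way it might look; treat it as a sketch rather than the definitive fix. In bs4, attribute access such as tr.td or tr.p always returns the first matching descendant, so it cannot be used to reach later cells. Creating each td and p explicitly and appending them in order avoids that. The helper build_row and the shortened sample_rows are illustrative only.

```python
from bs4 import BeautifulSoup

SEP = "````"


def build_row(soup, row):
    """Return a <tr> tag with one <td><p>...</p></td> per field (hypothetical helper)."""
    tr = soup.new_tag("tr")
    for text in row.split(SEP):
        td = soup.new_tag("td")
        p = soup.new_tag("p")
        p.string = text   # set the text on this specific <p>, not on tr.p
        td.append(p)
        tr.append(td)     # cells are appended in column order
    return tr


# Shortened sample data: only the first three fields of two rows.
sample_rows = [
    "1````Mon, 22 Feb 2021 13:44:27 -0800````Jam",
    "2````Mon, 30 Feb 2021 13:44:27 -0800````Rizwan",
]

# An empty soup is enough; it is only used as a factory for new tags.
soup = BeautifulSoup("", "html.parser")
row_string = "".join(str(build_row(soup, r)) for r in sample_rows)
print(row_string)
```

The row string is serialized once per row, after all of its cells exist, and can then be placed between html_table_header and html_table_footer as before.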