Python: open() file function for saved html pages truncates long lines?

I am parsing saved HTML files with Beautiful Soup; an example is below. At first I thought Beautiful Soup was truncating long lines, but apparently it is the open function.

<!DOCTYPE html>
<html dir="ltr" lang="en-GB">
<head>
<meta charset="utf-8" />
<title>The Title </title>

<meta property="type" content="website" />

<meta property="description" content="This is the text I want, but if its too long it gets truncated"/>
</head>

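I want to get the text in the content attribute of the tag where property=description. The code I wrote works, but when the text in content is too long it gets truncated. I want to save the text in a variable. Any ideas on how to avoid the truncation so that the whole text is saved?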

from bs4 import BeautifulSoup

def parse_page(file_path):
    # Read the file once; a second pass over the same handle would yield nothing
    with open(file_path, encoding="utf-8") as page:
        html = page.read()
    for line in html.splitlines():  # ----> long lines look truncated when printed here
        print(line)
    soup = BeautifulSoup(html, "html.parser")
    answer = soup.find(property="description")  # ----> truncated output saved
    print('answer--->', answer['content'], 'type', type(answer))  # ---> truncated when printed

This is the code block that calls the function:

import os

path='/content/HTMLpages'
os.chdir(path)

for file in os.listdir():
    file_path = f"{path}/{file}"
    parse_page(file_path)
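
One way to check whether open() is really dropping characters is to compare lengths instead of trusting what the console displays. The following is a minimal sketch under made-up assumptions: the temporary file and the 5,000-character description are hypothetical test data, not taken from the pages in question.

```python
import os
import tempfile

# Hypothetical test data: a meta tag whose content is 5,000 characters long.
long_text = "x" * 5000
html = f'<meta property="description" content="{long_text}"/>\n'

# Write the page to a temporary file, then read it back the usual way.
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(html)
    tmp_path = tmp.name

with open(tmp_path, encoding="utf-8") as page:
    lines = page.readlines()

os.remove(tmp_path)

# If open() truncated long lines, these two lengths would differ.
print(len(html), len(lines[0]))
```

If the two numbers match, the file is being read in full, and any shortening is more likely happening where the output is displayed (for example a notebook or terminal that wraps or clips long lines) than in open() itself.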

For the given website: https://support.shell.com/hc/en-gb/articles/115003030052-Where-can-I-download-the-Shell-App-

Here is fully working code:

from bs4 import BeautifulSoup
import cloudscraper

def parse_page(HTML):
    soup = BeautifulSoup(HTML, "html.parser")
    # print(soup)
    # print([tag.name for tag in soup.find_all()])
    answer= soup.find_all('ol')
    print(answer)

url = 'https://support.shell.com/hc/en-gb/articles/115003030052-Where-can-I-download-the-Shell-App-'

scraper = cloudscraper.create_scraper()  
page_html = scraper.get(url).text  
print("HTML fetched. Calling BS4")
parse_page(page_html)

Output

HTML fetched. Calling BS4
[<ol class="breadcrumbs">
<li title="Shell Support ">
<a href="/hc/en-gb">Shell Support </a>
</li>
<li title="Shell App">
<a href="/hc/en-gb/categories/115000345732-Shell-App">Shell App</a>
</li>
<li title="General">
<a href="/hc/en-gb/sections/115000744231-General">General</a>
</li>
</ol>, <ol>
<li>Open the <a href="https://itunes.apple.com/gb/app/shell/id484782414?mt=8">Apple iTunes</a> store for <strong>iOS</strong> devices or <a href="https://play.google.com/store/apps/details?id=com.shell.sitibv.motorist&amp;hl=en_GB">Google Play</a> store for <strong>Android</strong> devices</li>
<li>Search for <strong>Shell - </strong>in iTunes for iOS and Google Play for Android    <strong>       <br/></strong></li>
<li><strong>Install</strong> to add the app to your device</li>
<li>Find the <strong>Shell app</strong> on your device then open to register your details and get started.</li>

The cloudscraper library is needed to get past Cloudflare, but that is beside the point.

BeautifulSoup parses the entire HTML without any trouble. Since you mentioned capturing the points, and they sit inside the ordered-list tags, I added that part as well.
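
The same can be checked for the meta description from the question. This is only a sketch: the HTML string stands in for a saved page, and the long description is invented test data. It stores the attribute value in a variable and compares it with the original text instead of relying on printed output.

```python
from bs4 import BeautifulSoup

# Hypothetical page: same structure as the sample, with a description of
# roughly 3,000 characters.
description = " ".join(["word"] * 600)
html = f"""<!DOCTYPE html>
<html><head>
<meta property="type" content="website" />
<meta property="description" content="{description}"/>
</head></html>"""

soup = BeautifulSoup(html, "html.parser")
tag = soup.find("meta", property="description")
content = tag["content"]  # plain str holding the attribute value

# Compare against the original rather than eyeballing the console.
print(len(content), len(description))
print(content == description)
```

If the comparison holds for your own files too, the variable contains the whole text, and writing it to a file or checking len(content) is a more reliable test than reading it off the screen.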

I hope this code sample helps you understand how bs4 works and clears up any misunderstanding.