Python: open() file function for saved html pages truncates long lines?
I am parsing saved HTML files with Beautiful Soup; see the example below. At first I thought Beautiful Soup was truncating long lines, but apparently it is the open() function.
<!DOCTYPE html>
<html dir="ltr" lang="en-GB">
<head>
<meta charset="utf-8" />
<title>The Title </title>
<meta property="type" content="website" />
<meta property="description" content="This is the text I want, but if its too long it gets truncated"/>
</head>
I want to get the text from the content attribute of the meta tag where property="description". The code I wrote works, but when the text in content is too long it gets truncated. I want to save the text in a variable. Any ideas on how to avoid the truncation so the whole text is saved?
from bs4 import BeautifulSoup

def parse_page(file_path):
    page = open(file_path)
    soup = BeautifulSoup(page.read(), "html.parser")
    page.seek(0)
    for line in page:  # ----> long lines printed here look truncated, so there are problems when saving in the variable answer
        print(line)
    answer = soup.find(property="description")  # ----> truncated output saved
    print('answer--->', answer['content'], 'type', type(answer))  # ---> when printing it is truncated
Here is the code block that calls the function:
import os

path = '/content/HTMLpages'
os.chdir(path)
for file in os.listdir():
    file_path = f"{path}/{file}"
    parse_page(file_path)
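One quick way to check whether open() is actually dropping characters, rather than the console just shortening what it displays, is to compare how many bytes read() returns with the size of each file on disk. A minimal sketch, assuming the same /content/HTMLpages folder as above:

import os

path = '/content/HTMLpages'
for file in os.listdir(path):
    file_path = f"{path}/{file}"
    with open(file_path, 'rb') as fh:  # binary mode so len() counts raw bytes
        data = fh.read()
    # If these two numbers match, open() returned the whole file and any
    # apparent truncation is only happening in the printed display.
    print(file, len(data), os.path.getsize(file_path))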
For the given website: https://support.shell.com/hc/en-gb/articles/115003030052-Where-can-I-download-the-Shell-App-
here is a fully working piece of code:
from bs4 import BeautifulSoup
import cloudscraper

def parse_page(HTML):
    soup = BeautifulSoup(HTML, "html.parser")
    # print(soup)
    # print([tag.name for tag in soup.find_all()])
    answer = soup.find_all('ol')
    print(answer)

url = 'https://support.shell.com/hc/en-gb/articles/115003030052-Where-can-I-download-the-Shell-App-'
scraper = cloudscraper.create_scraper()
page_html = scraper.get(url).text
print("HTML fetched. Calling BS4")
parse_page(page_html)
Output:
HTML fetched. Calling BS4
[<ol class="breadcrumbs">
<li title="Shell Support ">
<a href="/hc/en-gb">Shell Support </a>
</li>
<li title="Shell App">
<a href="/hc/en-gb/categories/115000345732-Shell-App">Shell App</a>
</li>
<li title="General">
<a href="/hc/en-gb/sections/115000744231-General">General</a>
</li>
</ol>, <ol>
<li>Open the <a href="https://itunes.apple.com/gb/app/shell/id484782414?mt=8">Apple iTunes</a> store for <strong>iOS</strong> devices or <a href="https://play.google.com/store/apps/details?id=com.shell.sitibv.motorist&hl=en_GB">Google Play</a> store for <strong>Android</strong> devices</li>
<li>Search for <strong>Shell - </strong>in iTunes for iOS and Google Play for Android <strong> <br/></strong></li>
<li><strong>Install</strong> to add the app to your device</li>
<li>Find the <strong>Shell app</strong> on your device then open to register your details and get started.</li>
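If you only want the step-by-step text rather than the raw tags, get_text() can pull the plain strings out of that list. A small hypothetical follow-on to the code above; indexing [1] assumes the second <ol> on the page is the instruction list (the first is the breadcrumb trail), as in the output shown:

def print_steps(HTML):
    soup = BeautifulSoup(HTML, "html.parser")
    # Second <ol> holds the download instructions, per the output above.
    steps = soup.find_all('ol')[1]
    for li in steps.find_all('li'):
        print(li.get_text(" ", strip=True))

print_steps(page_html)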
The cloudscraper library has to be used here to get past Cloudflare, but that is beside the point.
BeautifulSoup parses the entire HTML without any problem. Since the parts you mentioned wanting to capture sit inside ordered-list tags, I added that as well.
Hopefully this code sample helps you see how bs4 works and clears up any misunderstanding.
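If what you ultimately need is the description meta tag from your saved pages, the same approach works on a local file. A minimal sketch, assuming a hypothetical saved page under /content/HTMLpages; the property keyword acts as an attribute filter in find():

from bs4 import BeautifulSoup

def get_description(file_path):
    # read() returns the whole file; nothing is truncated here.
    with open(file_path, encoding="utf-8") as fh:
        soup = BeautifulSoup(fh.read(), "html.parser")
    tag = soup.find("meta", property="description")
    return tag["content"] if tag else None

text = get_description("/content/HTMLpages/example.html")  # hypothetical file name
print(len(text))  # full length of the stored string
print(text)

If len(text) already reports the full length, the string stored in the variable is complete; any shortening you see afterwards is only in how the output is displayed.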