Python: open() file function for saved html pages truncates long lines?
I am parsing saved HTML files with Beautiful Soup; see the example below. At first I thought Beautiful Soup was truncating long lines, but apparently it is the open() function.
<!DOCTYPE html>
<html dir="ltr" lang="en-GB">
<head>
<meta charset="utf-8" />
<title>The Title </title>
<meta property="type" content="website" />
<meta property="description" content="This is the text I want, but if its too long it gets truncated"/>
</head>
I want to get the text from the content attribute of the meta tag where property="description". The code I wrote works, but when the text in content is too long it gets truncated. I want to save the text in a variable. Any ideas on how to avoid the truncation so the whole text is saved?
from bs4 import BeautifulSoup

def parse_page(file_path):
    page = open(file_path)
    soup = BeautifulSoup(page.read(), "html.parser")
    page.seek(0)
    for line in page:  # ----> long lines printed here look truncated, so there are problems when saving in the variable answer
        print(line)
    answer = soup.find(property="description")  # ----> truncated output saved
    print('answer--->', answer['content'], 'type', type(answer))  # ---> when printing it is truncated
Here is the code block that calls the function:
import os

path = '/content/HTMLpages'
os.chdir(path)
for file in os.listdir():
    file_path = f"{path}/{file}"
    parse_page(file_path)
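One quick way to check whether open() is actually dropping characters, rather than the console just shortening what it displays, is to compare how many bytes read() returns with the size of each file on disk. A minimal sketch, assuming the same /content/HTMLpages folder as above:

import os

path = '/content/HTMLpages'
for file in os.listdir(path):
    file_path = f"{path}/{file}"
    with open(file_path, 'rb') as fh:  # binary mode so len() counts raw bytes
        data = fh.read()
    # If these two numbers match, open() returned the whole file and any
    # apparent truncation is only happening in the printed display.
    print(file, len(data), os.path.getsize(file_path))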
For the given website: https://support.shell.com/hc/en-gb/articles/115003030052-Where-can-I-download-the-Shell-App-
here is a fully working piece of code:
from bs4 import BeautifulSoup
import cloudscraper

def parse_page(HTML):
    soup = BeautifulSoup(HTML, "html.parser")
    # print(soup)
    # print([tag.name for tag in soup.find_all()])
    answer = soup.find_all('ol')
    print(answer)

url = 'https://support.shell.com/hc/en-gb/articles/115003030052-Where-can-I-download-the-Shell-App-'
scraper = cloudscraper.create_scraper()
page_html = scraper.get(url).text
print("HTML fetched. Calling BS4")
parse_page(page_html)
Output:
HTML fetched. Calling BS4
[<ol class="breadcrumbs">
<li title="Shell Support ">
<a href="/hc/en-gb">Shell Support </a>
</li>
<li title="Shell App">
<a href="/hc/en-gb/categories/115000345732-Shell-App">Shell App</a>
</li>
<li title="General">
<a href="/hc/en-gb/sections/115000744231-General">General</a>
</li>
</ol>, <ol>
<li>Open the <a href="https://itunes.apple.com/gb/app/shell/id484782414?mt=8">Apple iTunes</a> store for <strong>iOS</strong> devices or <a href="https://play.google.com/store/apps/details?id=com.shell.sitibv.motorist&hl=en_GB">Google Play</a> store for <strong>Android</strong> devices</li>
<li>Search for <strong>Shell - </strong>in iTunes for iOS and Google Play for Android <strong> <br/></strong></li>
<li><strong>Install</strong> to add the app to your device</li>
<li>Find the <strong>Shell app</strong> on your device then open to register your details and get started.</li>
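If you only want the step-by-step text rather than the raw tags, get_text() can pull the plain strings out of that list. A small hypothetical follow-on to the code above; indexing [1] assumes the second <ol> on the page is the instruction list (the first is the breadcrumb trail), as in the output shown:

def print_steps(HTML):
    soup = BeautifulSoup(HTML, "html.parser")
    # Second <ol> holds the download instructions, per the output above.
    steps = soup.find_all('ol')[1]
    for li in steps.find_all('li'):
        print(li.get_text(" ", strip=True))

print_steps(page_html)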
The cloudscraper library has to be used here to get past Cloudflare, but that is beside the point.
BeautifulSoup parses the entire HTML without any problem. Since the parts you mentioned wanting to capture sit inside ordered-list tags, I added that as well.
Hopefully this code sample helps you see how bs4 works and clears up any misunderstanding.
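If what you ultimately need is the description meta tag from your saved pages, the same approach works on a local file. A minimal sketch, assuming a hypothetical saved page under /content/HTMLpages; the property keyword acts as an attribute filter in find():

from bs4 import BeautifulSoup

def get_description(file_path):
    # read() returns the whole file; nothing is truncated here.
    with open(file_path, encoding="utf-8") as fh:
        soup = BeautifulSoup(fh.read(), "html.parser")
    tag = soup.find("meta", property="description")
    return tag["content"] if tag else None

text = get_description("/content/HTMLpages/example.html")  # hypothetical file name
print(len(text))  # full length of the stored string
print(text)

If len(text) already reports the full length, the string stored in the variable is complete; any shortening you see afterwards is only in how the output is displayed.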