Unable to parse HTML with BeautifulSoup / lxml
I have an HTML page and want to extract some items from it, but I am having trouble doing this with BeautifulSoup or lxml.
The HTML:
<li class="context-card">
<div class="episode" data-id="t1">
<span class="av-play">Title to scrape</span>
</div>
</li>
<li class="context-card">
<div class="episode" data-id="t2">
<span class="av-play">Title2 to scrape</span>
</div>
</li>
<li class="context-card">
<div class="episode" data-id="t3">
<span class="av-play">Title3 to scrape</span>
</div>
</li>
How can I get all three IDs and titles as separate dictionaries in a list, like this?
[{'id':'t1', 'title': 'Title to scrape'}, {'id':'t2', 'title': 'Title2 to scrape'}, {'id':'t3', 'title': 'Title3 to scrape'}]
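All the titles and IDs you need are inside the <div> tags that have the attribute class="episode". So the task is to iterate over those tags and, for each one, read the 'data-id' attribute of the div and the text of its inner span.
Code: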
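# Requires the beautifulsoup4 package; the 'lxml' parser used below
# also needs the lxml package to be installed.
from bs4 import BeautifulSoup

html = '''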
<li class="context-card">
<div class="episode" data-id="t1">
<span class="av-play">Title to scrape</span>
</div>
</li>
<li class="context-card">
<div class="episode" data-id="t2">
<span class="av-play">Title2 to scrape</span>
</div>
</li>
<li class="context-card">
<div class="episode" data-id="t3">
<span class="av-play">Title3 to scrape</span>
</div>
</li>
'''
soup = BeautifulSoup(html, 'lxml')
title_list = []
for ep in soup.find_all('div', class_='episode'):
    # ep['data-id'] reads the attribute; ep.span is the first <span> inside the div
    curr_dict = {'id': ep['data-id'], 'title': ep.span.text}
    title_list.append(curr_dict)
print(title_list)
Output:
[{'id': 't1', 'title': 'Title to scrape'},
{'id': 't2', 'title': 'Title2 to scrape'},
{'id': 't3', 'title': 'Title3 to scrape'}]
Alternatively, the same can be done with a list comprehension:
title_list = [{'id': ep['data-id'], 'title': ep.span.text} for ep in soup.find_all('div', class_='episode')]
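On a real page some cards may lack a data-id attribute or a title span, in which case ep['data-id'] raises KeyError and ep.span.text raises AttributeError. Below is a minimal sketch of a more defensive variant. The function name extract_episodes and the sample markup are made up for illustration, and it assumes a Beautiful Soup version with CSS selector support (4.7 or later); it uses the built-in 'html.parser' so lxml is not required.

```python
from bs4 import BeautifulSoup


def extract_episodes(html):
    """Return [{'id': ..., 'title': ...}] for each div.episode that has both parts.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    soup = BeautifulSoup(html, "html.parser")
    episodes = []
    # The CSS selector only matches episode divs that carry a data-id attribute.
    for ep in soup.select("div.episode[data-id]"):
        span = ep.find("span", class_="av-play")
        if span is None:
            continue  # skip cards without a title span
        # get_text(strip=True) trims surrounding whitespace from the title.
        episodes.append({"id": ep["data-id"], "title": span.get_text(strip=True)})
    return episodes


# Made-up sample: one complete card, one without data-id, one without a span.
sample = '''
<li class="context-card"><div class="episode" data-id="t1"><span class="av-play"> Title to scrape </span></div></li>
<li class="context-card"><div class="episode"><span class="av-play">No id here</span></div></li>
<li class="context-card"><div class="episode" data-id="t3"></div></li>
'''
print(extract_episodes(sample))
```

Only the first card survives here, because the other two are each missing one of the required parts. Whether to skip incomplete cards or record them with a placeholder value depends on what the data is used for.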