Extract a specific header from HTML using Beautiful Soup

Question

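This is the example patent I am using: https://patents.google.com/patent/EP1208209A1/en?oq=medicinal+chemistry. Below is the code I am using. I want the code to show only the "Cited By" count (3), so that I know how many times the patent has been cited. How can I make the output show only the citation count, 3? Please help!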

 
soup = BeautifulSoup(patent, 'html.parser')
cited_section =soup.findAll({"h2":"Cited By"})

print(cited_section)
The output I get is [<h2>Info</h2>, <h2>Links</h2>, <h2>Images</h2>, <h2>Classifications</h2>, <h2>Abstract</h2>, <h2>Description</h2>, <h2>Claims (<span itemprop="count">57</span>)</h2>, <h2>Priority Applications (5)</h2>, <h2>Applications Claiming Priority (1)</h2>, <h2>Related Parent Applications (1)</h2>, <h2>Publications (2)</h2>, <h2>ID=38925605</h2>, <h2>Family Applications (1)</h2>, <h2>Country Status (1)</h2>, <h2>Cited By (3)</h2>, <h2>Families Citing this family (12)</h2>, <h2>Citations (306)</h2>, <h2>Patent Citations (348)</h2>, <h2>Non-Patent Citations (23)</h2>, <h2>Cited By (4)</h2>, <h2>Also Published As</h2>, <h2>Similar Documents</h2>, <h2>Legal Events</h2>]

Answer 1

The citation count is created dynamically by JavaScript. However, you can count the elements with itemprop="forwardReferencesFamily" to get the count. For example:

import requests
from bs4 import BeautifulSoup


url = 'https://patents.google.com/patent/EP1208209A1/en?oq=medicinal+chemistry'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

print(len(soup.select('tr[itemprop="forwardReferencesFamily"]')))

This prints the number of matching rows.
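If the header text itself is present in the HTML you parse (as the "Cited By (3)" entry in the question's output suggests), another option is to read the number straight out of the header. The call in the question returns every h2 because, as far as I can tell, a dict passed as the first argument is treated as a collection of tag names, so only the key "h2" is used and "Cited By" is ignored. Below is a minimal sketch of the header approach. The `sample_html` string and the `cited_by_count` helper are made up for illustration and are not part of the original answer.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical, trimmed-down sample modelled on the headers shown in the question.
sample_html = """
<h2>Claims (<span itemprop="count">57</span>)</h2>
<h2>Cited By (3)</h2>
<h2>Families Citing this family (12)</h2>
<h2>Patent Citations (348)</h2>
"""


def cited_by_count(html):
    """Return the number in the first 'Cited By (N)' <h2>, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    pattern = re.compile(r"^\s*Cited By \((\d+)\)")
    for h2 in soup.find_all("h2"):
        # get_text() flattens nested tags, so "Claims (<span>57</span>)" reads "Claims (57)".
        match = pattern.match(h2.get_text())
        if match:
            return int(match.group(1))
    return None


print(cited_by_count(sample_html))  # prints 3
```

Note that the question's output contains two "Cited By" headers, (3) and (4). This sketch returns the first one only, so adjust the loop if you need the second.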

Answer 2

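Hi all, here is the link: https://patents.google.com/patent/WO2012061469A3/en?oq=medicinal+chemistry. I want the code to print the patent citations, giving the publication number and title of each. Then I want to use pandas to put the publication numbers in one column and the titles in another. So far I have used Beautiful Soup to convert the HTML file into a readable format. I have selected the backward-references HTML tag, and I want it to print the publication number and title of the citations under that tag. I have only given one example, but I have a folder full of HTML files that I will process later.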

x=soup.select('tr[itemprop="backwardReferences"]') 
y=soup.select('td[itemprop="title"]') # this line gives all the titles in the document not particularly under the patent citations
print(y)
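One way to keep the titles tied to the patent citations is to search inside each backwardReferences row instead of searching the whole document. The sketch below assumes that each row holds an element with itemprop="publicationNumber" and a cell with itemprop="title". That structure, the `sample_html` data, and the `backward_citations` helper are assumptions for illustration, so check them against your own files.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup: the row and cell structure here is an assumption.
sample_html = """
<table>
  <tr itemprop="backwardReferences">
    <td><span itemprop="publicationNumber">US0000001A</span></td>
    <td itemprop="title">First example title</td>
  </tr>
  <tr itemprop="backwardReferences">
    <td><span itemprop="publicationNumber">US0000002A</span></td>
    <td itemprop="title">Second example title</td>
  </tr>
  <tr itemprop="forwardReferences">
    <td><span itemprop="publicationNumber">US0000003A</span></td>
    <td itemprop="title">Not a patent citation</td>
  </tr>
</table>
"""


def backward_citations(html):
    """Build a two-column DataFrame from the backwardReferences rows only."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select('tr[itemprop="backwardReferences"]'):
        # Search within this row, so titles from other sections are not picked up.
        number = tr.select_one('[itemprop="publicationNumber"]')
        title = tr.select_one('[itemprop="title"]')
        rows.append({
            "publication_number": number.get_text(strip=True) if number else None,
            "title": title.get_text(strip=True) if title else None,
        })
    return pd.DataFrame(rows, columns=["publication_number", "title"])


df = backward_citations(sample_html)
print(df)
```

For a folder of files, the same function could be called on the text of each file found with pathlib.Path(folder).glob("*.html"), and the resulting frames combined with pd.concat.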

html

python

parsing

extract

beautifulsoup