Parsing a weird HTML structure using Python - second version
I asked a question yesterday and understood the answer clearly, but now I have a trickier problem.
First, here is the HTML structure I want to parse:
<body>
<div id="links">
<a href='url1'>apple-explain</a>
<blackquote>
<a href='url1'>link-1</a>
</blackquote>
</div>
<div id="info">
<p>apple</p></div>
<div id="links">
<a href='batch_url1'>bear-explain</a>
<blackquote>
<a href='url2'>link-1</a>
<a href='url3'>link-2</a>
</blackquote>
</div>
<div id="info">
<p>bear</p></div>
<div id="links">
<a href='url4'>cat-explain</a>
<blackquote>
<a href='url4'>link-1</a>
</blackquote>
</div>
<div id="info">
<p>cat</p></div>
<div id="links">
<a href='batchurl2'>duck-explain</a>
<blackquote>
<a href='url5'>link-1</a>
<a href='url6'>link-2</a>
<a href='url7'>link-3</a>
</blackquote>
</div>
<div id="info">
<p>duck</p></div>
<div id="links">
<a href='url8'>egg-explain</a></div>
<blackquote>
<a href='url8'>link-1</a>
</blackquote>
</div>
<div id="info">
<p>egg</p></div>
#etc
</body>
It looks a bit long, but the structure is simple:
<div id="links">
<a href=url>some explain</a>
<blackquote>
<a href=url>link number</a>
</blackquote></div>
<div id="info">
<p>info keyword</p></div>
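My goal is to grab all URLs in each <div id="links">, remove duplicates, and match them to the info keywords.
For example, the apple part has two <a> tags, but they share the same href.
The bear part has three <a> tags with three different hrefs: one directly in the <div> and two in the <blackquote>.
I want to build clean tuples and print them.
The tuples are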
(apple, url1)
(bear, [batch_url1, url2, url3])
etc...
The printed form is
url1 = apple
batch_url1 = bear
url2 = bear
url3 = bear
etc
Here is my code:
soup = BeautifulSoup("""that HTML""")
url_list = soup.find_all('div', attrs={'id': 'links'})
info_list = soup.find_all('div', attrs={'id': 'links'})
for url, info in zip(url_list, info_list):
    for temp in url.find_all():
        infokeyword = info.text
        urls = temp.attrs['href']
        zipped = zip(infokeyword, urls)
        d=len(infokeyword)
        for n in range(0, d+1):
            print(str(infokeyword[n]) + " = " + str(urls[n]))
When I run it, the result is:
Traceback (most recent call last):
File "D:/Users/Hyungsoo/PycharmProjects/untitled1/zx.py", line 59, in <module>
urls = temp.attrs['href']
KeyError: 'href'
How can I produce this output?
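The traceback is consistent with url.find_all() being called with no arguments: it returns every descendant tag, including <blackquote>, which has no href attribute, so temp.attrs['href'] raises KeyError. A minimal sketch of that behaviour, assuming a shortened copy of the HTML above (one block only) and the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Shortened copy of the HTML from the question (assumption: one block is enough).
html = """
<div id="links">
<a href='batch_url1'>bear-explain</a>
<blackquote>
<a href='url2'>link-1</a>
<a href='url3'>link-2</a>
</blackquote>
</div>
<div id="info">
<p>bear</p></div>
"""

soup = BeautifulSoup(html, "html.parser")
links_div = soup.find("div", attrs={"id": "links"})

# find_all() with no arguments returns every descendant tag, <blackquote> included.
all_tags = [tag.name for tag in links_div.find_all()]

# Tags without an href are the ones that fail on tag.attrs['href'].
tags_without_href = [
    tag.name for tag in links_div.find_all() if "href" not in tag.attrs
]

# Restricting the search to <a> tags that carry an href avoids the error.
hrefs = [a["href"] for a in links_div.find_all("a", href=True)]

print(all_tags)
print(tags_without_href)
print(hrefs)
```

Passing href=True is one way to skip anchors without the attribute; using a.get('href') and checking for None would be another.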
To keep only the distinct URLs, you can use collections.defaultdict with set as the default_factory.
In [72]: from collections import defaultdict
In [73]: from bs4 import BeautifulSoup
In [74]: soup = BeautifulSoup("""<body>
....: <div id="links">
....: <a href='url1'>apple-explain</a>
....: <blackquote>
....: <a href='url1'>link-1</a>
....: </blackquote>
....: </div>
....: <div id="info">
....: <p>apple</p></div>
....:
....: <div id="links">
....: <a href='batch_url1'>bear-explain</a>
....: <blackquote>
....: <a href='url2'>link-1</a>
....: <a href='url3'>link-2</a>
....: </blackquote>
....: </div>
....: <div id="info">
....: <p>bear</p></div>
....:
....: <div id="links">
....: <a href='url4'>cat-explain</a>
....: <blackquote>
....: <a href='url4'>link-1</a>
....: </blackquote>
....: </div>
....: <div id="info">
....: <p>cat</p></div>
....:
....: <div id="links">
....: <a href='batchurl2'>duck-explain</a>
....: <blackquote>
....: <a href='url5'>link-1</a>
....: <a href='url6'>link-2</a>
....: <a href='url7'>link-3</a>
....: </blackquote>
....: </div>
....: <div id="info">
....: <p>duck</p></div>
....:
....: <div id="links">
....: <a href='url8'>egg-explain</a></div>
....: <blackquote>
....: <a href='url8'>link-1</a>
....: </blackquote>
....: </div>
....: <div id="info">
....: <p>egg</p></div>
....: #etc
....: </body>""")
In [75]: distinct_url = defaultdict(set)
In [76]: links = soup.select('div#links')
In [77]: infos = soup.select('div#info p')
In [78]: for k, v in zip(links, infos):
....:     for l in k.find_all('a'):
....:         distinct_url[v.text].add(l.attrs['href'])
....:
In [79]: distinct_url
Out[79]: defaultdict(<class 'set'>, {'apple': {'url1'}, 'duck': {'url5', 'url7', 'url6', 'batchurl2'}, 'bear': {'batch_url1', 'url3', 'url2'}, 'cat': {'url4'}, 'egg': {'url8'}})
In [80]: for info, lks in distinct_url.items():
....:     for lk in lks:
....:         print(info, lk)
....:
apple url1
duck url5
duck url7
duck url6
duck batchurl2
bear batch_url1
bear url3
bear url2
cat url4
egg url8
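The session above stores the URLs in a set, so their order within each keyword is not preserved, and it prints "keyword url" rather than the "url = keyword" form asked for in the question. A minimal sketch of an order-preserving variant follows. The helper name pair_urls_with_keywords is hypothetical, the HTML is a shortened copy of the question's markup, and html.parser is an assumed parser choice:

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's HTML (assumption: two blocks are enough).
html = """
<body>
<div id="links">
<a href='url1'>apple-explain</a>
<blackquote>
<a href='url1'>link-1</a>
</blackquote>
</div>
<div id="info">
<p>apple</p></div>
<div id="links">
<a href='batch_url1'>bear-explain</a>
<blackquote>
<a href='url2'>link-1</a>
<a href='url3'>link-2</a>
</blackquote>
</div>
<div id="info">
<p>bear</p></div>
</body>
"""


def pair_urls_with_keywords(markup):
    """Hypothetical helper: map each info keyword to its distinct URLs, in document order."""
    soup = BeautifulSoup(markup, "html.parser")
    link_divs = soup.find_all("div", attrs={"id": "links"})
    info_divs = soup.find_all("div", attrs={"id": "info"})

    result = {}
    # Pair the n-th "links" div with the n-th "info" div, as the answer does.
    for link_div, info_div in zip(link_divs, info_divs):
        keyword = info_div.find("p").get_text(strip=True)
        urls = result.setdefault(keyword, [])
        for a in link_div.find_all("a", href=True):
            # A list plus a membership test keeps first-seen order and drops duplicates.
            if a["href"] not in urls:
                urls.append(a["href"])
    return result


pairs = pair_urls_with_keywords(html)
for keyword, urls in pairs.items():
    for url in urls:
        print(url + " = " + keyword)
```

The list-based deduplication is linear per lookup, which is fine for a handful of links per block; for large pages a set alongside the list would be the usual trade-off.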