Web scraping using beautifulsoup: separating values

Question

I am using beautifulsoup for web scraping. The web page has the following source:

<a href="/en/Members/">
                            Courtney, John  (Dem)                       </a>,
<a href="/en/Members/">
                            Clinton, Hilary  (Dem)                      </a>,
<a href="/en/Members/">
                            Lee, Kevin  (Rep)                       </a>,

The following code works.

for item in soup.find_all("a"):
    print item

However, the code returns the following:

Courtney, John  (Dem)
Clinton, Hilary  (Dem)
Lee, Kevin  (Rep)

Can I collect only the names, and the affiliation information separately? Thanks in advance.

Answer 1

You can use re.split() to split a string on multiple delimiters by building a regular expression pattern to split on. Here I split on ( and ).

import re

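for item in soup.find_all("a"):
    # re.split() needs a string, so pass the tag's text rather than the Tag itself
    tokens = re.split(r'\(|\)', item.get_text())
    name = tokens[0].strip()         # text before "("
    affiliation = tokens[1].strip()  # text between "(" and ")"
    print(name)
    print(affiliation)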

Source: https://docs.python.org/2/library/re.html#re.split

re.split() returns a list like the following:

>>> re.split(r'\(|\)', item.get_text().strip())
['Courtney, John  ', 'Dem', '']

Take entry 0 of the list as the name and strip the whitespace from its ends. Take entry 1 as the affiliation and do the same.
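As a variation on the splitting approach above, a single regular expression can capture the name and the affiliation in one step. This is only a minimal sketch: the pattern `NAME_PARTY` and the helper `split_name_party` are hypothetical names, and the sample strings stand in for the text you would get from `item.get_text()`.

```python
import re

# Hypothetical pattern: everything before the final "(...)" is the name,
# and the text inside those parentheses is the affiliation.
NAME_PARTY = re.compile(
    r'^\s*(?P<name>.*?)\s*\((?P<party>[^()]*)\)\s*$', re.DOTALL
)

def split_name_party(text):
    """Return (name, party), or (stripped text, None) if there is no "(...)" suffix."""
    match = NAME_PARTY.match(text)
    if match is None:
        return text.strip(), None
    return match.group('name'), match.group('party').strip()

# Sample strings shaped like the link text in the question.
samples = [
    "\n        Courtney, John  (Dem)        ",
    "\n        Lee, Kevin  (Rep)        ",
]
for raw in samples:
    print(split_name_party(raw))
```

Returning None for the affiliation when no parentheses are present avoids the IndexError that `tokens[1]` would raise on such a link.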

Answer 2

You can use:

from bs4 import BeautifulSoup

content = '''
<a href="/en/Members/">Courtney, John  (Dem)</a>
<a href="/en/Members/">Clinton, Hilary  (Dem)</a>,
<a href="/en/Members/">Lee, Kevin  (Rep)</a>
'''

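politicians = []
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(content, 'html.parser')
for item in soup.find_all('a'):
    # Split the link text at the opening parenthesis
    name, party = item.text.strip().rsplit('(')
    # [:-1] drops the closing parenthesis from the party
    politicians.append((name.strip(), party.strip()[:-1]))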

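Since the name and the affiliation together form the text content of the a tag, they cannot be collected separately. You have to collect them as one string and then split them. I use strip() to remove unwanted whitespace and rsplit('(') to split the text content at the opening parenthesis.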

Output

print(politicians)
[(u'Courtney, John', u'Dem'),
 (u'Clinton, Hilary', u'Dem'),
 (u'Lee, Kevin', u'Rep')]
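One caveat with unpacking the result of rsplit('(') is that it raises a ValueError when a link has no parenthesis, or more than one. A possible workaround, sketched here under the assumption that the affiliation is always the last parenthesised part, is str.rpartition, which always returns exactly three pieces. The helper name `split_affiliation` is hypothetical, and the sample strings stand in for `item.text`.

```python
def split_affiliation(text):
    """Split 'Name  (Party)' into (name, party) at the last '('."""
    text = text.strip()
    name, sep, rest = text.rpartition('(')
    if not sep:
        # No "(" found: rpartition leaves the whole string in `rest`
        return text, None
    # Drop the closing parenthesis and any stray whitespace
    return name.strip(), rest.rstrip(')').strip()

print(split_affiliation("  Clinton, Hilary  (Dem)  "))  # ('Clinton, Hilary', 'Dem')
print(split_affiliation("No Party"))                    # ('No Party', None)
```

Inside the loop, `politicians.append(split_affiliation(item.text))` would then replace the two lines that split and strip by hand.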

python

beautifulsoup

python-2.7