BeautifulSoup: selecting certain values only
Here is part of the web page source:
<tr>
<td>
<a href="/docdollars/doctors/pid/36602">
<h6>Jane</h6>
</a>
Allopathic & Osteopathic Physicians/Internal Medicine
</td>
<td>
<p>NY Medical Ctr<br>New York City,
<a href="/docdollars/states/NY">NY</a>
</p>
</td>
</tr>
<tr>
<td>
<a href="/docdollars/doctors/pid/1091514">
<h6>Greg</h6>
</a>
Allopathic & Osteopathic Physicians/Family Medicine
</td>
<td>
<p>57950 NYC<br>New York City,
<a href="/docdollars/states/NY">NY</a>
</p>
</td>
</tr>
I would like the scraped data to look like this:
Jane, Allopathic & Osteopathic Physicians/Internal Medicine, NY Medical Ctr, New York City, NY
Greg, Allopathic & Osteopathic Physicians/Family Medicine, 57950 NYC, New York City, NY
My code (below) partially works (see the comments in it).
for i in item.find_all('tr'):
    print(i.find('a').find('h6').text)  # working fine
    print(i.find('td').next_sibling.next_sibling.find('p').text.strip())  # this needs revision
    print(i.find('td').text.strip())  # this needs revision
Thanks in advance for any suggestions!
Focus on finding the <h6> elements with a CSS selector, then locate the accompanying information from there:
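for header in soup.select('tr td a h6'):
    # the doctor's name is the text of the <h6> itself
    name = header.get_text(strip=True)
    # the practice is the bare text node that follows the parent <a>
    practice = header.parent.find_next_sibling(text=True).strip()
    # the address lives in the <td> after the one containing the <h6>
    address = header.find_parent('td').find_next_sibling('td').get_text(' ', strip=True)
    print(name, practice, address)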
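This finds all h6 elements contained in a <tr><td><a> wrapper. From there, we can go back up to the parent element (the <a> link) and find the next piece of text, and we can also find the parent <td> in order to reach the next <td>, which contains the remaining text.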
Given your sample input in a variable named soup, this produces:
>>> for header in soup.select('tr td a h6'):
...     name = header.get_text(strip=True)
...     practice = header.parent.find_next_sibling(text=True).strip()
...     address = header.find_parent('td').find_next_sibling('td').get_text(' ', strip=True)
...     print(name, practice, address)
...
Jane Allopathic & Osteopathic Physicians/Internal Medicine NY Medical Ctr New York City, NY
Greg Allopathic & Osteopathic Physicians/Family Medicine 57950 NYC New York City, NY
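The output above is space-separated, while the question asked for comma-separated fields. A minimal sketch of one way to get there, assuming the same navigation as above: instead of flattening the address cell with get_text, walk its stripped_strings and join the pieces with commas. The helper name extract_rows, the SAMPLE_HTML constant, the surrounding <table> wrapper, and the choice of the html.parser backend are my assumptions for illustration, not part of the original page. It also uses string=True, which is the newer spelling of text=True in recent Beautiful Soup releases.

```python
from bs4 import BeautifulSoup

# Sample rows from the question; the <table> wrapper is an assumption
# added so that stricter parsers keep the <tr>/<td> elements.
SAMPLE_HTML = """
<table>
<tr>
<td>
<a href="/docdollars/doctors/pid/36602">
<h6>Jane</h6>
</a>
Allopathic & Osteopathic Physicians/Internal Medicine
</td>
<td>
<p>NY Medical Ctr<br>New York City,
<a href="/docdollars/states/NY">NY</a>
</p>
</td>
</tr>
<tr>
<td>
<a href="/docdollars/doctors/pid/1091514">
<h6>Greg</h6>
</a>
Allopathic & Osteopathic Physicians/Family Medicine
</td>
<td>
<p>57950 NYC<br>New York City,
<a href="/docdollars/states/NY">NY</a>
</p>
</td>
</tr>
</table>
"""


def extract_rows(html):
    """Return one comma-separated line per <h6> found in a table row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for header in soup.select("tr td a h6"):
        name = header.get_text(strip=True)
        # Bare text node right after the <a> that wraps the <h6>.
        practice = header.parent.find_next_sibling(string=True).strip()
        # The next <td> holds the address, split across several text nodes.
        address_cell = header.find_parent("td").find_next_sibling("td")
        # Drop the trailing comma already present in "New York City,".
        parts = [s.rstrip(",") for s in address_cell.stripped_strings]
        rows.append(", ".join([name, practice] + parts))
    return rows


if __name__ == "__main__":
    for line in extract_rows(SAMPLE_HTML):
        print(line)
```

Stripping the trailing comma before joining avoids a doubled separator after the city name; if the real page has address cells with more or fewer text nodes, the line simply gets more or fewer fields.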