find_next_siblings not returning a value. How can I extract the two classes I need without the rest of the classes?
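I just want to extract the item weight and product dimensions from the "content" below. What am I missing here? My script does not find what I am looking for. Is there an easier way to extract the item weight and product dimensions? Thanks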
import bs4 as bs
content = '''
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Item Weight
</th>
<td class="a-size-base prodDetAttrValue">
0.16 ounces
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Product Dimensions
</th>
<td class="a-size-base prodDetAttrValue">
4.8 x 3.4 x 0.5 inches
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Batteries Included?
</th>
<td class="a-size-base prodDetAttrValue">
No
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Batteries Required?
</th>
<td class="a-size-base prodDetAttrValue">
No
</td>
</tr>
'''
soup = bs.BeautifulSoup(content, features='lxml')
try:
    product = {
        'weight': soup.find(text='Item Weight').parent.find_next_siblings(),
        'dimension': soup.find(text='Product Dimensions').parent.find_next_siblings()
    }
except:
    product = {
        'weight': 'item unavailable',
        'dimension': 'item unavailable'
    }
print(product)
Output:
{'weight': 'item unavailable', 'dimension': 'item unavailable'}
You are using find_next_siblings incorrectly. The td tag is a sibling of the th tag, not of the parent tr tag.
from bs4 import BeautifulSoup
import re
content = '''
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Item Weight
</th>
<td class="a-size-base prodDetAttrValue">
0.16 ounces
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Product Dimensions
</th>
<td class="a-size-base prodDetAttrValue">
4.8 x 3.4 x 0.5 inches
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Batteries Included?
</th>
<td class="a-size-base prodDetAttrValue">
No
</td>
</tr>
'''
soup = BeautifulSoup(content, 'html.parser')
# Raw strings keep \s as a regex escape instead of an invalid Python string escape.
d = {
    'weight': soup.find('th', text=re.compile(r'\s*Item Weight\s*')).find_next_sibling('td').text.strip(),
    'dimension': soup.find('th', text=re.compile(r'\s*Product Dimensions\s*')).find_next_sibling('td').text.strip()
}
print(d)
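As for the "easier way" part of the question, one option is to collect every th/td pair into a dictionary first and then pick out the two keys you need. This is only a sketch: the helper name table_to_dict and the trimmed-down markup (wrapped in a table tag, class attributes dropped) are assumptions made for illustration, not part of the original page.

```python
from bs4 import BeautifulSoup

# Trimmed-down sample markup (an assumption for this sketch).
html = '''
<table>
<tr>
<th>
Item Weight
</th>
<td>
0.16 ounces
</td>
</tr>
<tr>
<th>
Product Dimensions
</th>
<td>
4.8 x 3.4 x 0.5 inches
</td>
</tr>
<tr>
<th>
Batteries Included?
</th>
<td>
No
</td>
</tr>
</table>
'''


def table_to_dict(markup):
    """Map each th label to the text of the td that follows it (hypothetical helper)."""
    soup = BeautifulSoup(markup, 'html.parser')
    details = {}
    for th in soup.find_all('th'):
        td = th.find_next_sibling('td')
        if td is not None:
            # get_text(strip=True) removes the surrounding newlines.
            details[th.get_text(strip=True)] = td.get_text(strip=True)
    return details


details = table_to_dict(html)
wanted = {
    'weight': details.get('Item Weight', 'item unavailable'),
    'dimension': details.get('Product Dimensions', 'item unavailable'),
}
print(wanted)
```

Because the lookup uses dict.get with a default, a missing row falls back to 'item unavailable' without needing a try/except around the whole block.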
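First, if you want the immediate next sibling, you need to use .find_next_sibling() rather than .find_next_siblings(). Second, the reason you get no output is the way the text inside the tags is represented. If you do this: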
print([each_th.text for each_th in soup.find_all('th')])
You will see that the result looks like this:
['\nItem Weight\n', '\nProduct Dimensions\n', '\nBatteries Included?\n', '\nBatteries Required?\n']
So you need to change text='Item Weight' to text='\nItem Weight\n', and so on:
try:
    product = {
        'weight': soup.find(text='\nItem Weight\n').parent.find_next_sibling().text,
        'dimension': soup.find(text='\nProduct Dimensions\n').parent.find_next_sibling().text
    }
except:
    product = {
        'weight': 'item unavailable',
        'dimension': 'item unavailable'
    }
This gives:
{'weight': '\n0.16 ounces\n', 'dimension': '\n4.8 x 3.4 x 0.5 inches\n'}
Now, if you want to remove those newlines, you can use .replace('\n', '') or .strip() while scraping.
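To make the whitespace point concrete, here is a small sketch showing that an exact string match returns None (so the chained .parent raises an AttributeError that the bare except in the question silently swallows), while an anchored, whitespace-tolerant pattern finds the cell. The helper name get_detail and the shortened markup are assumptions for illustration. It uses string=, which is the newer name for the text= argument in Beautiful Soup 4.4.0 and later.

```python
import re
from bs4 import BeautifulSoup

# Shortened sample markup (an assumption for this sketch).
html = '''
<table>
<tr>
<th>
Item Weight
</th>
<td>
0.16 ounces
</td>
</tr>
<tr>
<th>
Batteries Included?
</th>
<td>
No
</td>
</tr>
</table>
'''
soup = BeautifulSoup(html, 'html.parser')

# The stored string is '\nItem Weight\n', so an exact match on
# 'Item Weight' finds nothing and returns None.
exact = soup.find(string='Item Weight')
print(exact)


def get_detail(soup, label, default='item unavailable'):
    """Return the td text next to the th whose text is `label` (hypothetical helper)."""
    # re.escape protects labels containing regex metacharacters such as '?'.
    pattern = re.compile(r'^\s*' + re.escape(label) + r'\s*$')
    th = soup.find('th', string=pattern)
    if th is None:
        return default
    td = th.find_next_sibling('td')
    return td.get_text(strip=True) if td is not None else default


print(get_detail(soup, 'Item Weight'))
print(get_detail(soup, 'Batteries Included?'))
print(get_detail(soup, 'Color'))
```

Checking for None explicitly, rather than wrapping everything in a bare except, lets one missing field fall back to the default without discarding the fields that were found.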