How to clean up pulled data from BeautifulSoup, Pandas, Python
Hi all, I have information pulled with BeautifulSoup, but I can't seem to get it to print cleanly so that I can send it on to pandas and Excel.
html_f ='''
<li class="list-group-item">
<div>
<div class="tyler-toggle-controller open">
<p class="text-primary">
07/01/2022 Date
<span class="caret"> </span>
</p>
</div>
<div class="tyler-toggle-container row-buff" style="display: block; overflow: hidden;">
<p class="col-sm-12 col-md-12">
<span class="text-muted">Comment</span><br>
[1] Comments
</p>
</div>
</div>
</li>'''
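My code to pull out the data I want:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_f,'html.parser')
# Iterates over the direct children of the first matching <li>
for child in soup.findAll('li',class_='list-group-item')[0]:
    print (child.text)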
This is the information it pulls, but it prints oddly, with a lot of blank space between the lines:
07/01/2022 Date
Comment
[1] Comments
Ideally I only need the top portion (the date and file date) printed, but at the very least I need help getting it into a list format such as:
07/01/2022 Date
Comment
[1] Comments
Here is my attempt so far:
doc='''
<li class="list-group-item">
<div>
<div class="tyler-toggle-controller open">
<p class="text-primary">
07/01/2022 Date
<span class="caret">
</span>
</p>
</div>
<div class="tyler-toggle-container row-buff" style="display: block; overflow: hidden;">
<p class="col-sm-12 col-md-12">
<span class="text-muted">
Comment
</span>
<br/>
[1] Comments
</p>
</div>
</div>
</li>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(doc, 'html.parser')
text=[' '.join(child.get_text(strip=True).split(' ')).replace(' DateComment[1]',',') for child in soup.find_all('li',class_='list-group-item')]
print(text)
Output:
['07/01/2022, Comments']
Try this approach; it should work for the sample above:
text=' '.join([' '.join(child.get_text(strip=True).split(' ')).replace(' DateComment[1]',',') for child in soup.find_all('li',class_='list-group-item')]).strip()
#Or
text= [' '.join(child.get_text(strip=True).split(' ')).replace(' DateComment[1]',',') for child in soup.find_all('li',class_='list-group-item')]
# text holds one cleaned string per <li>, so the sample's only entry is at index 0
final_text= text[0]
final_text= text[0].split(', ')#if you want to make list
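The `replace(' DateComment[1]', ',')` call relies on the exact fused text of this one sample, so it may break as soon as the comment count changes. As a minimal sketch of an alternative, assuming the same markup, you could pass a separator to `get_text()` so the pieces stay apart and then split on it. The `|` separator and the names `rows` and `parts` are my own illustrative choices, not part of the answer above.

```python
from bs4 import BeautifulSoup

# Compact copy of the sample markup from the question
doc = '''
<li class="list-group-item">
<div>
<div class="tyler-toggle-controller open">
<p class="text-primary">07/01/2022 Date <span class="caret"> </span></p>
</div>
<div class="tyler-toggle-container row-buff">
<p class="col-sm-12 col-md-12">
<span class="text-muted">Comment</span><br>
[1] Comments
</p>
</div>
</div>
</li>
'''

soup = BeautifulSoup(doc, 'html.parser')

rows = []
for li in soup.find_all('li', class_='list-group-item'):
    # strip=True drops whitespace-only strings; the separator keeps the
    # remaining pieces from being fused together
    parts = li.get_text(separator='|', strip=True).split('|')
    rows.append(parts)

print(rows)
```

Each entry in `rows` is then a list of the text pieces for one `<li>`, which can be indexed instead of patched up with `replace()`. This assumes the text itself never contains the chosen separator.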
To print the information as expected in the question, you could use stripped_strings and iterate over its elements:
for e in soup.find_all('li',class_='list-group-item'):
    for t in list(e.stripped_strings):
        print(t)
Note: In new code use find_all() instead of the old syntax findAll().
Example
html='''
<li class="list-group-item">
<div>
<div class="tyler-toggle-controller open">
<p class="text-primary">
07/01/2022 Date
<span class="caret">
</span>
</p>
</div>
<div class="tyler-toggle-container row-buff" style="display: block; overflow: hidden;">
<p class="col-sm-12 col-md-12">
<span class="text-muted">
Comment
</span>
<br/>
[1] Comments
</p>
</div>
</div>
</li>
'''
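from bs4 import BeautifulSoup
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, 'html.parser')
for e in soup.find_all('li',class_='list-group-item'):
    for t in list(e.stripped_strings):
        print(t)
Output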
07/01/2022 Date
Comment
[1] Comments
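The question also mentions only needing the top part (the date). A minimal sketch of that, assuming the first stripped string of every item always has the form "<date> Date" as in the sample; the name `dates` is just illustrative:

```python
from bs4 import BeautifulSoup

# Compact copy of the sample markup from the question
html = '''
<li class="list-group-item">
<div>
<div class="tyler-toggle-controller open">
<p class="text-primary">07/01/2022 Date <span class="caret"> </span></p>
</div>
<div class="tyler-toggle-container row-buff">
<p class="col-sm-12 col-md-12">
<span class="text-muted">Comment</span><br>
[1] Comments
</p>
</div>
</div>
</li>
'''

soup = BeautifulSoup(html, 'html.parser')

dates = []
for e in soup.find_all('li', class_='list-group-item'):
    strings = list(e.stripped_strings)
    # Assumption: the first string looks like "07/01/2022 Date",
    # so the first whitespace-separated token is the date itself
    dates.append(strings[0].split()[0])

print(dates)
```

If an item can be missing its date paragraph, `strings[0]` would pick up the wrong text or raise an `IndexError`, so a check on `strings` would be needed for real pages.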
Since you mention pandas (I am not sure what exactly you need there), you could also select each piece of information, clean it, and append it to a list of dicts:
import pandas as pd

data = []
for e in soup.find_all('li',class_='list-group-item'):
    data.append({
        'date': e.p.text.strip().replace(' Date',''),
        'comment': e.select_one('.tyler-toggle-container br').next_sibling.strip()
    })
pd.DataFrame(data)
Or
data = [{
    'date':soup.select_one('li.list-group-item .text-primary').text.strip().replace(' Date',''),
    'comment':soup.select_one('li.list-group-item .tyler-toggle-container br').next_sibling.strip()
}]
Output
date | comment |
---|---|
07/01/2022 | [1] Comments |
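Since the goal in the question is to get this into Excel, here is a rough end-to-end sketch under the same assumptions about the markup. It writes CSV to an in-memory buffer, which Excel can open and which needs no extra package. The buffer, the variable names, and the file name in the comment are hypothetical choices of mine.

```python
import io

import pandas as pd
from bs4 import BeautifulSoup

# Compact copy of the sample markup from the question
html = '''
<li class="list-group-item">
<div>
<div class="tyler-toggle-controller open">
<p class="text-primary">07/01/2022 Date <span class="caret"> </span></p>
</div>
<div class="tyler-toggle-container row-buff">
<p class="col-sm-12 col-md-12">
<span class="text-muted">Comment</span><br>
[1] Comments
</p>
</div>
</div>
</li>
'''

soup = BeautifulSoup(html, 'html.parser')

data = []
for e in soup.find_all('li', class_='list-group-item'):
    data.append({
        # First <p> holds "07/01/2022 Date"; drop the trailing label
        'date': e.p.text.strip().replace(' Date', ''),
        # The text node right after <br> holds the comment count
        'comment': e.select_one('.tyler-toggle-container br').next_sibling.strip(),
    })

df = pd.DataFrame(data)

# Write CSV into memory; pass a file path instead to write to disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# For a real .xlsx file, df.to_excel('comments.xlsx', index=False) is the
# usual call; it needs an Excel engine such as openpyxl to be installed.
```

The `date` column stays a plain string here. Converting it with `pd.to_datetime` would need a decision on whether 07/01/2022 means month/day or day/month, which the sample does not settle.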