How to clean up data pulled from BeautifulSoup, Pandas, Python

Hi everyone, I have some information that I pulled using BeautifulSoup, but I can't seem to print it out correctly so that I can send it to pandas/Excel.

html_f ='''
<li class="list-group-item">
<div>
<div class="tyler-toggle-controller open">
<p class="text-primary">
07/01/2022 Date
<span class="caret"> </span>
</p>
</div>
<div class="tyler-toggle-container row-buff" style="display: block; overflow: hidden;">
<p class="col-sm-12 col-md-12">
<span class="text-muted">Comment</span><br>
[1] Comments
</p>
</div>
</div>
</li>'''

My code to pull the data I want:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_f, 'html.parser')
# iterate over the direct children of the first matching <li>
for child in soup.findAll('li', class_='list-group-item')[0]:
    print(child.text)

This is the information it pulls, but it prints out oddly, with a lot of extra spacing:

        07/01/2022   Date





  Comment
       [1] Comments

Ideally, I only need the top part printed out (the date and file date), but at the very least I need help getting it into a list format like:

07/01/2022 Date
Comment
[1] Comments

So far so good; here is my attempt:

doc='''

<li class="list-group-item">
 <div>
  <div class="tyler-toggle-controller open">
   <p class="text-primary">
    07/01/2022 Date
    <span class="caret">
    </span>
   </p>
  </div>
  <div class="tyler-toggle-container row-buff" style="display: block; overflow: hidden;"> 
   <p class="col-sm-12 col-md-12">
    <span class="text-muted">
     Comment
    </span>
    <br/>
    [1] Comments
   </p>
  </div>
 </div>
</li>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(doc, 'html.parser')

text=[' '.join(child.get_text(strip=True).split(' ')).replace(' DateComment[1]',',') for child in soup.find_all('li',class_='list-group-item')]
print(text)

Output:

['07/01/2022, Comments']   
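
The replace() call above depends on the exact run-together string ' DateComment[1]', so it breaks as soon as the comment number changes. As a minimal sketch of a less brittle variant, get_text() also accepts a separator argument, so the stripped fragments can be joined with a newline and split back into a list. The names sample and parts are illustrative and not from the original post, and the markup is a shortened copy of the one in the question.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (illustrative)
sample = '''
<li class="list-group-item">
 <div>
  <div class="tyler-toggle-controller open">
   <p class="text-primary">
    07/01/2022 Date
    <span class="caret"> </span>
   </p>
  </div>
  <div class="tyler-toggle-container row-buff">
   <p class="col-sm-12 col-md-12">
    <span class="text-muted">Comment</span><br>
    [1] Comments
   </p>
  </div>
 </div>
</li>'''

soup = BeautifulSoup(sample, 'html.parser')
li = soup.find('li', class_='list-group-item')

# strip=True drops whitespace-only fragments; the separator keeps the rest apart
parts = li.get_text(separator='\n', strip=True).split('\n')
print(parts)
```

This yields one list entry per text fragment, which is the list format asked for in the question, without hard-coding any of the text.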

Try this approach; it should work:

# Option 1: join the cleaned text of every matching <li> into a single string
text=' '.join([' '.join(child.get_text(strip=True).split(' ')).replace(' DateComment[1]',',') for child in soup.find_all('li',class_='list-group-item')]).strip()
# Option 2: keep one cleaned string per <li> in a list
text= [' '.join(child.get_text(strip=True).split(' ')).replace(' DateComment[1]',',') for child in soup.find_all('li',class_='list-group-item')]
final_text = text[0]  # the sample has a single <li>, so only index 0 exists
final_list = text[0].split(', ')  # if you want a list: ['07/01/2022', 'Comments']



  

To print the information as expected in your question, you could use stripped_strings and iterate over its elements:

for e in soup.find_all('li',class_='list-group-item'):
    for t in list(e.stripped_strings):
        print(t)

Note: In new code, use find_all() instead of the old syntax findAll().

Example

html='''
<li class="list-group-item">
 <div>
  <div class="tyler-toggle-controller open">
   <p class="text-primary">
    07/01/2022 Date
    <span class="caret">
    </span>
   </p>
  </div>
  <div class="tyler-toggle-container row-buff" style="display: block; overflow: hidden;"> 
   <p class="col-sm-12 col-md-12">
    <span class="text-muted">
     Comment
    </span>
    <br/>
    [1] Comments
   </p>
  </div>
 </div>
</li>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning

for e in soup.find_all('li',class_='list-group-item'):
    for t in list(e.stripped_strings):
        print(t)

Output

07/01/2022 Date
Comment
[1] Comments
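
Since the question asks for a list rather than printed lines, the same stripped_strings generator can be collected instead of printed. The following is a rough sketch under that assumption; the names sample, rows and dates are illustrative, and taking the first whitespace-separated token as the date assumes every item starts with a line shaped like "07/01/2022 Date".

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (illustrative)
sample = '''
<li class="list-group-item">
 <div>
  <p class="text-primary">
   07/01/2022 Date
   <span class="caret"> </span>
  </p>
  <div class="tyler-toggle-container row-buff">
   <p class="col-sm-12 col-md-12">
    <span class="text-muted">Comment</span><br/>
    [1] Comments
   </p>
  </div>
 </div>
</li>'''

soup = BeautifulSoup(sample, 'html.parser')

# One list of cleaned strings per <li>
rows = [list(li.stripped_strings)
        for li in soup.find_all('li', class_='list-group-item')]
print(rows)

# Assumption: the first string of each item is "<date> Date",
# so the first whitespace-separated token is the date itself
dates = [row[0].split()[0] for row in rows]
print(dates)
```

If only the top part is needed, dates already holds it; rows keeps everything in case the comment text is wanted later.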

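I'm not sure what you need exactly, but since you mention pandas, you could also select each piece of information, clean it, and append it to a list of dicts: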

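import pandas as pd

data = []
for e in soup.find_all('li',class_='list-group-item'):
    data.append({
        # the first <p> holds "07/01/2022 Date"; drop the trailing label
        'date': e.p.text.strip().replace(' Date',''),
        # the comment text is the node right after the <br>
        'comment': e.select_one('.tyler-toggle-container br').next_sibling.strip()
    })
pd.DataFrame(data)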

Or, if there is only one such item, build the record directly without a loop:

data = [{
    'date':soup.select_one('li.list-group-item .text-primary').text.strip().replace(' Date',''),
    'comment':soup.select_one('li.list-group-item .tyler-toggle-container br').next_sibling.strip()
}]

Output

date        comment
07/01/2022  [1] Comments
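
To get from a list of dicts like the one above to a file, something along these lines might work. It is only a sketch: the data list is hard-coded here to mirror the output shown, and the file name comments.xlsx is a hypothetical placeholder. The Excel line is left commented out because DataFrame.to_excel() needs an Excel writer engine such as openpyxl to be installed; the CSV export has no extra dependency.

```python
import pandas as pd

# Hard-coded stand-in for the list of dicts scraped above (illustrative)
data = [{'date': '07/01/2022', 'comment': '[1] Comments'}]

df = pd.DataFrame(data)

# index=False leaves out the 0, 1, 2, ... row labels
csv_text = df.to_csv(index=False)
print(csv_text)

# Hypothetical file name; requires an Excel engine such as openpyxl
# df.to_excel('comments.xlsx', index=False)
```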