Parse text divided by <br> but not inside <span>

Question

I don't know how to parse this kind of data:

<div id="tabs-1" class="ui-tabs-panel ui-widget-content ui-corner-bottom">



            <strong><span itemprop="name">MOS-SCHAUM</span></strong><br>
        <span itemprop="description">Antistatická pena čierna na IO 300x300x6mm</span>


        <br>RoHS: Áno           
        <br>Obj.číslo: 13291<br>



        </div>

The snippet may contain many <span> tags, and I don't want to get those. I only want the text that is not inside a <span> tag.

So the result would be:

{'RoHS':'Áno',
 'Obj.číslo': '13291'}

I was thinking about .contents, but it is hard to predict which element will end up at which index.

Do you know how to do this?

EDIT: Even when I try this:

detail_table = soup.find('div',id="tabs-1")                              
itemprops = detail_table.find_all('span',itemprop=re.compile('.+'))      
for item in itemprops:                                                   
    data[item['itemprop']]=item                                          

contents = detail_table.contents[-1].contents[-1].contents[-1].contents

for i,c in enumerate(contents):
    print(c)
    print('---')

I get:

RoHS: Áno           
                                 # 1st element
---
<br>Obj.Ä�Ãslo: 68664<br>
</br></br>                        # 2nd element
---

EDIT2: I just found a solution, but it isn't very good. There must be a more elegant one:

def get_data(url):                                                                 
    data = {}                                                                      
    soup = get_soup(url)                                                           

    """ TECHNICAL INFORMATION """                                                  
    tech_par_table = soup.find('div',id="tabs-2")                                  
    trs = tech_par_table.find_all('tr')                                            
    for tr in trs:                                                                 
        tds = tr.find_all('td')                                                    
        parameter = tds[0].text                                                    
        value = tds[1].text                                                        
        data[parameter]=value                                                      

    """ DETAIL """                                                                 
    detail_table = soup.find('div',id="tabs-1")                                    
    itemprops = detail_table.find_all('span',itemprop=re.compile('.+'))            
    for item in itemprops:                                                         
        data[item['itemprop'].replace('\n','').replace('\t','').strip()]=item.text.

    contents = detail_table.contents[-1].contents[-1].contents[-1].contents        

    for i,c in enumerate(contents):                                                
        if isinstance(c,bs4.element.NavigableString):                              
            splitted = c.split(':')                                                
            data[splitted[0]]=splitted[1].replace('\n','').replace('\t','').strip()
        if isinstance(c,bs4.element.Tag):                                          
            splitted = c.text.split(':')                                           
            data[splitted[0]]=splitted[1].replace('\n','').replace('\t','').strip()

Answer 1

First, get all the br tags and use the .next_element attribute to get whatever was parsed immediately after each br tag; here, that is your text.

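from bs4 import NavigableString

d = {}

# soup is the BeautifulSoup object for the page
for br in soup.find_all('br'):
    nxt = br.next_element
    # a <br> may be followed by another tag instead of text; skip those
    if not isinstance(nxt, NavigableString):
        continue
    text = nxt.strip()
    if ':' in text:
        # split on the first colon only, in case the value contains one
        key, value = text.split(':', 1)
        d[key.strip()] = value.strip()
print(d)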

Output:

{'Obj.číslo': '13291', 'RoHS': 'Áno'}
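
If you would rather not depend on the br tags at all, a variant of the same idea is to walk the direct children of the div and keep only the bare strings. Text inside a <span> (or <strong>) shows up as a Tag at that level, so it is skipped automatically. This is only a sketch under a few assumptions: the markup is the sample from the question, the parser is html.parser (which treats <br> as an empty tag, so the loose text stays a direct child of the div), and parse_loose_text is a name made up for illustration. If your parser nests the br tags, as the output in the question suggests, the text will not be a direct child and this will not find it.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample markup copied from the question.
HTML = """
<div id="tabs-1" class="ui-tabs-panel ui-widget-content ui-corner-bottom">
    <strong><span itemprop="name">MOS-SCHAUM</span></strong><br>
    <span itemprop="description">Antistatická pena čierna na IO 300x300x6mm</span>
    <br>RoHS: Áno
    <br>Obj.číslo: 13291<br>
</div>
"""


def parse_loose_text(html):
    """Collect 'key: value' pairs from text sitting directly inside div#tabs-1."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", id="tabs-1")
    result = {}
    for child in div.children:
        # Anything wrapped in <span> or <strong> is a Tag here, so skip it.
        if not isinstance(child, NavigableString):
            continue
        # Split on the first colon only, so a value may itself contain colons.
        key, sep, value = child.partition(":")
        if sep:  # ignore whitespace-only strings and text without a colon
            result[key.strip()] = value.strip()
    return result


print(parse_loose_text(HTML))
```

On the sample above this should print {'RoHS': 'Áno', 'Obj.číslo': '13291'}.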

html

python

parsing

beautifulsoup

web-scraping