Python: extract numeric values from a string via regular expressions and a list comprehension

I want to extract this:

3.76    2.35    3.30    5.08     NaN    8.44    10.00
3.76    2.35    3.30    4.99    6.63    8.42    10.00
1.50    1.50    1.60    2.00    2.60    3.35    3.85
NaN      NaN    NaN     NaN     NaN     0.00    0.00

from the following return value of a bs4 operation:

[<td class="font-bold">Ergebnis je Aktie (unverwässert, nach Steuern)</td>, <td>3,76</td>, <td>2,35</td>, <td>3,30</td>, <td>5,08</td>, <td>-</td>, <td>8,44</td>, <td>10,00</td>,
 <td class="font-bold">Ergebnis je Aktie (verwässert, nach Steuern)</td>, <td>3,76</td>, <td>2,35</td>, <td>3,30</td>, <td>4,99</td>, <td>6,63</td>, <td>8,42</td>, <td>10,00</td>,
 <td class="font-bold">Dividende je Aktie</td>, <td>1,50</td>, <td>1,50</td>, <td>1,60</td>, <td>2,00</td>, <td>2,60</td>, <td>3,35</td>, <td>3,85</td>,
 <td class="font-bold">Gesamtdividendenausschüttung in Mio.</td>, <td>-</td>, <td>-</td>, <td>-</td>, <td>-</td>, <td>-</td>, <td>0,00</td>, <td>0,00</td>]

I tried:

import re
import numpy as np

def get_table_entries(element, len_colums):
    # regex for (possibly negative) decimal numbers
    _re_digits = re.compile(r"-?\d+\.?\d+")
    # find all table entries and turn them into a single string
    temp = element.findAll("td")
    temp = str(temp)
    # drop thousands separators, then turn decimal commas into points
    temp = temp.replace('.', '')
    temp = temp.replace(',', '.')
    print(temp)
    # extract the digits from the string
    entries = _re_digits.findall(temp)
    # reshape output array to fit original table shape and return entries
    print(entries)
    entries = np.reshape(entries, (-1, len_colums))
    return entries

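However, this solution also trips over the <td>-</td> cells, which I want to convert to NaN. When I keep the minus sign and replace it via temp = temp.replace('-', 'NaN'), I still get an error in the list comprehension that follows.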

Perhaps the easiest way is to define a helper function:

def to_float(s): 
    if s == "-": 
        return float("nan") 
    else: 
        return float(s.replace(",", ".")) 
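The question's code also strips '.' before converting the comma, which suggests the table may contain German-style thousands separators such as 1.234,56. A possible variant of the helper that handles this is sketched below. The name to_float_de is hypothetical, and the sketch assumes that every '.' in a cell is a thousands separator, so an input already written as 3.76 would be misread as 376.

```python
import math

def to_float_de(s):
    """Hypothetical variant of to_float for German-formatted numbers."""
    s = s.strip()
    if s == "-":
        return float("nan")
    # Assumption: '.' is only ever a thousands separator, ',' the decimal mark.
    return float(s.replace(".", "").replace(",", "."))

print(to_float_de("1.234,56"))       # 1234.56
print(to_float_de("3,76"))           # 3.76
print(math.isnan(to_float_de("-")))  # True
```

Like to_float, it raises ValueError for non-numeric text such as the row labels.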

Then write a basic loop over the cells:

values = []
for elem in soup.find_all("td"): 
    try: 
        values.append(to_float(elem.text)) 
    except ValueError: 
        pass 

The result is now easy to convert into a numpy array of the desired shape:

>>> np.array(values).reshape(-1, 7)
array([[ 3.76,  2.35,  3.3 ,  5.08,   nan,  8.44, 10.  ],
       [ 3.76,  2.35,  3.3 ,  4.99,  6.63,  8.42, 10.  ],
       [ 1.5 ,  1.5 ,  1.6 ,  2.  ,  2.6 ,  3.35,  3.85],
       [  nan,   nan,   nan,   nan,   nan,  0.  ,  0.  ]])
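
If you would rather stay with a regular expression and a list comprehension, as in the question, one option is to match whole <td>...</td> cells instead of bare digits, so that a lone "-" can be captured and mapped to NaN. The following is only a sketch: the html string is a shortened sample in the style of the output above, and cell_re, n_cols and rows are names made up for illustration. It assumes the value cells have no attributes and contain no thousands separators.

```python
import re

# Shortened sample; in the question this would be str(element.findAll("td")).
html = (
    '<td class="font-bold">Dividende je Aktie</td>, <td>1,50</td>, <td>1,50</td>, '
    '<td>1,60</td>, <td class="font-bold">Gesamtdividendenausschüttung in Mio.</td>, '
    '<td>-</td>, <td>0,00</td>, <td>0,00</td>'
)

# Capture the content of plain <td> cells: either a lone "-" or a number with
# an optional decimal comma. Header cells carry a class attribute, so the
# literal "<td>" never matches them.
cell_re = re.compile(r"<td>(-|-?\d+(?:,\d+)?)</td>")

values = [
    float("nan") if cell == "-" else float(cell.replace(",", "."))
    for cell in cell_re.findall(html)
]

# Group the flat list into rows of n_cols entries (3 for this sample).
n_cols = 3
rows = [values[i:i + n_cols] for i in range(0, len(values), n_cols)]
print(rows)  # [[1.5, 1.5, 1.6], [nan, 0.0, 0.0]]
```

Parsing HTML with a regular expression is fragile, so the loop over soup.find_all("td") above is probably the more robust choice; this variant only shows that the "-" cells need to be matched explicitly rather than replaced in the raw string, where a blanket replace would also hit the hyphen in font-bold.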