Extract lines matching "number string number number" and write them to a data frame

My data set looks like this (excerpt):

2.000 Company A 8.876 0,02
248 Enterprise B 26.028 0,07
193
dasdasdasd (asasas) sdasdasd
adsadsd asdasd asasa asassaas asas 
asas asas 31. January 2018 (continue)
asdasd – 99,00% (31. March 2017 – 99,98%) (continue)
amasdasd asas
asasas asas
asas asssssssssss
DDD
asdasdads in %
asdasd adasd asddasad 
(continue)
415 Company C Ltd. 21.412 0,06
668 Enterprise D AG 17.332 0,05
1.240 Company E GmbH 31.394 0,09
798 Enterprise OHG 52.586 0,14

I only want to extract the lines that follow the pattern "number string number number", so that my data ends up looking like this:

Column 1 Column 2 Column 3 Column 4
2.000 Company A 8.876 0,02
248 Enterprise B 26.028 0,07
415 Company C Ltd. 21.412 0,06
668 Enterprise D AG 17.332 0,05
1.240 Company E GmbH 31.394 0,09
798 Enterprise OHG 52.586 0,14

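Any idea how to do this? Specifically, I need help creating a regular expression that filters these lines, and writing the extracted information to a data frame so that I can run some analysis on it.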

I can give you a regular expression for the query you need:

^\d+(?:\.\d+)? .+? \d+(?:\.\d+)? \d+(?:,\d+)?$

How you parse your data and load it into a data frame is a task I will leave to you.

Use it to match "number (int or decimal) string number number" against each line of your data.
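A minimal sketch of applying a pattern of this kind line by line, using only the standard library. It assumes the number columns use "." and "," as in the sample; the names `LINE_RE`, `extract_rows` and the `col1`..`col4` group names are made up for illustration and are not part of the answer above.

```python
import re

# Assumed layout: number, free-text name, number, number (comma decimal).
LINE_RE = re.compile(
    r"(?P<col1>\d+(?:\.\d+)?) "
    r"(?P<col2>.+?) "
    r"(?P<col3>\d+(?:\.\d+)?) "
    r"(?P<col4>\d+(?:,\d+)?)"
)


def extract_rows(text):
    """Return one dict per line that matches LINE_RE in full."""
    rows = []
    for line in text.splitlines():
        # fullmatch rejects lines that only contain the pattern somewhere inside
        match = LINE_RE.fullmatch(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows


sample = """2.000 Company A 8.876 0,02
193
asas asas 31. January 2018 (continue)
415 Company C Ltd. 21.412 0,06"""

rows = extract_rows(sample)
for row in rows:
    print(row)
```

If pandas is available, a list of dicts like `rows` can be passed directly to `pd.DataFrame(rows)`.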

You can try:

import re
from io import StringIO

import pandas as pd

data = """2.000 Company A 8.876 0,02
248 Enterprise B 26.028 0,07
193
dasdasdasd (asasas) sdasdasd
adsadsd asdasd asasa asassaas asas 
asas asas 31. January 2018 (continue)
asdasd – 99,00% (31. March 2017 – 99,98%) (continue)
amasdasd asas
asasas asas
asas asssssssssss
DDD
asdasdads in %
asdasd adasd asddasad 
(continue)
415 Company C Ltd. 21.412 0,06
668 Enterprise D AG 17.332 0,05
1.240 Company E GmbH 31.394 0,09
798 Enterprise OHG 52.586 0,14"""

reader = StringIO(data)
# number, name (any run of non-digits), number, number at the end of the line
pattern = re.compile(r'([\d\.\,]+)\s+(\D*)([\d\.\,]+)\s([\d\.\,]+)$')
rows = []
for row in reader:
    match = pattern.search(row)
    if match:
        # strip the trailing space that the name group captures
        rows.append([match.group(1), match.group(2).strip(), match.group(3), match.group(4)])
df = pd.DataFrame(rows, columns=["Column 1", "Column 2", "Column 3", "Column 4"])
print(df)

Output

  Column 1         Column 2 Column 3 Column 4
0    2.000        Company A    8.876     0,02
1      248     Enterprise B   26.028     0,07
2      415   Company C Ltd.   21.412     0,06
3      668  Enterprise D AG   17.332     0,05
4    1.240   Company E GmbH   31.394     0,09
5      798   Enterprise OHG   52.586     0,14
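The extracted values are still strings, so they need converting before any numeric analysis. A small sketch of one way to do that, under the assumption (not confirmed by the question) that "." is a thousands separator and "," is the decimal mark; `parse_german_number` is a hypothetical helper name.

```python
def parse_german_number(text):
    """Convert '1.240' to 1240 and '0,02' to 0.02 (assumed conventions)."""
    # Drop thousands separators, then turn the decimal comma into a dot.
    cleaned = text.replace(".", "").replace(",", ".")
    return float(cleaned) if "." in cleaned else int(cleaned)


rows = [
    ["2.000", "Company A", "8.876", "0,02"],
    ["248", "Enterprise B", "26.028", "0,07"],
]

converted = [
    [parse_german_number(a), name, parse_german_number(c), parse_german_number(d)]
    for a, name, c, d in rows
]
print(converted)
```

With a data frame, the same function could be applied per column, for example via `df["Column 1"].map(parse_german_number)`.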

This will do what you need:

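import re

# yourstring holds the raw multi-line text; each match is one full "number string number number" line
pattern = r'[-+]?[0-9]*\.?[0-9]+ [a-zA-Z]*\.? [a-zA-Z]*\.?[a-zA-Z]*\.?.+ [-+]?[0-9]*\.?[0-9]+ [-+]?[0-9]*\,?[0-9]+'
out = re.findall(pattern, yourstring)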
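Because this pattern has no capture groups, `re.findall` returns each match as one whole string, not as four separate columns. A rough sketch of splitting such strings into fields, assuming the first field and the last two fields never contain spaces; `split_match` and the sample `matches` list are illustrative only.

```python
def split_match(line):
    """Split 'number name number number' into its four fields."""
    # The first field ends at the first space ...
    first, rest = line.split(" ", 1)
    # ... and the last two fields are cut off from the right,
    # so the name in the middle may itself contain spaces.
    name, third, fourth = rest.rsplit(" ", 2)
    return [first, name, third, fourth]


matches = [
    "2.000 Company A 8.876 0,02",
    "415 Company C Ltd. 21.412 0,06",
]

table = [split_match(m) for m in matches]
for fields in table:
    print(fields)
```

A list of lists like `table` can then be handed to `pd.DataFrame` together with a list of column names.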