Extract regex in pandas column
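Hello, I want to add the number of pieces of the different products in a Df column to a new column. In the data, the number comes right before the product type.
The data looks like this: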
PRODUCTS
PULSAR AT 20 MG ORAL 30 TAB RECUB
LIPITOR 40 MG 1+1 ORAL 15 TAB
LOFTYL 150 MG ORAL 30 TAB
SOMAZINA 500 MG ORAL 10 COMP RECUB
LOFTYL 30 TAB 150 MG ORAL
*Keeps going more entries...*
My current attempt looks like this:
df['PZ'] = df['PRODUCTS'].str.extract('([\d]*\.*[\d]+)\s*[tab|cap|grag|past|sob]',flags=re.IGNORECASE)
The product type can be any of [TAB, COMP, AMP, SOB, PAST, GRAG ... and others]
I would like to get something like this:
PRODUCTS PZ
PULSAR AT 20 MG ORAL 30 TAB RECUB 30
LIPITOR 40 MG 1+1 ORAL 15 TAB 15
LOFTYL 150 MG ORAL 30 TAB 30
SOMAZINA 500 MG ORAL 10 COMP RECUB 10
LOFTYL 30 TAB 150 MG ORAL 30
What can I change in my line to get the output shown above?
Thanks for reading and helping.
You can use
import pandas as pd
df = pd.DataFrame({'PRODUCTS':['PULSAR AT 20 MG ORAL 30 TAB RECUB','LIPITOR 40 MG 1+1 ORAL 15 TAB','LOFTYL 150 MG ORAL 30 TAB','SOMAZINA 500 MG ORAL 10 COMP RECUB','LOFTYL 30 TAB 150 MG ORAL']})
rx = r'(?i)(\d*\.?\d+)\s*(?:tab|cap|grag|past|sob|comp)'
df['PZ'] = df['PRODUCTS'].str.extract(rx)
>>> df
PRODUCTS PZ
0 PULSAR AT 20 MG ORAL 30 TAB RECUB 30
1 LIPITOR 40 MG 1+1 ORAL 15 TAB 15
2 LOFTYL 150 MG ORAL 30 TAB 30
3 SOMAZINA 500 MG ORAL 10 COMP RECUB 10
4 LOFTYL 30 TAB 150 MG ORAL 30
>>>
If tab, cap, etc. must be whole words and cannot be part of a longer word, add a word boundary at the end of the pattern, i.e. rx = r'(?i)(\d*\.?\d+)\s*(?:tab|cap|grag|past|sob|comp)\b'.
See the regex demo. Details:
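(?i) - a case-insensitive inline modifier
(\d*\.?\d+) - Group 1: zero or more digits, an optional ., and then one or more digits
\s* - zero or more whitespace characters
(?:tab|cap|grag|past|sob|comp) - a non-capturing group (so that it does not interfere with the Series.str.extract output) that matches any one of the alternative substrings
\b - a word boundary.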
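To see why the pattern in the question misbehaves, here is a small sketch using only the re module. The square brackets in [tab|cap|grag|past|sob] form a character class, so they match a single character from that set rather than one of the words. The second test string ("KIT 5 CAPSULAS 20 TAB") is made up for illustration and is not from the question's data.

```python
import re

text = "LIPITOR 40 MG 1+1 ORAL 15 TAB"

# Pattern from the question: [...] is a character class, so it matches ONE
# character out of {t, a, b, |, c, p, g, r, s, o}, not a whole word.
char_class = re.compile(r'([\d]*\.*[\d]+)\s*[tab|cap|grag|past|sob]', re.IGNORECASE)

# Pattern from the answer: (?:...) is an alternation of whole words,
# and \b requires the word to end right there.
alternation = re.compile(r'(?i)(\d*\.?\d+)\s*(?:tab|cap|grag|past|sob|comp)\b')

# The character class is satisfied by the "O" of "ORAL", so the "1" in "1+1" is captured.
wrong = char_class.search(text).group(1)
right = alternation.search(text).group(1)
print(wrong, right)

# Effect of the word boundary on a hypothetical string where "CAP" starts a longer word.
sample = "KIT 5 CAPSULAS 20 TAB"
no_boundary = re.compile(r'(?i)(\d*\.?\d+)\s*(?:tab|cap|grag|past|sob|comp)')
print(no_boundary.search(sample).group(1), alternation.search(sample).group(1))
```

The character-class version happens to return the right number for some rows (for example when the count is followed by TAB), which is why the problem can be easy to miss.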
Maybe..
Given a dataframe (note: I made a product type appear twice in one row as an example, in case that happens)...
PRODUCTS
0 PULSAR AT 20 MG ORAL 30 GRAG RECUB
1 LIPITOR 40 MG 1+1 ORAL 15 TAB
2 LOFTYL 150 GRAG ORAL 30 TAB
3 SOMAZINA 500 MG ORAL 10 COMP RECUB
4 LOFTYL 30 TAB 150 MG ORAL
5 *Keeps going more entries...*
Code:
import pandas as pd
import re

data = {'PRODUCTS': ["PULSAR AT 20 MG ORAL 30 GRAG RECUB", "LIPITOR 40 MG 1+1 ORAL 15 TAB",
                     "LOFTYL 150 GRAG ORAL 30 TAB", "SOMAZINA 500 MG ORAL 10 COMP RECUB",
                     "LOFTYL 30 TAB 150 MG ORAL", "*Keeps going more entries...*"]}
df = pd.DataFrame(data)

# maintain a list of products to find
products = ['TAB', 'COMP', 'AMP', 'SOB', 'PAST', 'GRAG']

def getProduct(x):
    # collect every number that is directly followed by one of the product types
    found = list()
    for product in products:
        pattern = r'(\d+)' + ' ' + str(product)
        found.append(re.findall(pattern, x))
    found = list(filter(None, found))  # drop product types with no match
    found = [item for sublist in found for item in sublist]  # flatten the nested lists
    found = ", ".join(str(item) for item in found)
    return found

df['PZ'] = [getProduct(row) for row in df['PRODUCTS']]
print(df)
Output:
PRODUCTS PZ
0 PULSAR AT 20 MG ORAL 30 GRAG RECUB 30
1 LIPITOR 40 MG 1+1 ORAL 15 TAB 15
2 LOFTYL 150 GRAG ORAL 30 TAB 30, 150
3 SOMAZINA 500 MG ORAL 10 COMP RECUB 10
4 LOFTYL 30 TAB 150 MG ORAL 30
5 *Keeps going more entries...*
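As a possible variation on the loop above, the product list can be joined into a single alternation and handed to Series.str.extractall, which returns every match per row. This is only a sketch under a few assumptions: the three-row frame (including the 'NO COUNT HERE' row) is made up for illustration, and rows without a match are left as NaN instead of an empty string.

```python
import re
import pandas as pd

# Small illustrative frame; the last row is a made-up entry with no count.
df = pd.DataFrame({'PRODUCTS': [
    'PULSAR AT 20 MG ORAL 30 GRAG RECUB',
    'LOFTYL 150 GRAG ORAL 30 TAB',
    'NO COUNT HERE',
]})

products = ['TAB', 'COMP', 'AMP', 'SOB', 'PAST', 'GRAG']

# Build one case-insensitive alternation from the list; re.escape guards
# against product names that contain regex metacharacters.
rx = r'(?i)(\d+)\s*(?:' + '|'.join(map(re.escape, products)) + r')\b'

# extractall yields one row per match, indexed by (original row, match number);
# column 0 holds the single capture group.
matches = df['PRODUCTS'].str.extractall(rx)[0]

# Join the matches of each original row; reindex so unmatched rows become NaN.
df['PZ'] = matches.groupby(level=0).agg(', '.join).reindex(df.index)
print(df)
```

Note that the matches come back in the order they appear in the string, so the second row gives "150, 30" here, whereas the loop above orders them by the products list and gives "30, 150".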