Create a lambda function to apply to selected df columns

I have the following df:

id  header1      header2           diabetes  obesity  hypertension/high blood pressure  ...
 1  metabolism   diabetes          no        no       no
 2  heart issue  heart disease     None      None     None
 3  obesity      diabetes          yes       no       no
 4  metabolism   had hypertension  no        no       yes
 5  heart issue  heart disease     no        no       yes
 6  obesity      diabetes          yes       yes      no
 7  obesity      diabetes          no        no       yes

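I want to create a lambda function that iterates through header1 and header2 and checks whether either cell has a substring match with a column name. Depending on whether that column holds yes, no, or null, it should return a column with a flag value.

For each cell in header1 or header2: if it has a substring match with a column name and that column contains yes, flag the new column as 2. If any category column contains yes but does not match a keyword in header1 or header2, enter 1. Otherwise, leave it blank.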

Example)

Attempt: cols = [x for x in df.columns if x not in ['header1', 'header2']]

df['flag'] = df.apply(lambda x: 2 if df['header1'] or df['header2'] in cols and cols == yes, 1 elif df['header1'] not in df['header2'] in cols and cols == yes, None else

Desired result:

id  header1      header2           diabetes  obesity  hypertension/high blood pressure | flag
 1  metabolism   diabetes          no        no       no                                 None
 2  heart issue  heart disease     None      None     None                               None
 3  obesity      diabetes          yes       no       no                                 2
 4  metabolism   had hypertension  no        no       yes                                2
 5  heart issue  heart disease     no        no       yes                                1
 6  obesity      diabetes          yes       yes      no                                 2
 7  obesity      diabetes          no        no       yes                                1

Constructor

Note that my actual df has a dynamic number of yes/no columns, but only two header columns.

import numpy as np
import pandas as pd

# Each tuple is one row: header1, header2, then the yes/no disease columns
data = np.array([('metabolism','diabetes','no','no', 'no'),
                 ('heart issue', 'heart disease', None,None,None),
                 ('obesity','diabetes','yes','no','no'),
                 ('metabolism',' had hypertension','no','no','yes'),
                 ('heart issue', 'heart disease','no','no','yes'),
                 ('obesity', 'diabetes','yes','yes', 'no'),
                 ('obesity', 'diabetes', 'no','no', 'yes')])

df = pd.DataFrame(data, columns=['header1', 'header2','diabetes','obesity','hypertension/high blood pressure'])

# Every column other than the two headers is a yes/no disease column
cols = [x for x in df.columns if x not in ['header1', 'header2']]

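First create an index of the disease columns and a series of disease names (the latter is used to catch "hypertension").

Then simply apply a function that first counts the "yes" answers and then searches the headers for the names of the diseases answered "yes".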
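To make the first step concrete, here is a small sketch of what the two helper objects hold for the example columns. It uses only the column labels from the question. Note that `Index.difference` returns the remaining labels in sorted order, so the columns and the derived names stay aligned position by position.

```python
import pandas as pd

# Column labels taken from the question's example frame
columns = pd.Index(['header1', 'header2', 'diabetes', 'obesity',
                    'hypertension/high blood pressure'])
headers = ['header1', 'header2']

# Everything that is not a header column (returned sorted)
disease_cols = columns.difference(headers)

# Keep only the text before the first '/' as the search keyword
disease_names = disease_cols.str.split('/').str[0]

print(list(disease_cols))
print(list(disease_names))
```

The second list is what lets a header such as "had hypertension" match the column "hypertension/high blood pressure".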

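headers = ['header1', 'header2']
# Index.difference returns the remaining labels in sorted order
disease_cols = df.columns.difference(headers)
# Keep the part before '/' so 'hypertension/high blood pressure' yields 'hypertension'
disease_names = disease_cols.str.split('/').str[0]

def get_flag(row):
    # Boolean mask over the disease columns; None compares as False
    yes = row[disease_cols].eq('yes')
    if yes.any():
        # 2 if either header mentions a disease answered 'yes', otherwise 1
        pattern = '|'.join(disease_names[yes.to_numpy()])
        return 2 if row[headers].str.contains(pattern).any() else 1
    return np.nan


df['flag'] = df.apply(get_flag, axis=1)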

Output:

       header1        header2 diabetes obesity hypertension/high blood pressure   flag
0   metabolism       diabetes       no      no                       no           NaN
1  heart issue  heart disease       no      no                       no           NaN
2      obesity       diabetes      yes      no                       no           2.0
3   metabolism   hypertension       no      no                      yes           2.0
4  heart issue  heart disease       no      no                      yes           1.0
5      obesity       diabetes      yes     yes                       no           2.0
6      obesity       diabetes       no      no                      yes           1.0
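If the row-wise `apply` becomes slow on a larger frame, the same rule can probably be expressed column-wise. The following is only a sketch under a few assumptions: the keyword for each disease column is the text before the first `/`, matching is a plain (non-regex) substring test, and missing headers count as no match. The names `mentioned`, `matched`, and `any_yes` are made up for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ("metabolism", "diabetes", "no", "no", "no"),
        ("heart issue", "heart disease", None, None, None),
        ("obesity", "diabetes", "yes", "no", "no"),
        ("metabolism", " had hypertension", "no", "no", "yes"),
        ("heart issue", "heart disease", "no", "no", "yes"),
        ("obesity", "diabetes", "yes", "yes", "no"),
        ("obesity", "diabetes", "no", "no", "yes"),
    ],
    columns=["header1", "header2", "diabetes", "obesity",
             "hypertension/high blood pressure"],
)

headers = ["header1", "header2"]
disease_cols = [c for c in df.columns if c not in headers]

# True where a disease column was answered 'yes' (None compares as False)
yes = df[disease_cols].eq("yes")

# For each disease column: True where either header contains its keyword
mentioned = pd.DataFrame(
    {
        col: df[headers]
        .apply(lambda s: s.str.contains(col.split("/")[0], regex=False, na=False))
        .any(axis=1)
        for col in disease_cols
    }
)

matched = (yes & mentioned).any(axis=1)  # a 'yes' disease is named in a header
any_yes = yes.any(axis=1)                # at least one 'yes' anywhere in the row

# The first true condition wins: 2 for a keyword match, 1 for any other 'yes'
df["flag"] = np.select([matched, any_yes], [2.0, 1.0], default=np.nan)
print(df["flag"].tolist())
```

On the question's seven rows this should give the same flags as `get_flag`. It loops over the disease columns rather than over the rows, which tends to help when there are many rows and few columns.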