Create lambda function to apply to select df columns
I have the following df:
id  header1      header2           diabetes  obesity  hypertension/high blood pressure  ...
1   metabolism   diabetes          no        no       no
2   heart issue  heart disease     None      None     None
3   obesity      diabetes          yes       no       no
4   metabolism   had hypertension  no        no       yes
5   heart issue  heart disease     no        no       yes
6   obesity      diabetes          yes       yes      no
7   obesity      diabetes          no        no       yes
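I would like to create a lambda function that iterates through header1 and header2 and checks whether either cell has a substring match with one of the column names. Depending on whether that column holds yes, no, or null, it should return a column with a flag value.
For each cell in header1 or header2: if it has a substring match with a column name and that column contains a yes, flag the new column as 2. If any of the category columns contains a yes, but not one that matches a keyword from header1 or header2, enter 1. Otherwise, leave it blank!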
Attempt:
cols = [x for x in df.columns if x not in ['header1', 'header2']]
df['flag'] = df.apply(lambda x: 2 if df['header1'] or df['header2'] in cols and cols == yes, 1 elif df['header1'] not in df['header2'] in cols and cols == yes, None else
Desired result:
id  header1      header2           diabetes  obesity  hypertension/high blood pressure | flag
1   metabolism   diabetes          no        no       no                               | None
2   heart issue  heart disease     None      None     None                             | None
3   obesity      diabetes          yes       no       no                               | 2
4   metabolism   had hypertension  no        no       yes                              | 2
5   heart issue  heart disease     no        no       yes                              | 1
6   obesity      diabetes          yes       yes      no                               | 2
7   obesity      diabetes          no        no       yes                              | 1
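To make the rule concrete, here is a minimal plain-Python sketch of the decision for a single row, independent of pandas. The function name `flag_for_row` and its argument layout are hypothetical. It also assumes that every "/"-separated part of a column name counts as a keyword and that a match means the keyword occurs inside a header cell, which is one reading of the desired result above.

```python
def flag_for_row(header_cells, answers):
    """Return 2, 1, or None for one row.

    header_cells: the header1/header2 strings (None allowed).
    answers: dict mapping column name -> 'yes', 'no', or None.
    """
    yes_columns = [name for name, answer in answers.items() if answer == "yes"]
    if not yes_columns:
        return None  # no 'yes' anywhere: leave the flag blank

    # Assumption: each '/'-separated part of a column name is a keyword.
    keywords = [part.strip() for name in yes_columns for part in name.split("/")]
    matched = any(
        keyword in cell
        for cell in header_cells if cell
        for keyword in keywords
    )
    return 2 if matched else 1


HYPERTENSION = "hypertension/high blood pressure"

print(flag_for_row(["obesity", "diabetes"],
                   {"diabetes": "yes", "obesity": "no", HYPERTENSION: "no"}))   # 2
print(flag_for_row(["heart issue", "heart disease"],
                   {"diabetes": "no", "obesity": "no", HYPERTENSION: "yes"}))   # 1
print(flag_for_row(["metabolism", "diabetes"],
                   {"diabetes": "no", "obesity": "no", HYPERTENSION: "no"}))    # None
```

Because the answers are passed as a dict, the same function works for any number of yes/no columns.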
Constructor
Note that my actual df has a dynamic number of yes/no columns, but only two header columns.
import numpy as np
import pandas as pd

data = np.array([('metabolism', 'diabetes', 'no', 'no', 'no'),
                 ('heart issue', 'heart disease', None, None, None),
                 ('obesity', 'diabetes', 'yes', 'no', 'no'),
                 ('metabolism', ' had hypertension', 'no', 'no', 'yes'),
                 ('heart issue', 'heart disease', 'no', 'no', 'yes'),
                 ('obesity', 'diabetes', 'yes', 'yes', 'no'),
                 ('obesity', 'diabetes', 'no', 'no', 'yes')])
df = pd.DataFrame(data, columns=['header1', 'header2', 'diabetes', 'obesity', 'hypertension/high blood pressure'])
# Every column that is not a header column is a yes/no column
cols = [x for x in df.columns if x not in ['header1', 'header2']]
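First create an Index of the disease columns and an Index of the disease names (the latter is needed to pick "hypertension" out of "hypertension/high blood pressure").
Then simply apply a function that first counts the "yes" answers and then searches the header cells for the names of the diseases that were answered "yes":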
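headers = ['header1', 'header2']
# All non-header columns; note that Index.difference returns them sorted
disease_cols = df.columns.difference(headers)
# Keep only the part before '/', e.g. 'hypertension'
disease_names = disease_cols.str.split('/').str[0]

def get_flag(row):
    # Boolean mask over the disease columns, True where the answer is 'yes'
    yes = row[disease_cols].eq('yes')
    if yes.any():
        # 2 if a header cell contains the name of any 'yes' disease, else 1
        return 2 if row[headers].str.contains('|'.join(disease_names[yes])).any() else 1
    else:
        return np.nan

df['flag'] = df.apply(get_flag, axis=1)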
Output:
   header1      header2        diabetes  obesity  hypertension/high blood pressure  flag
0  metabolism   diabetes       no        no       no                                NaN
1  heart issue  heart disease  no        no       no                                NaN
2  obesity      diabetes       yes       no       no                                2.0
3  metabolism   hypertension   no        no       yes                               2.0
4  heart issue  heart disease  no        no       yes                               1.0
5  obesity      diabetes       yes       yes      no                                2.0
6  obesity      diabetes       no        no       yes                               1.0
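The flag column above comes out as float (2.0, 1.0) because NaN forces a float dtype, and the disease names are joined into a regular expression without escaping. As a hedged variation on the same approach, not a replacement for it, the sketch below escapes the names with `re.escape` and stores the result in pandas' nullable `Int64` dtype so the flags stay integers. The `keywords` dict is a helper introduced here for illustration, and the DataFrame is built from a plain list of tuples rather than `np.array`.

```python
import re

import pandas as pd

rows = [
    ("metabolism", "diabetes", "no", "no", "no"),
    ("heart issue", "heart disease", None, None, None),
    ("obesity", "diabetes", "yes", "no", "no"),
    ("metabolism", " had hypertension", "no", "no", "yes"),
    ("heart issue", "heart disease", "no", "no", "yes"),
    ("obesity", "diabetes", "yes", "yes", "no"),
    ("obesity", "diabetes", "no", "no", "yes"),
]
columns = ["header1", "header2", "diabetes", "obesity",
           "hypertension/high blood pressure"]
df = pd.DataFrame(rows, columns=columns)

headers = ["header1", "header2"]
disease_cols = [col for col in df.columns if col not in headers]
# Illustrative helper: the keyword for a column is the part before '/'.
keywords = {col: col.split("/")[0] for col in disease_cols}


def get_flag(row):
    # Columns answered 'yes' in this row (None never equals 'yes').
    yes_cols = [col for col in disease_cols if row[col] == "yes"]
    if not yes_cols:
        return pd.NA
    # Escape each keyword so regex metacharacters in a name match literally.
    pattern = "|".join(re.escape(keywords[col]) for col in yes_cols)
    cells = [str(row[h]) for h in headers if pd.notna(row[h])]
    return 2 if any(re.search(pattern, cell) for cell in cells) else 1


# Int64 (capital I) is the nullable integer dtype; missing flags become <NA>.
df["flag"] = df.apply(get_flag, axis=1).astype("Int64")
print(df)
```

Deriving `disease_cols` by exclusion keeps the function independent of how many yes/no columns the real frame has.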