Categorize a string column based on the words it contains, using pre-defined categories from another dataframe
I have a pandas column containing email domains, as shown below:
Sno Domain_IDs
1 herowire.com
2 xyzenerergy.com
3 financial.com
4 oo-loans.com
5 okwire.com
6 cleaneneregy.com
7 pop-advisors.com
and so on....
I have the following categories in a separate dataframe:
Sno category
1 contains wire
2 contains energy
3 contains loans
4 contains advisors
I want to create a dataframe that classifies the data as follows:
Sno Domain_IDs category
1 herowire.com contains wire
2 xyzenerergy.com contains energy
3 financial.com others
4 oo-loans.com contains loans
5 okwire.com contains wire
6 cleaneneregy.com contains energy
7 pop-advisors.com contains advisors
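I tried a lambda function and a standard loop with "if else" statements, using the clause
"emailAddress.str.contains('wire')"
but I get the following error:
AttributeError: 'str' object has no attribute 'str'
Somehow I am unable to parse the text of a single row in the dataframe. Please help.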
lst = ["wire", "energy", "loans", "advisors"]

def fun(a):
    # return the first keyword that occurs in the domain string
    for i in lst:
        if i in a:
            return i
    return "others"

df["category"] = df["Domain_IDs"].apply(fun)
df
Sno Domain_IDs category
0 1 herowire.com wire
1 2 xyzenenergy.com energy
2 3 financial.com others
3 4 oo-loans.com loans
4 5 okwire.com wire
5 6 cleanenergy.com energy
6 7 pop-advisors.com advisors
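The error in the question arises because each element passed to a lambda or loop is already a plain Python `str`, while the `.str` accessor exists only on a pandas Series. The following is a minimal sketch of the same loop idea that returns the full labels from the category dataframe instead of bare keywords. The frame names `df` and `cat`, the helper `categorize`, and the shortened sample data are assumptions made for illustration.

```python
import pandas as pd

# Assumed sample data (hypothetical subset, spelled so the keywords actually occur).
df = pd.DataFrame({
    "Sno": [1, 2, 3, 4],
    "Domain_IDs": ["herowire.com", "xyzenergy.com", "financial.com", "oo-loans.com"],
})
cat = pd.DataFrame({
    "Sno": [1, 2, 3, 4],
    "category": ["contains wire", "contains energy", "contains loans", "contains advisors"],
})

# Map each keyword to its full label, e.g. "wire" -> "contains wire".
labels = {label.split()[-1]: label for label in cat["category"]}

def categorize(domain):
    """Return the label of the first keyword found in the domain, else 'others'."""
    for keyword, label in labels.items():
        if keyword in domain:  # plain substring test on a str, no .str accessor
            return label
    return "others"

df["category"] = df["Domain_IDs"].apply(categorize)
print(df)
```

If a domain contains more than one keyword, this sketch keeps only the first match in the order the categories are listed.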
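Find the pattern in the domain, extract it, and build the category from it:
# build an alternation such as (wire|energy|loans|advisors) from the last word of each category;
# assumes the category frame is named cat and has a 'category' column, as in the question
pat = '(' + '|'.join(cat['category'].str.split().str[-1]) + ')'
# expand=False returns a Series; domains without a match are NaN and become 'other'
df['category'] = ('contains ' + df['Domain_IDs'].str.extract(pat, expand=False)).fillna('other')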
Sno Domain_IDs category
0 1 herowire.com contains wire
1 2 xyzenenergy.com contains energy
2 3 financial.com other
3 4 oo-loans.com contains loans
4 5 okwire.com contains wire
5 6 cleaneneregy.com other
6 7 pop-advisors.com contains advisors
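For reference, here is a self-contained sketch of the regex approach. It assumes the category frame is called `cat` with a `category` column, and it adds `re.escape` as a precaution in case a keyword ever contains regex metacharacters; neither detail comes from the answer above, and the sample data is shortened.

```python
import re
import pandas as pd

# Assumed sample data (hypothetical subset of the question's frames).
df = pd.DataFrame({"Domain_IDs": ["herowire.com", "financial.com", "pop-advisors.com"]})
cat = pd.DataFrame({
    "category": ["contains wire", "contains energy", "contains loans", "contains advisors"],
})

# Last word of each category is the keyword to search for.
keywords = cat["category"].str.split().str[-1]

# One capture group with all keywords as alternatives.
pat = "(" + "|".join(re.escape(k) for k in keywords) + ")"

# expand=False yields a Series holding the first matched keyword, or NaN.
matched = df["Domain_IDs"].str.extract(pat, expand=False)

# NaN propagates through the string concatenation and is then replaced.
df["category"] = ("contains " + matched).fillna("other")
print(df)
```

As in the output above, a domain whose spelling does not literally contain a keyword (for example "cleaneneregy.com") falls through to "other".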
This solution allows a domain to fall into multiple categories:
import pandas as pd

categories = pd.DataFrame({"category": ["wire", "energy", "loans", "advisors"]})
domains = pd.DataFrame({"Sno": list(range(1, 10)),
                        "Domain_IDs": [
                            "herowire.com",
                            "xyzenergy.com",
                            "financial.com",
                            "oo-loans.com",
                            "okwire.com",
                            "cleanenergy.com",
                            "pop-advisors.com",
                            "energy-advisors.com",
                            "wire-loans.com"]})
# a constant key in both frames makes the merge pair every category with every domain
categories["common"] = 0
domains["common"] = 0
possibilities = pd.merge(categories, domains, how="outer")
# True where the category keyword occurs in the domain
possibilities["satisfied"] = possibilities.apply(lambda row: row["category"] in row["Domain_IDs"], axis=1)
Then keep only the rows where the category is satisfied:
possibilities[possibilities["satisfied"]]
This gives:
category common Domain_IDs Sno satisfied
0 wire 0 herowire.com 1 True
4 wire 0 okwire.com 5 True
8 wire 0 wire-loans.com 9 True
10 energy 0 xyzenergy.com 2 True
14 energy 0 cleanenergy.com 6 True
16 energy 0 energy-advisors.com 8 True
21 loans 0 oo-loans.com 4 True
26 loans 0 wire-loans.com 9 True
33 advisors 0 pop-advisors.com 7 True
34 advisors 0 energy-advisors.com 8 True
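The filtered result has one row per (domain, category) pair. If a single row per domain is preferred, the matches can be joined back onto the domain frame. The sketch below is one possible way to do that, not part of the answer above: it uses `how="cross"` (available in pandas 1.2 and later) in place of the constant `common` key, and the names `pairs`, `matched`, and `joined` as well as the shortened sample data are illustrative assumptions.

```python
import pandas as pd

categories = pd.DataFrame({"category": ["wire", "energy", "loans", "advisors"]})
# Assumed sample data (hypothetical subset of the domains above).
domains = pd.DataFrame({
    "Sno": [1, 2, 3],
    "Domain_IDs": ["herowire.com", "financial.com", "wire-loans.com"],
})

# Cross join: every domain paired with every category.
pairs = domains.merge(categories, how="cross")

# Keep the pairs where the keyword occurs in the domain.
mask = [c in d for c, d in zip(pairs["category"], pairs["Domain_IDs"])]
matched = pairs[mask]

# Collapse to one comma-separated string of categories per domain.
joined = matched.groupby("Sno")["category"].agg(", ".join)

# Domains with no match are absent from `joined` and become "others".
domains["category"] = domains["Sno"].map(joined).fillna("others")
print(domains)
```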