Python, pandas: How to split a string by finding a specific word rather than "," or "_", etc.
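I'm having a hard time extracting an ID number from a string.
I can get it by index, but that fails for other rows of the dataframe.
How can I extract campaignid=351154190 in a way that works for all rows?
The only consistent pattern is the word campaignid, and its value needs to be extracted and stored in a new column of the dataframe. Performance doesn't matter for this task.
Original string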
https:_utm_source=googlebrand&utm_medium=ppc&utm_campaign=brand&utm_campaignid=351154190&keyword=aihdisadjiajdutm_matchtype=e&device=m&utm_network=g&utm_adposition=1t1&geo=9027258&gclid=CjwKCsadjjsaopdl[psdklksfdosjfidj9FOk033DKW1xoCXlwQAvD_BwE&affiliate_id=asdaskdosjadiasjdisaj-asdhasuigdyusagdyusagyk033DKW1xoCXlwQAvD_BwE&utm_content=search&utm_contentid=1251489456158180&placement&extension
Splitting the string
x = cw.captureurl.str.split('&').str[:-1]
Printing one row
print(x[25])
['https:_utm_source=googlebrand', 'utm_medium=ppc', 'utm_campaign=brand',
'utm_campaignid=35119190', 'keyword=co',
'utm_matchtype=e', 'device=m', 'utm_network=g', 'utm_adposition=1t1',
'geo=9027258', 'gclid=CjwKCAjwnMTqBRAzEiwAEF3ndo3-
CNOsp1VT5OIxm0BuUcSWQEwtJSR5KLiJzrvjjc9FOk033DKW1xoCXlwQAvD_BwE',
'affiliate_id=CjwKCAjwnMTqBRAzEiwAEF3ndo3-
CNOsp1VT5OIxm0BuUcSWQEwtJSR5KLiJzrvjjc9FOk033DKW1xoCXlwQAvD_BwE',
'utm_content=search', 'utm_contentid=1211732930', 'placement']
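It would be great if I could use something that searches for the word "campaignid" (which is my target) and then stores its value in another column of a dataframe.
I tried splitting again and again, but that didn't work.
I also tried a for loop with an if statement, which didn't work either.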
Use a regular expression:
campaign_id = cw['captureurl'].str.extract(r'campaignid=(\d+)')[0]
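To see how the regex approach behaves on rows that lack the parameter, here is a minimal sketch. The two sample URLs are shortened, made-up variants of the one in the question, and the new column name `campaign_id` is only an assumption for illustration.

```python
import pandas as pd

# Hypothetical sample data; the column name `captureurl` follows the question.
cw = pd.DataFrame({
    "captureurl": [
        "https:_utm_source=googlebrand&utm_campaign=brand&utm_campaignid=351154190&keyword=co",
        "https:_utm_source=googlebrand&utm_medium=ppc&placement",  # no campaignid here
    ]
})

# A raw string keeps \d intact for the regex engine.
# expand=False returns a Series because the pattern has a single capture group.
cw["campaign_id"] = cw["captureurl"].str.extract(r"campaignid=(\d+)", expand=False)

print(cw["campaign_id"].tolist())
```

Rows where the pattern is not found end up as NaN instead of raising an error, so the same line should work across the whole column. The extracted values are strings; convert them afterwards if a numeric column is needed.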
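I'd suggest using urllib. In particular, the parse_qs function returns a dictionary that maps each query-string parameter to a list of its values. https://docs.python.org/3/library/urllib.parse.html
Using your example URL we get: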
from urllib.parse import parse_qs
test = 'https:_utm_source=googlebrand&utm_medium=ppc&utm_campaign=brand&utm_campaignid=351154190&keyword=aihdisadjiajdutm_matchtype=e&device=m&utm_network=g&utm_adposition=1t1&geo=9027258&gclid=CjwKCsadjjsaopdl[psdklksfdosjfidj9FOk033DKW1xoCXlwQAvD_BwE&affiliate_id=asdaskdosjadiasjdisaj-asdhasuigdyusagdyusagyk033DKW1xoCXlwQAvD_BwE&utm_content=search&utm_contentid=1251489456158180&placement&extension'
print(parse_qs(test))
{'https:_utm_source': ['googlebrand'],
'utm_medium': ['ppc'],
'utm_campaign': ['brand'],
'utm_campaignid': ['351154190'],
'keyword': ['aihdisadjiajdutm_matchtype=e'],
'device': ['m'],
'utm_network': ['g'],
'utm_adposition': ['1t1'],
'geo': ['9027258'],
'gclid': ['CjwKCsadjjsaopdl[psdklksfdosjfidj9FOk033DKW1xoCXlwQAvD_BwE'],
'affiliate_id': ['asdaskdosjadiasjdisaj-asdhasuigdyusagdyusagyk033DKW1xoCXlwQAvD_BwE'],
'utm_content': ['search'],
'utm_contentid': ['1251489456158180']}
To get the campaign ID for the whole dataframe, we can use .apply:
# After parsing each url's arguments, we extract the first campaignid from the dictionary's list.
df['utm_campaignid'] = df['url'].apply(lambda x: parse_qs(x)['utm_campaignid'][0])
df.head()
url utm_campaignid
0 https:_utm_source=googlebrand&utm_medium=ppc&u... 351154190
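If some rows might not contain utm_campaignid at all, indexing the parsed dictionary directly would raise a KeyError. A possible variant, sketched here with a hypothetical helper named `get_campaign_id` and made-up sample URLs, looks the key up with `.get` and returns None when it is missing:

```python
from urllib.parse import parse_qs

import pandas as pd


def get_campaign_id(url):
    """Return the first utm_campaignid value, or None if the key is absent."""
    values = parse_qs(url).get("utm_campaignid")
    return values[0] if values else None


# Hypothetical sample data; the column name `url` follows the answer above.
df = pd.DataFrame({
    "url": [
        "https:_utm_source=googlebrand&utm_campaignid=351154190&device=m",
        "https:_utm_source=googlebrand&device=m",  # no utm_campaignid
    ]
})

df["utm_campaignid"] = df["url"].apply(get_campaign_id)
print(df["utm_campaignid"].tolist())
```

Note that this only finds the key when it is spelled exactly utm_campaignid; if the parameter name varies between rows, the regex approach is probably the more forgiving choice.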