Pandas: extract the reference name, middle name and surname into another column

Question

Using Jupyter with pandas, I need to extract into another column the reference that appears after a colon, for example:

nameis: joe doe, the student is....
nameis: patric test, this question is...
nameis: franck joe and he is.....
nameis: lucash de brown and the academic achievement......

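The problem gets complicated for me precisely at the point where I have to extract what follows the marker: the first name and surname, which unfortunately are then followed by arbitrary text. The only fixed reference in this case is nameis:, which recurs on every row, and I would like to put the first name and surname in a separate, dedicated column: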

first_last_name,column_2....
joe doe,....
patric test,....
franck joe,......
lucash de brown,.....

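Not all names end with a comma, but in the worst case I would be happy to capture only those that do. In the meantime, I thought of first stripping the nameis: prefix: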

df['column'] = df['column'].str.replace(r'nameis: ', '')

and then something like the following, but unfortunately I am still stuck, especially when middle names are involved:

pat=r'([nameis:]+[a-zA-Z])'
df['first_last_name']=df['column'].str.extract(pat,expand=False)
df

Thanks to everyone who helps!

UPDATE:

The string capture works perfectly:

df['column'].str.extract(r'nameis: (?P<first_last_name>[^,]+?)(?:,|\s*and) (?P<column_2>.*)')

I need one more clarification: if the same row contains more than one nameis:, how can I extract the second, third, and so on?

Example:

nameis: joe doe, the student is has excellent marks in the subject of professor nameis: adrian muller, ....
nameis: patric test, in the subject of the teacher nameis: adam joe, ...

With:

df['column'].str.extract(r'nameis: (?P<first_last_name>[^,]+?)(?:,|\s*and) (?P<column_2>.*)')

I can only extract the first nameis:. How can I extract all of them and put them in the same column, separated by commas?

Answer 1

You can use str.extract with a regex that has named capture groups:

import pandas as pd

df = pd.DataFrame({'column': ['nameis: joe doe, the student is....',
                              'nameis: patric test, this question is...',
                              'nameis: franck joe and he is.....',
                              'nameis: lucash de brown and the academic achievement......']})

# Raw string so that \s reaches the regex engine unchanged
df['column'].str.extract(r'nameis: (?P<first_last_name>[^,]+?)(?:,|\s*and) (?P<column_2>.*)')

Output:

   first_last_name                        column_2
0          joe doe              the student is....
1      patric test             this question is...
2       franck joe                      he is.....
3  lucash de brown  the academic achievement......
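Since the question asks for the name in a dedicated column, the extracted frame can be attached back to the original one. The following is a minimal sketch, assuming the same column name and pattern as above; the variable names pattern, extracted and result are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'column': ['nameis: joe doe, the student is....',
                              'nameis: franck joe and he is.....']})

# Same pattern as in the answer; the named groups become column names.
pattern = r'nameis: (?P<first_last_name>[^,]+?)(?:,|\s*and) (?P<column_2>.*)'
extracted = df['column'].str.extract(pattern)

# join aligns on the index, so each row keeps its own extracted values.
result = df.join(extracted)
print(result[['first_last_name', 'column_2']])
```

Rows that do not match the pattern get NaN in both new columns.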

If you only want the names:

print(df['column'].str.extract(r'nameis: (?P<first_last_name>[^,]+?)(?:,|\s*and)'))

Output:

   first_last_name
0          joe doe
1      patric test
2       franck joe
3  lucash de brown
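For the follow-up in the question's update (several nameis: occurrences in one row), one possible approach is str.findall, which returns every match per row, followed by str.join to combine them into a single comma-separated string. This is only a sketch using the two example rows from the update; the column name all_names is an assumption, and the pattern inherits the limitation of the one above (a name that itself contains "and" would be cut short):

```python
import pandas as pd

df = pd.DataFrame({'column': [
    'nameis: joe doe, the student is has excellent marks in the subject '
    'of professor nameis: adrian muller, ....',
    'nameis: patric test, in the subject of the teacher nameis: adam joe, ...',
]})

# One unnamed capture group: findall then yields a list of names per row.
name_pattern = r'nameis: ([^,]+?)(?:,|\s*and)'

# Collect every name in the row and join them with ', '.
df['all_names'] = df['column'].str.findall(name_pattern).str.join(', ')
print(df['all_names'])
```

If one name per row is preferred instead of a joined string, str.extractall with the same pattern returns a frame indexed by row and match number.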
