How to find a string in each row of a DataFrame in PySpark
This is the available dataframe:
+--------------------+
|                Name|
+--------------------+
|Braund, Mr. Owen ...|
|Cumings, Mrs. Joh...|
|Heikkinen, Miss. ...|
|Futrelle, Mrs. Ja...|
|Allen, Mr. Willia...|
|    Moran, Mr. James|
|McCarthy, Mr. Tim...|
|Palsson, Master. ...|
|Johnson, Mrs. Osc...|
+--------------------+
I want to use PySpark to find the first occurrence of the Title and Surname in each row of the DataFrame (the Pandas library is not available on my cluster).
import re
pattern = re.compile(r'(Dr|Mrs?|Ms|Miss|Master|Rev|Capt|Mlle|Col|Major|Sir|Lady|Mme|Don)\.')
# this does not work: df['Name'] is a Column, not a Python string
pattern.match(df['Name'])
If the first word of the Name column is the surname, you can try this; otherwise the regular expression needs a slight adjustment.
from pyspark.sql.functions import regexp_extract, col
#sample data
df= sc.parallelize([["Braund, Mr. Owen"],
["Cumings, Mrs. Joh"],
["Heikkinen, Miss."],
["Futrelle, Mrs. Ja"]]).toDF(["Name"])
# raw string so that \S reaches the regex engine unchanged
df = df.withColumn('Surname', regexp_extract(col('Name'), r'(\S+),.*', 1))
df.show()
Sample data:
+-----------------+
|             Name|
+-----------------+
| Braund, Mr. Owen|
|Cumings, Mrs. Joh|
| Heikkinen, Miss.|
|Futrelle, Mrs. Ja|
+-----------------+
The output is:
+-----------------+---------+
|             Name|  Surname|
+-----------------+---------+
| Braund, Mr. Owen|   Braund|
|Cumings, Mrs. Joh|  Cumings|
| Heikkinen, Miss.|Heikkinen|
|Futrelle, Mrs. Ja| Futrelle|
+-----------------+---------+
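The pattern can be sanity-checked locally without a cluster. The sketch below uses Python's re module and a hypothetical helper, extract, that imitates what regexp_extract does for a single string: it takes the first match and returns an empty string when nothing matches. Spark uses Java regular expressions, so this is only an approximation, although for a pattern this simple the two engines should agree. The last input is made up to show the no-match case.

```python
import re

def extract(pattern, text, idx):
    """Hypothetical helper: imitate regexp_extract for one string."""
    m = re.search(pattern, text)
    # regexp_extract yields '' when the pattern or the group does not match
    if m is None or m.group(idx) is None:
        return ""
    return m.group(idx)

names = [
    "Braund, Mr. Owen",
    "Cumings, Mrs. Joh",
    "Heikkinen, Miss.",
    "Futrelle, Mrs. Ja",
    "no comma here",  # made-up row without a comma
]

surnames = [extract(r'(\S+),.*', name, 1) for name in names]
print(surnames)
```

A row without a comma produces an empty string rather than an error, which is worth keeping in mind before filtering or grouping on the new column.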
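You can use regexp_extract, as @Prem suggested, but with a different regular expression pattern depending on what you need:
from pyspark.sql.functions import regexp_extract
# the title is matched by non-capturing groups; only the word that follows it is captured
# (in this data that word is the given name):
pattern = r'(?:(?:Dr|Mrs?|Ms|Miss|Master|Rev|Capt|Mlle|Col|Major|Sir|Lady|Mme|Don)\.?\s?)(\w+)'
# or capture the title as well as the word that follows it
pattern_with_title = r'((Dr|Mrs?|Ms|Miss|Master|Rev|Capt|Mlle|Col|Major|Sir|Lady|Mme|Don)\.?\s?)(\w+)'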
#sample data
df = spark.createDataFrame([["Braund, Mr. Owen other stuff"],
["Cumings, Mrs. Joh some details"],
["Heikkinen, Miss. Hellen blah"],
["Futrelle, Mrs. Ja .... .... "]], ["Name"])
df.show()
+--------------------+
|                Name|
+--------------------+
|Braund, Mr. Owen ...|
|Cumings, Mrs. Joh...|
|Heikkinen, Miss. ...|
|Futrelle, Mrs. Ja...|
+--------------------+
# create a column with the word captured after the title
# (the column is called "Surname" here, but for this data it holds the given name)
df = df.withColumn("Surname", regexp_extract("Name", pattern, 1))
df.show()
# only the captured word is kept
+--------------------+-------+
|                Name|Surname|
+--------------------+-------+
|Braund, Mr. Owen ...|   Owen|
|Cumings, Mrs. Joh...|    Joh|
|Heikkinen, Miss. ...| Hellen|
|Futrelle, Mrs. Ja...|     Ja|
+--------------------+-------+
# in case you want both the title and the word after it:
# group 0 is the whole match; group 1 would be only the title part, e.g. "Mr. "
df = df.withColumn("Surname with title", regexp_extract("Name", pattern_with_title, 0))
df.show()
+--------------------+-------+------------------+
|                Name|Surname|Surname with title|
+--------------------+-------+------------------+
|Braund, Mr. Owen ...|   Owen|          Mr. Owen|
|Cumings, Mrs. Joh...|    Joh|          Mrs. Joh|
|Heikkinen, Miss. ...| Hellen|      Miss. Hellen|
|Futrelle, Mrs. Ja...|     Ja|           Mrs. Ja|
+--------------------+-------+------------------+
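Which group index is passed to regexp_extract decides what comes back, so it helps to see how the groups in pattern_with_title are numbered. The sketch below lists them with Python's re module for one of the sample rows. It assumes Spark's Java regex engine numbers groups the same way (by opening parenthesis, with 0 as the whole match), which holds for a pattern like this one.

```python
import re

pattern_with_title = (
    r'((Dr|Mrs?|Ms|Miss|Master|Rev|Capt|Mlle|Col|Major|Sir|Lady|Mme|Don)\.?\s?)(\w+)'
)

m = re.search(pattern_with_title, "Braund, Mr. Owen other stuff")

# group 0: whole match, 1: title with period and space, 2: bare title, 3: following word
groups = [m.group(i) for i in range(4)]
for i, g in enumerate(groups):
    print(i, repr(g))
```

So index 0 gives the title together with the following word, index 2 gives the bare title, and index 3 gives the word alone.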
If you need the full name, title, and surname, change the pattern slightly to include those as well, for example:
main_pattern = r'Dr|Mrs?|Ms|Miss|Master|Rev|Capt|Mlle|Col|Major|Sir|Lady|Mme|Don'
# every fragment is a raw string so that \. and \s are not treated as Python escapes
pattern_full = r'(\w+,?\s(' + main_pattern + r')\.?\s?\w+)'
pattern_name = r'(?:(?:' + main_pattern + r')\.?\s?)(\w+)'
pattern_title = r'(?:(' + main_pattern + r')\.?\s?)'
pattern_surname = r'(\w+)(?:\,\s?(?:' + main_pattern + r')\.?\s?)'
df = df.withColumn("Full Name", regexp_extract("Name", pattern_full, 1))
df = df.withColumn("First Name", regexp_extract("Name", pattern_name, 1))
df = df.withColumn("Surname", regexp_extract("Name", pattern_surname, 1))
df = df.withColumn("Title", regexp_extract("Name", pattern_title, 1))
df.show(10, False)
+------------------------------+-----------------------+----------+------------+-----+
|Name                          |Full Name              |Surname   |First Name  |Title|
+------------------------------+-----------------------+----------+------------+-----+
|Braund, Mr. Owen other stuff  |Braund, Mr. Owen       |Braund    |Owen        |Mr   |
|Cumings, Mrs. Joh some details|Cumings, Mrs. Joh      |Cumings   |Joh         |Mrs  |
|Heikkinen, Miss. Hellen blah  |Heikkinen, Miss. Hellen|Heikkinen |Hellen      |Miss |
|Futrelle, Mrs. Ja .... ....   |Futrelle, Mrs. Ja      |Futrelle  |Ja          |Mrs  |
+------------------------------+-----------------------+----------+------------+-----+
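It all comes down to which part of the regular expression you ignore and which part you select. Hope this helps, good luck!
Note: this is not the optimal regex; there is room for improvement.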
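One possible improvement, offered only as a sketch, is a single pattern that captures the surname, the title, and the following word as groups 1, 2, and 3, so the same string can be reused with different group indexes. The names TITLES, combined, and split_name are made up for this example, and the check runs in plain Python; the assumption is that Spark's Java regex engine treats this pattern the same way.

```python
import re

TITLES = r'Dr|Mrs?|Ms|Miss|Master|Rev|Capt|Mlle|Col|Major|Sir|Lady|Mme|Don'

# group 1: surname, group 2: title, group 3: following word (may be empty)
combined = r'(\w+),\s*(' + TITLES + r')\.?\s*(\w*)'

def split_name(name):
    """Hypothetical helper: return (surname, title, given) or empty strings."""
    m = re.search(combined, name)
    if m is None:
        return ("", "", "")
    return (m.group(1), m.group(2), m.group(3))

for name in ["Braund, Mr. Owen other stuff", "Heikkinen, Miss.", "no title"]:
    print(name, "->", split_name(name))
```

In PySpark the same string could then be passed three times, for example regexp_extract("Name", combined, 1) for the surname and index 2 or 3 for the title or the following word.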