Convert string enclosed in %% to lower case in Python
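I have a PySpark DataFrame in which one field has values enclosed in %%..%%. The enclosed content is not consistently cased, and I want to convert it to lowercase.
Below is a snapshot of the DataFrame.
The text in the column looks like this: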
https://www.xxxxxxxx.co.nz/Activities|http://www.xxxxxxxx.co.nz/things-to-do/search?location=%%t.Trip_Intrip_1_dest_City_1%%
https://images.trvl-media.com/media/content/expus/email/2016/us/banner/images/image_stor-34461_09_600x250.jpg|%%mis_lx_Offers_mod_Images.LargeImageURL%%
I want to convert the text above to the following format:
https://www.xxxxxxxx.co.nz/Activities|http://www.xxxxxxxx.co.nz/things-to-do/search?location=%%t.trip_intrip_1_dest_city_1%%
https://images.trvl-media.com/media/content/expus/email/2016/us/banner/images/image_stor-34461_09_600x250.jpg|%%mis_lx_offers_mod_images.largeimageurl%%
Only the strings enclosed in %% should be converted to lowercase.
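Since strings are immutable in Python, you will have to build a new string and assign it. So I think you are best off simply iterating over the string (since you said in the comments that you want to avoid split).
I was thinking of something like this: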
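new = ''
f = 0  # number of '%' characters seen so far
for i in textstr:
    if i == '%':
        f += 1
    # f is 2 or 3 (then 6 or 7, ...) while we are inside a %%...%% span
    if (f // 2) % 2 == 1:
        new += i.lower()
    else:
        new += i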
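The loop above can also be wrapped in a reusable function. The following is only a sketch: the name `lower_between_markers` and the sample string are made up for illustration, and it assumes `%` appears in the text only as part of `%%` markers (a stray `%`, such as URL percent-encoding, would throw the count off). Collecting the characters in a list and joining once avoids repeated string concatenation.

```python
def lower_between_markers(text, marker='%'):
    """Lowercase characters that sit inside %%...%% spans (hypothetical helper)."""
    pieces = []
    seen = 0  # number of marker characters seen so far
    for ch in text:
        if ch == marker:
            seen += 1
        # seen is 2 or 3 (then 6 or 7, ...) while inside a span
        inside = (seen // 2) % 2 == 1
        pieces.append(ch.lower() if inside else ch)
    return ''.join(pieces)


sample = 'http://Example.COM/Path|%%Mis_LX.LargeImageURL%%'
print(lower_between_markers(sample))
```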
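Or use a regular expression.
You can use a simple regular expression to:
- Find all the sequences to replace
- Replace each sequence with its lowercase equivalent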
import re

link1 = 'https://images.trvl-media.com/media/content/expus/email/2016/us/banner/images/image_stor-34461_09_600x250.jpg|%%mis_lx_Offers_mod_Images.LargeImageURL%%'
link2 = 'https://www.xxxxxxxx.co.nz/Activities|http://www.xxxxxxxx.co.nz/things-to-do/search?location=%%t.Trip_Intrip_1_dest_City_1%%'
links = [link1, link2]

for idx, link in enumerate(links):
    # Non-greedy match so each %%...%% pair is found separately
    lowers = re.findall(r'%%.*?%%', link)
    for x in lowers:
        # Replace only this sequence, and keep earlier replacements
        link = link.replace(x, x.lower())
    links[idx] = link

for link in links:
    print(link)
Output:
https://images.trvl-media.com/media/content/expus/email/2016/us/banner/images/image_stor-34461_09_600x250.jpg|%%mis_lx_offers_mod_images.largeimageurl%%
https://www.xxxxxxxx.co.nz/Activities|http://www.xxxxxxxx.co.nz/things-to-do/search?location=%%t.trip_intrip_1_dest_city_1%%
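As a possible variant of the find-then-replace steps above, `re.sub` accepts a function as its replacement argument, so each match can be lowercased in a single pass. This is a minimal sketch; the names `MARKER_SPAN` and `lower_marked` and the sample URL are illustrative assumptions, not part of the original answer.

```python
import re

# Non-greedy so that each %%...%% pair is matched separately
MARKER_SPAN = re.compile(r'%%.*?%%')


def lower_marked(text):
    """Return text with every %%...%% span lowercased (hypothetical helper)."""
    # The callable receives one match object at a time; other text is untouched
    return MARKER_SPAN.sub(lambda m: m.group(0).lower(), text)


print(lower_marked('http://Host/Path?x=%%Foo.Bar%%&y=%%BAZ%%'))
```

Because the replacement is computed per match, a string with several differently cased spans is handled correctly, and the matched text is never reinterpreted as a pattern or replacement template.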
Using the regular expression suggested by @mentalita:
input_df:
>>> df.show(truncate=False)
+----+---------------------------------+
|col1|col2                             |
+----+---------------------------------+
|1   |http://%%FOO%%|some_string%%BAR%%|
|2   |http://%%FOO%%|some_string       |
+----+---------------------------------+
Code:
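import re
from pyspark.sql import functions as F

def convert_to_lower(link):
    # Find every %%...%% span (non-greedy) in the value
    target_strings = re.findall(r'%%.*?%%', link)
    for x in target_strings:
        # Literal replacement; re.sub(x, ...) would treat x as a regex pattern
        link = link.replace(x, x.lower())
    return link

# With no return type given, F.udf defaults to StringType
convert_to_lower_udf = F.udf(lambda x: convert_to_lower(x))
df = df\
    .withColumn('converted_strings', convert_to_lower_udf('col2'))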
output_df:
>>> df.show(truncate=False)
+----+---------------------------------+---------------------------------+
|col1|col2                             |converted_strings                |
+----+---------------------------------+---------------------------------+
|1   |http://%%FOO%%|some_string%%BAR%%|http://%%foo%%|some_string%%bar%%|
|2   |http://%%FOO%%|some_string       |http://%%foo%%|some_string       |
+----+---------------------------------+---------------------------------+
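One case the example data does not cover is a null in the column: Spark hands null cells to a Python UDF as `None`, and `re.findall` raises a `TypeError` on `None`. A null-safe version of the helper might look like the sketch below; the name `convert_to_lower_safe` is hypothetical, and the snippet is kept free of Spark imports so it runs on its own.

```python
import re


def convert_to_lower_safe(link):
    """Lowercase %%...%% spans; pass None (a null cell) through unchanged."""
    if link is None:
        return None
    # Lowercase each non-greedy %%...%% match via a replacement callable
    return re.sub(r'%%.*?%%', lambda m: m.group(0).lower(), link)


print(convert_to_lower_safe('http://%%FOO%%|some_string%%BAR%%'))
print(convert_to_lower_safe(None))
```

It could presumably be registered the same way as the UDF above, for example with `F.udf(convert_to_lower_safe)`.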