PySpark string and list of objects
I have a string
https://hdchjhjedjekdn.com/{}_public.xml with a placeholder, and I have a list of objects:
201611339349202661,
201611309349201761,
201543179349200944,
201631099349200733,
201610909349200511,
201630749349201058,
201601319349200235,
201641069349200909,
201542999349200004,
201611319349201771,
201641329349200119,
201513219349200536,
201543159349201769,
201612029349200631,
201621339349202247,
201611259349200506,
201611829349200301,
201543169349201114,
201543209349204979,
201641039349200509,
201621309349200642,
201512789349200031,
201601939349200520
I want to fill the placeholder with each object in the list,
like this:
s = (https://hdchjhjedjekdn.com/201611339349202661_public.xml, https://hdchjhjedjekdn.com/201611309349201761_public.xml, https://hdchjhjedjekdn.com/201543179349200944_public.xml,........)
Any help doing this with pyspark would be appreciated.
A simple list comprehension should do the trick, since the list is not an RDD:
url_mod = ["https://hdchjhjedjekdn.com/{}_public.xml".format(x) for x in ids]
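The comprehension above assumes `ids` is already an ordinary Python list on the driver. A minimal self-contained sketch of that approach follows; the sample values are the first three IDs from the question, and the variable name `template` is introduced here only for illustration.

```python
# URL template from the question; "{}" is the placeholder that str.format fills.
template = "https://hdchjhjedjekdn.com/{}_public.xml"

# First three IDs from the question, used here as sample data.
ids = [201611339349202661, 201611309349201761, 201543179349200944]

# Build one URL per ID on the driver; no Spark is involved at this point.
url_mod = [template.format(x) for x in ids]

for url in url_mod:
    print(url)
```

For a list of a few dozen IDs this is likely all that is needed; Spark only becomes useful when the IDs already live in an RDD or are too numerous to handle on the driver.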
Alternatively, parallelize your list of IDs and broadcast the URL string to the workers, then apply a map to create the formatted strings:
>>> l = [201611339349202661, 201611309349201761, 201543179349200944, 201631099349200733, 201610909349200511, 201630749349201058, 201601319349200235, 201641069349200909, 201542999349200004, 201611319349201771, 201641329349200119, 201513219349200536, 201543159349201769, 201612029349200631, 201621339349202247, 201611259349200506, 201611829349200301, 201543169349201114, 201543209349204979, 201641039349200509, 201621309349200642, 201512789349200031, 201601939349200520]
>>> rdd = sc.parallelize(l)
>>> rdd.getNumPartitions()
12  # number of partitions; I have used 12 workers
>>> brd_url = sc.broadcast('https://hdchjhjedjekdn.com/{}_public.xml')
>>> rdd1 = rdd.map(lambda x: brd_url.value.format(x))
>>> rdd1.take(2)
['https://hdchjhjedjekdn.com/201611339349202661_public.xml', 'https://hdchjhjedjekdn.com/201611309349201761_public.xml']
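The shell session above relies on the `sc` object that the pyspark shell provides. As a rough sketch, the same steps can be wrapped in a function that receives the SparkContext from the caller. The names `TEMPLATE`, `fill`, and `build_urls_spark` are hypothetical helpers introduced here, not part of PySpark.

```python
# URL template from the question; "{}" is the placeholder to fill.
TEMPLATE = "https://hdchjhjedjekdn.com/{}_public.xml"


def fill(template, value):
    """Substitute one value into the template's "{}" placeholder."""
    return template.format(value)


def build_urls_spark(sc, ids, template=TEMPLATE):
    """Format every ID on the executors and return the URLs as a list.

    `sc` is assumed to be an existing SparkContext supplied by the caller,
    for example the one predefined in the pyspark shell.
    """
    # Ship the template to the executors once as a broadcast variable.
    brd_url = sc.broadcast(template)
    # Distribute the IDs and format each one where it lives.
    rdd = sc.parallelize(ids)
    # collect() pulls every result back to the driver, so this sketch
    # assumes the result is small enough to fit in driver memory.
    return rdd.map(lambda x: fill(brd_url.value, x)).collect()
```

In the pyspark shell this could be called as `build_urls_spark(sc, l)`. For a string this short the broadcast is probably optional, since a value referenced inside the lambda is shipped with the task anyway; broadcasting mainly pays off for larger read-only data.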
Hope this helps.