How to process this list of oddly structured strings with emoji flags into a dictionary?

Suppose we have a list of strings with a defined structure. What is the simplest strategy for parsing such a list into a dictionary?

mylist = [
    'Zynex 0,6',
    'PayPal 11',
    'PetIQ 0,5',
    'First Solar 0,7',
    'Upwork 1%',
    'NV5 Global 0,8',
    'TPI Composites 1',
    'Fiserv 0,5',
]

I am looking for this result:

{
    'Zynex': 0.6,
    'PayPal': 11.0,
    'PetIQ': 0.5,
    'First Solar': 0.7,
    'Upwork': 1.0,
    'NV5 Global': 0.8,
    'TPI Composites': 1.0,
    'Fiserv': 0.5,
}

I hope this code helps:

mylist = ['Zynex 0,6',
 'PayPal 11',
 'PetIQ 0,5',
 'First Solar 0,7',
 'Upwork 1%',
 'NV5 Global 0,8',
 'TPI Composites 1',
 'Fiserv 0,5']

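edited = []
dicto = {}
# Drop the first two characters of each string; this assumes every entry
# starts with a flag emoji made of two code points.
for val in mylist:
    new_val = val[2:]
    edited.append(new_val)
# Split on the last space: the left part is the key, the right part is the value.
for val in edited:
    tmp = val.rsplit(' ', 1)
    dicto[tmp[0]] = tmp[1]  # the value is still a string here, e.g. '0,6'
print(dicto)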

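I'm assuming that the structure has a numeric part with no spaces as the last component of the string, and that you want to remove the 'us' from the leading part of the string.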

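The basic process you want is to iterate over the original list and do the following on each pass:

  1. Split the string into a key part and a value part.
  2. Clean the value of unwanted characters.
  3. Clean the key of unwanted characters.
  4. Add the key:value pair to the dictionary.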

Something like this, although I have not handled the value with the percent sign:

my_list = ['Zynex 0,6',
 'PayPal 11',
 'PetIQ 0,5',
 'First Solar 0,7',
 #'Upwork 1%',
 'NV5 Global 0,8',
 'TPI Composites 1',
 'Fiserv 0,5']

## strip the 'us' (assumed to be the first two characters of each string)
my_list = [x[2:] for x in my_list]
##create a dictionary
my_dict = {}

## Now iterate over my_list and add key,value pairs to my_dict.
for e in my_list:
    ## make a list of the string, split on whitespace
    e = e.split()
    ## get the final element as value
    value = e[-1]
    ## replace commas with periods in value
    ## and convert to a float.
    value = float(value.replace(',', '.'))
    ## join the rest of e into the key part.
    key = ' '.join(e[:-1])
    my_dict[key] = value
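
The loop above leaves out the entry with the percent sign. As a rough sketch of the same four steps that also covers that case, something like the following might work. The helper name `parse_entry` is hypothetical, and the sample entry with a flag (written with escape sequences) is an assumption about what the original strings looked like; a trailing `%` is assumed to be droppable without rescaling the number, as in the desired result.

```python
def parse_entry(entry):
    """Split one 'name value' string into (name, float_value)."""
    # Step 1: split on the last space so multi-word names stay intact.
    key, raw_value = entry.rsplit(" ", 1)
    # Step 2: clean the value: drop a trailing '%' and use '.' as the decimal mark.
    value = float(raw_value.rstrip("%").replace(",", "."))
    # Step 3: clean the key: keep everything from the first ASCII letter or
    # digit onward, which drops a leading flag emoji if one is present.
    start = next((i for i, ch in enumerate(key) if ch.isascii() and ch.isalnum()), 0)
    return key[start:].strip(), value


# Assumed sample data; the third entry starts with an escaped flag emoji.
entries = ["Upwork 1%", "First Solar 0,7", "\U0001F1FA\U0001F1F8Zynex 0,6"]
# Step 4: add each key:value pair to the dictionary.
parsed = dict(parse_entry(e) for e in entries)
print(parsed)  # {'Upwork': 1.0, 'First Solar': 0.7, 'Zynex': 0.6}
```

Scanning for the first ASCII letter or digit avoids hard-coding the length of the prefix, so entries with and without a flag go through the same path.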

It's actually quite simple:

import re

mylist = [
    'Zynex 0,6',
    'PayPal 11',
    'PetIQ 0,5',
    'First Solar 0,7',
    'Upwork 1%',
    'NV5 Global 0,8',
    'TPI Composites 1',
    'Fiserv 0,5',
]

res = {}
for elem in mylist:
    key, val = re.sub(r"[^A-Za-z0-9, ]", "", elem).rsplit(" ", 1)
    res[key] = float(val.replace(",", "."))
 
print(res)

Output:

{'Zynex': 0.6, 'PayPal': 11.0, 'PetIQ': 0.5, 'First Solar': 0.7, 'Upwork': 1.0, 'NV5 Global': 0.8, 'TPI Composites': 1.0, 'Fiserv': 0.5}
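
To see what the `re.sub` call does on its own, here is a small illustration. The input strings are made up for the demonstration (the flag is written with escape sequences, and `AT&T` is a hypothetical entry that is not in the list above).

```python
import re

# Same character class as in the answer: keep letters, digits, commas, spaces.
pattern = r"[^A-Za-z0-9, ]"

# A flag emoji and a percent sign are both removed.
cleaned = re.sub(pattern, "", "\U0001F1FA\U0001F1F8Upwork 1%")
print(cleaned)  # Upwork 1

# Caveat: punctuation inside a name is removed as well.
cleaned_name = re.sub(pattern, "", "AT&T 1,2")
print(cleaned_name)  # ATT 1,2
```

So the one-liner fits the sample data, but names that contain other punctuation would probably need a wider character class.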

Edit: Per your comment, you also want a text representation of the flag emoji. A rough solution looks like this:

def flag_to_str(emoji):
    return "".join(chr(c - 101) for c in emoji.encode()[3::4])


print(flag_to_str("🇺🇸"))  # US
print(flag_to_str("🇫🇮"))  # FI

# How it works:
print("🇺🇸".encode())  # b'\xf0\x9f\x87\xba\xf0\x9f\x87\xb8'
print("🇺🇸".encode()[3::4])  # b'\xba\xb8'
print("🇺🇸".encode()[3::4][0])  # 186
print(chr("🇺🇸".encode()[3::4][0] - 101))  # U

Explanation: Most flag emoji are encoded as a sequence of two regional indicator symbols. E.g. 🇺🇸 is 🇺 + 🇸, and in hexadecimal that is represented as f0 9f 87 ba f0 9f 87 b8 (https://onlineutf8tools.com/convert-utf8-to-hexadecimal?input=&prefix=false&padding=false&spacing=true). From there we can see that each regional symbol starts with f0 9f 87, and that the fourth byte is the code of the equivalent ASCII uppercase character (https://www.asciitable.com) plus 101₁₀. Hence 0xba = 186₁₀, and 186₁₀ - 101₁₀ = 85₁₀, which is U.
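
The byte slicing only works when the string contains nothing but regional indicator symbols. A possibly sturdier variant, sketched here on code points instead of UTF-8 bytes, uses the fact that the regional indicator symbols for A to Z occupy U+1F1E6 to U+1F1FF. The names `flag_to_code` and `REGIONAL_INDICATOR_A` are hypothetical and not part of the answer above.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # code point of REGIONAL INDICATOR SYMBOL LETTER A


def flag_to_code(flag):
    """Map each regional indicator symbol in `flag` to its ASCII letter."""
    return "".join(
        chr(ord(ch) - REGIONAL_INDICATOR_A + ord("A"))
        for ch in flag
        # Skip anything that is not one of the 26 regional indicator symbols.
        if REGIONAL_INDICATOR_A <= ord(ch) <= REGIONAL_INDICATOR_A + 25
    )


print(flag_to_code("\U0001F1FA\U0001F1F8"))  # US
print(flag_to_code("\U0001F1EB\U0001F1EE"))  # FI
```

Because other characters are skipped, this can be called on a whole entry such as a flag followed by a company name and should still return only the two-letter code.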