How do I split a string into a list and merge two known tokens into one in Python?

Question

Given a string such as:

"Today is a bright sunny day in New York"

I want my list to be:

['Today','is','a','bright','sunny','day','in','New York']

Another example:

"This is a hello world program"

The list should be: ['This', 'is', 'a', 'hello world', 'program']

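For each given string S, we have an entity E whose tokens must be kept together. In the first example the entity E is "New", "York"; in the second it is "hello", "world".

I tried to do this with a regular expression, but I could not get it to split on spaces and keep the two entity tokens merged at the same time.

Example: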

import re
regex = "(navy blue)|[a-zA-Z0-9]*"
match = re.findall(regex, "the sky looks navy blue.", re.IGNORECASE)
print(match)

Output: ['', '', '', '', '', '', 'navy blue', '', '']

Answer 1

Try this:

text = "Today is a bright sunny day in New York"
new_list = list(map(str, text.split(" ")))

This should give you the following output: ['Today', 'is', 'a', 'bright', 'sunny', 'day', 'in', 'New', 'York']

The same goes for the next string:

hello = "This is a hello world program."
yet_another_list = list(map(str, hello.split(" ")))

which gives you ['This', 'is', 'a', 'hello', 'world', 'program.']
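Splitting alone leaves the entity's tokens as separate items, so the list still needs a merge step to match what the question asks for. The following is a minimal sketch of one way to do that after the split. The helper name `merge_entity` is hypothetical, and it assumes the entity is given as a list of tokens that appear next to each other in the text.

```python
def merge_entity(tokens, entity):
    """Join every adjacent run of tokens equal to `entity` into one item."""
    n = len(entity)
    merged = []
    i = 0
    while i < len(tokens):
        # Compare a window of n tokens against the entity's tokens.
        if n > 0 and tokens[i:i + n] == entity:
            merged.append(" ".join(entity))
            i += n
        else:
            merged.append(tokens[i])
            i += 1
    return merged


text = "Today is a bright sunny day in New York"
print(merge_entity(text.split(), ["New", "York"]))
# ['Today', 'is', 'a', 'bright', 'sunny', 'day', 'in', 'New York']
```

The comparison is exact, so a token with trailing punctuation such as 'York.' would not be merged by this sketch.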

Answer 2

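Use re.findall instead of split, and put the entity first in the alternation, ahead of the character class that describes the strings to extract. The order matters because alternation tries its branches from left to right.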

>>> import re
>>> s = "Today is a bright sunny day in New York"
>>> re.findall(r'New York|\w+', s)
['Today', 'is', 'a', 'bright', 'sunny', 'day', 'in', 'New York']

>>> s = "This is a hello world program"
>>> re.findall(r'hello world|\w+', s)
['This', 'is', 'a', 'hello world', 'program']

Change \w to an appropriate character class, for example [a-zA-Z].

For the additional example added to the question:
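When there are several entities, the same alternation can be built from a list instead of being written by hand. This is only a sketch: the function name `tokenize_keeping_entities` is hypothetical, and it assumes the entities are plain strings that should be matched literally, which is why each one goes through `re.escape`.

```python
import re


def tokenize_keeping_entities(text, entities):
    """Return the words of `text`, keeping each known entity as one item."""
    # Longest entities first, so "New York City" is tried before "New York".
    ordered = sorted(entities, key=len, reverse=True)
    # Escape each entity so it is matched literally, then fall back to \w+.
    alternatives = [re.escape(entity) for entity in ordered]
    pattern = "|".join(alternatives + [r"\w+"])
    return re.findall(pattern, text)


print(tokenize_keeping_entities("This is a hello world program", ["hello world"]))
# ['This', 'is', 'a', 'hello world', 'program']
```

The sketch adds no word boundaries, so an entity can also match at the start of a longer word; wrapping each alternative in \b would be one way to tighten that.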

>>> regex = r"navy blue|[a-z\d]+"
>>> re.findall(regex, "the sky looks navy blue.", re.IGNORECASE)
['the', 'sky', 'looks', 'navy blue']

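Using raw strings (the r prefix) to build regex patterns is good practice.
No grouping is needed here.
Use + instead of * so that at least one character is matched.
Since re.IGNORECASE is specified, either a-z or A-Z in the character class is enough. re.I can also be used as a shortcut.
\d is a shortcut for [0-9].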
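The notes about grouping and about + versus * can be checked side by side. The sketch below is illustrative only; the variable names are made up for this example. It runs three variants of the pattern on the sentence from the question.

```python
import re

text = "the sky looks navy blue."

# With one capturing group, findall returns only the group's text, so a
# match that came from the other branch is reported as ''.
with_group = re.findall(r"(navy blue)|[a-z\d]+", text, re.I)

# Without a group, findall returns the whole match.
without_group = re.findall(r"navy blue|[a-z\d]+", text, re.I)

# With * instead of +, the character class can also match zero characters,
# which adds empty strings at the spaces and around the final period.
with_star = re.findall(r"navy blue|[a-z\d]*", text, re.I)

print(with_group)     # ['', '', '', 'navy blue']
print(without_group)  # ['the', 'sky', 'looks', 'navy blue']
print(with_star)
```

The pattern in the question combined both effects, which is why its output was mostly empty strings.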

Answer 3

"this is hello word program".split(' ')

split automatically produces a list. You can split on any string, word, or character.
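As a quick illustration of splitting on different separators, here is a small sketch using only `str.split`. The sample strings are made up for this example, and the first one contains a double space on purpose.

```python
text = "this is  hello world program"  # note the double space

by_space = text.split(" ")       # explicit separator: a double space yields ''
by_whitespace = text.split()     # no argument: splits on runs of whitespace
by_comma = "a,b,c".split(",")    # single-character separator
by_word = "one and two and three".split(" and ")  # multi-character separator

print(by_space)       # ['this', 'is', '', 'hello', 'world', 'program']
print(by_whitespace)  # ['this', 'is', 'hello', 'world', 'program']
print(by_comma)       # ['a', 'b', 'c']
print(by_word)        # ['one', 'two', 'three']
```

None of these calls keeps "hello world" together as one item; split only cuts the string at the separator.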

