Separating a Python string by character while keeping inline tags intact

Question

I am trying to write a custom tokenizer in Python that works with inline tags. The goal is to take a string input like this:

'This is *tag1* a test *tag2*.'

and have it output a list split into tags and individual characters:

['T', 'h', 'i', 's', ' ', 'i', 's', ' ', '*tag1*', ' ',  'a', ' ', 't', 'e', 's', 't', ' ', '*tag2*', '.']

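Without tags I would just use list(). I think I found a solution that handles a single tag type, but there are several. There are also other multi-character segments, such as ellipses, that should be encoded as a single feature.
One thing I tried was using a regex to replace each tag with a single unused character and then calling list() on the string: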

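import re

text = 'This is *tag1* a test *tag2*.'
# re.findall collects every tag; re.match only matches at the start of the string and would return None here
tags = re.findall(r'\*.*?\*', text)
text = re.sub(r'\*.*?\*', '#', text)
text = list(text)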

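I would then iterate over the list and replace each '#' with the extracted tag, but I have several different features to extract, and repeating this process with a different placeholder for each one before splitting the string seems like bad practice. Is there an easier way to do this? I am still quite new to this, so there are many common approaches I don't know about. I suppose I could also use one larger regex that covers all of the features I am trying to extract, but that still feels hacky, and I would prefer something more modular that can be used to find other features without writing a new expression every time.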
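For reference, the placeholder idea described above could be completed roughly as follows. This is only a sketch: the names TAG, PLACEHOLDER, and tokens are illustrative, and it assumes the '#' character never occurs in the input text.

```python
import re

TAG = r'\*.*?\*'
PLACEHOLDER = '#'  # assumption: this character never appears in the real text

text = 'This is *tag1* a test *tag2*.'

# Collect the tags in order, then swap each one for the placeholder.
tags = iter(re.findall(TAG, text))
chars = list(re.sub(TAG, PLACEHOLDER, text))

# Walk the characters and put the next saved tag back wherever a placeholder sits.
tokens = [next(tags) if ch == PLACEHOLDER else ch for ch in chars]
print(tokens)
```

Each additional feature type would need its own unused placeholder, which is the repetition the question is trying to avoid.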

Answer 1

You can use the following regex with re.findall:

\*[^*]*\*|.

See the regex demo. The re.S (re.DOTALL) flag can be used with this pattern so that . also matches line break characters, which it does not match by default.

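Details

\*[^*]*\* - a * character, followed by zero or more characters other than *, and then a *
| - or
. - any single character (including line breaks when re.S is set).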
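To see why the re.S flag matters, the small comparison below runs the same pattern with and without it on a string containing a newline. The sample string and variable names are made up for illustration.

```python
import re

PATTERN = r'\*[^*]*\*|.'
s = 'a\n*tag1*'

# Without re.S, "." cannot match "\n", so the newline silently drops out of the result.
without_flag = re.findall(PATTERN, s)

# With re.S, "." also matches "\n", so every character of the input is preserved.
with_flag = re.findall(PATTERN, s, re.S)

print(without_flag)  # ['a', '*tag1*']
print(with_flag)     # ['a', '\n', '*tag1*']
```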

See the Python demo:

import re
s = 'This is *tag1* a test *tag2*.'
print( re.findall(r'\*[^*]*\*|.', s, re.S) )
# => ['T', 'h', 'i', 's', ' ', 'i', 's', ' ', '*tag1*', ' ', 'a', ' ', 't', 'e', 's', 't', ' ', '*tag2*', '.']
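Since the question asks for something more modular, one possible extension is to keep the multi-character features in a list and build the alternation from it. The sketch below is not part of the original answer: FEATURE_PATTERNS, build_tokenizer, and tokenize are hypothetical names, and the ellipsis pattern assumes an ellipsis is written as exactly three dots.

```python
import re

# Hypothetical feature patterns. Order matters: earlier alternatives are tried first.
FEATURE_PATTERNS = [
    r'\*[^*]*\*',  # inline tags such as *tag1*
    r'\.{3}',      # a three-dot ellipsis kept as a single token
]

def build_tokenizer(patterns):
    """Compile one regex that tries each feature pattern, then falls back to any single character."""
    alternatives = [f'(?:{p})' for p in patterns] + ['.']
    return re.compile('|'.join(alternatives), re.S)

def tokenize(text, tokenizer):
    """Return the list of feature tokens and single characters found in text."""
    return tokenizer.findall(text)

tokenizer = build_tokenizer(FEATURE_PATTERNS)
tokens = tokenize('Wait... *tag1* ok.', tokenizer)
print(tokens)
# ['W', 'a', 'i', 't', '...', ' ', '*tag1*', ' ', 'o', 'k', '.']
```

Supporting a new feature then means appending one pattern to the list instead of rewriting the whole expression. Each pattern is wrapped in a non-capturing group so that re.findall keeps returning whole matches.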

Answer 2

I am not sure what will work best for you, but you should be able to get what you want with either the split() method or the .format() method shown below.

# you can use this to get what you need
txt = 'This is *tag1* a test *tag2*.'
x = txt.split("*")  # splits at every * (the asterisks themselves are removed)
print(x)
x = txt.split()  # splits into words at whitespace
print(x)

# also, you may be looking for something like this to format a string
mystring = 'This is {} a test {}.'.format('*tag1*', '*tag2*')
print(mystring)


# using split to get ['T', 'h', 'i', 's', ' ', 'i', 's', ' ', '*tag1*', ' ', 'a', ' ', 't', 'e', 's', 't', ' ', '*tag2*', '.']
txt = 'This is *tag1* a test *tag2*.'
pieces = txt.split("*")  # splits at every *; the asterisks themselves are dropped

finallist = []  # initialize the list
for index, piece in enumerate(pieces):
    if index % 2 == 1:
        # odd-numbered pieces sat between a pair of asterisks, so they are tags
        # (this assumes every opening * has a matching closing *)
        finallist.append("*" + piece + "*")
    else:
        # everything else is plain text: add it one character at a time
        for letter in piece:
            finallist.append(letter)

print(finallist)
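A related option, sketched here as an alternative to str.split, is re.split with a capturing group, which keeps the matched tags in the result instead of discarding the delimiters. The function name split_keep_tags is made up for this example.

```python
import re

def split_keep_tags(text):
    """Split text into single characters while keeping *tag* segments whole."""
    # Because the pattern is wrapped in a capturing group, re.split returns
    # the tags themselves as separate items between the plain-text pieces.
    pieces = re.split(r'(\*[^*]*\*)', text)
    tokens = []
    for index, piece in enumerate(pieces):
        if index % 2 == 1:
            tokens.append(piece)  # odd positions hold the captured tags
        else:
            tokens.extend(piece)  # even positions hold plain text, added per character
    return tokens

result = split_keep_tags('This is *tag1* a test *tag2*.')
print(result)
```

Unlike splitting on a bare "*", this leaves an unmatched asterisk in the plain text, where it is emitted as an ordinary character.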

Tags: python, regex, nlp, data-cleaning