Is there a better way to tokenize some strings?

I was trying to write string tokenization code in Python for some NLP work and came up with this:

str = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
s= []
a=0
for line in str:
    s.append([])
    s[a].append(line.split())
    a+=1
print(s)

The output is:

[[['I', 'am', 'Batman.']], [['I', 'loved', 'the', 'tea.']], [['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]]

As you can see, the list now has an extra dimension. For example, if I want the word 'Batman', I have to type s[0][0][2] instead of s[0][2], so I changed the code to:

str = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
s= []
a=0
m = []
for line in str:
    s.append([])
    m=(line.split())
    for word in m:
        s[a].append(word)
    a += 1
print(s)

This gives me the correct output:

[['I', 'am', 'Batman.'], ['I', 'loved', 'the', 'tea.'], ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]

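But I feel this could work with a single loop: the dataset I will be importing is quite large, and a complexity of n would be much better than n^2. So, is there a better way to do this, or a way to do it with one loop?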

You should call split() on each string in a loop.

List comprehension example:

str = ['I am Batman.','I loved the tea.','I will never go to that mall again!']

[s.split() for s in str]

[['I', 'am', 'Batman.'],
 ['I', 'loved', 'the', 'tea.'],
 ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]

See this:

>>> list1 = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
>>> [i.split() for i in list1]  
# split() by default splits on whitespace and returns the pieces as a list

[['I', 'am', 'Batman.'], ['I', 'loved', 'the', 'tea.'], ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]
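
To illustrate the "splits on whitespace" point, here is a small sketch; the sample string `text` is made up for the illustration and is not from the question. With no argument, split() treats any run of whitespace (spaces, tabs, newlines) as one separator and ignores leading and trailing whitespace, while an explicit ' ' separator does not. Punctuation stays attached to the word in both cases.

```python
# Hypothetical sample string with irregular whitespace (not from the question)
text = '  I   am\tBatman.\n'

# No argument: runs of whitespace collapse, leading/trailing whitespace is dropped
default_tokens = text.split()
print(default_tokens)   # ['I', 'am', 'Batman.']

# Explicit single-space separator: empty strings appear and the tab is not split on
space_tokens = text.split(' ')
print(space_tokens)     # ['', '', 'I', '', '', 'am\tBatman.\n']
```

So for plain tokenizing the argument-free form is usually what you want.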

Your original code was almost there.

>>> str = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
>>> s=[]
>>> for line in str:
...   s.append(line.split())
...
>>> print(s)
[['I', 'am', 'Batman.'], ['I', 'loved', 'the', 'tea.'], ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]

line.split() gives you a list, so append that directly in your loop. Or go straight for a comprehension:

[line.split() for line in str]

When you say s.append([]), you get an empty list at index 'a', like this:

L = []

If you then append the result of split to that list, e.g. L.append([1]), you end up with a list inside that list: [[1]]
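
A minimal sketch of that difference, using a shortened list named `sentences` (a name picked here to avoid shadowing the built-in str; it is not from the question). It also shows list.extend(), which adds the words one at a time and so does what the inner loop in the second version was doing by hand.

```python
# Shortened sample data; the name `sentences` is an assumption for this sketch
sentences = ['I am Batman.', 'I loved the tea.']

# append() adds the whole list returned by split() as ONE element,
# so appending it to a fresh [] creates an extra level of nesting
nested = []
for line in sentences:
    nested.append([])
    nested[-1].append(line.split())

# extend() adds each word individually, so there is no extra level
flat = []
for line in sentences:
    flat.append([])
    flat[-1].extend(line.split())

print(nested[0])  # [['I', 'am', 'Batman.']]
print(flat[0])    # ['I', 'am', 'Batman.']
```

Either way, appending line.split() directly, as in the loop above, avoids creating the empty list in the first place.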