Map string to integers with position in ASCII table

Question

I have a string like this:

word = 'python'

Based on string.ascii_lowercase, I want to create a new array that looks like this:

[15, 24, 19, 7, 14, 13]

My solution to this problem is the following:

import string
alphabet = {char: i for i, char in enumerate(string.ascii_lowercase)}
indices = [alphabet[char] for char in word]
print(indices)

Output: [15, 24, 19, 7, 14, 13]

But I am looking for a more efficient way that does not use a loop. How can I do this in a vectorized way?

Answer 1

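One approach is to use np.fromiter with a 'U1' dtype to build an array from the string, view it as integers, and then subtract the position in the Unicode table where the alphabet starts, 97. Alternatively, we can just use ord('a'), as suggested by Antoine Dubuis: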

import numpy as np
word = 'python'
np.fromiter(word, dtype='U1').view(np.uint32) - ord('a')
array([15, 24, 19,  7, 14, 13])
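
To make the view trick easier to follow, here is a step-by-step sketch. It swaps np.fromiter for np.array(list(word)), which also yields a 'U1' array, and adds a cast to a signed type; both are my assumptions for illustration, not part of the answer above.

```python
import numpy as np

word = 'python'

# np.array(list(word)) yields a 'U1' array: each element stores one
# character as a 4-byte Unicode code point.
chars = np.array(list(word))

# Reinterpret the same memory as unsigned 32-bit integers (a view, not a copy).
codes = chars.view(np.uint32)

# Shift so that 'a' maps to 0. Casting to a signed type first avoids
# unsigned wraparound for characters that sort before 'a'.
indices = codes.astype(np.int64) - ord('a')
print(indices.tolist())  # [15, 24, 19, 7, 14, 13]
```

The view works only because a 'U1' element and a uint32 have the same item size of 4 bytes.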

Answer 2

We can use np.frombuffer for better efficiency:

import numpy as np

np.frombuffer(word.encode(), dtype=np.uint8)-97
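
The one-liner relies on every character encoding to exactly one byte, which holds for ASCII input. A slightly more defensive sketch might look like the following; the helper name letter_indices and the signed cast are hypothetical additions of mine, not part of the answer.

```python
import numpy as np

def letter_indices(word: str) -> np.ndarray:
    """Hypothetical helper: map each character to its offset from 'a'."""
    # encode('ascii') raises UnicodeEncodeError on non-ASCII input, so a
    # multi-byte character cannot silently produce extra array elements.
    codes = np.frombuffer(word.encode('ascii'), dtype=np.uint8)
    # Cast to a signed type before subtracting so that characters below
    # 'a' (such as uppercase letters) give negative values, not wraparound.
    return codes.astype(np.int64) - ord('a')

indices = letter_indices('python')
print(indices.tolist())  # [15, 24, 19, 7, 14, 13]
```

The cast costs one extra copy, so for input known to be lowercase ASCII the plain one-liner is likely the faster choice.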

Timings on a string of 1M characters:

In [23]: import string

In [24]: p = string.ascii_lowercase

In [25]: word = ''.join([p[i] for i in np.random.randint(0,len(p), 1000000)])

In [26]: %timeit np.frombuffer(word.encode(), dtype=np.uint8)-97
136 µs ± 1.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# @yatu's soln with np.fromiter
In [27]: %timeit np.fromiter(word, dtype='U1').view(np.uint32) - ord('a')
24.8 ms ± 423 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
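
Outside IPython, a rough way to repeat this kind of comparison is the standard timeit module. The sketch below is only an illustration: the function names, the shorter 100,000-character string, and the repeat count are my assumptions, and absolute numbers will differ from machine to machine.

```python
import string
import timeit

import numpy as np

# Build a random lowercase test string (shorter than 1M to keep the run quick).
rng = np.random.default_rng(0)
letters = np.array(list(string.ascii_lowercase))
word = ''.join(rng.choice(letters, size=100_000))

def with_dict(w):
    # Loop-based approach from the question.
    alphabet = {c: i for i, c in enumerate(string.ascii_lowercase)}
    return [alphabet[c] for c in w]

def with_frombuffer(w):
    # Vectorized approach: read the encoded bytes directly as uint8.
    return np.frombuffer(w.encode(), dtype=np.uint8) - 97

for fn in (with_dict, with_frombuffer):
    seconds = timeit.timeit(lambda: fn(word), number=10)
    print(f"{fn.__name__}: {seconds / 10 * 1e3:.3f} ms per call")
```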
