What numpy structures are expected as inputs to numpy.char functions?

Question

Consider a numpy array made up of arrays of strings (at least, this is my closest attempt at building one):

ff = np.array([['a:bc','d:ef'],['g:hi','j:kl']])
print(ff.dtype)
<U4

But these apparently cannot be used with the numpy.char methods?

ffc = ff.astype('S5')
fff = np.char.split(ffc,':')[1]


Traceback (most recent call last):
  File "<input>", line 3, in <module>
  File "/usr/local/lib/python3.7/site-packages/numpy/core/defchararray.py", line 1447, in split
    a, object_, 'split', [sep] + _clean_args(maxsplit))
TypeError: a bytes-like object is required, not 'numpy.str_'

What is the difference between the types <U4 and numpy.str_? How can the strings shown above be parsed by the np.char.* functions?

Answer 1

First, the np.char functions are meant for chararrays, which should be constructed with np.char.array or np.char.asarray (see the docs).

So your code would work like this:

ff = np.array([['a:bc','d:ef'],['g:hi','j:kl']])
ffc = np.char.asarray(ff)
fff = np.char.split(ffc, ':')[1]

print(fff)

Output:

[list(['g', 'hi']) list(['j', 'kl'])]

This conversion is performed implicitly, so the following also works:

ff = np.array([['a:bc','d:ef'],['g:hi','j:kl']])
fff = np.char.split(ff, ':')[1]
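
As a minimal sketch of the implicit case (assuming a NumPy version in which np.char is available; the variable names are illustrative only), a plain ndarray with a unicode dtype can be passed straight to np.char.split, and the result is an object array of Python lists:

```python
import numpy as np

# A plain ndarray of unicode strings; no chararray is built by hand.
ff = np.array([['a:bc', 'd:ef'], ['g:hi', 'j:kl']])

# The separator is a str, matching the unicode dtype of the array.
parts = np.char.split(ff, ':')

print(type(ff).__name__, ff.dtype.kind)  # class name and dtype kind of the input
print(parts.dtype)                       # dtype of the result
print(parts[1, 0])                       # one element of the result, a Python list
```

Because each element is a list, indexing with [1] as in the question selects the second row of lists, not the second piece of each string.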

For completeness, regarding your side question about <U4 versus S5:

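In a numpy dtype, U denotes a unicode string, which is the recommended way of representing strings. S, on the other hand, denotes a null-terminated array of bytes (fixed width, padded with null bytes).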

I suspect the string methods are executed on Python objects, so you need a Python string-like type (one that knows its own length, etc.) rather than a "dumb" C-style byte array.
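
To make the distinction concrete, here is a small sketch (my own illustration, not taken from the NumPy docs): <U4 and S5 are dtypes of the array, whereas numpy.str_ and numpy.bytes_ are the scalar types you get when you index a single element. Those scalars subclass the Python str and bytes types, which is why the separator has to be of the matching kind.

```python
import numpy as np

ff = np.array([['a:bc', 'd:ef'], ['g:hi', 'j:kl']])
ffc = ff.astype('S5')  # same text stored as fixed-width bytes

# dtype.kind is 'U' for unicode arrays and 'S' for byte-string arrays.
print(ff.dtype.kind, ffc.dtype.kind)

# Indexing yields NumPy scalars that subclass the Python types.
u_item = ff[0, 0]    # numpy.str_, a subclass of str
s_item = ffc[0, 0]   # numpy.bytes_, a subclass of bytes
print(isinstance(u_item, str), isinstance(s_item, bytes))

# The ordinary Python methods apply, but the separator type must match.
u_split = u_item.split(':')
s_split = s_item.split(b':')
print(u_split, s_split)
```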

Answer 2

The string type of the argument must match the string type in the array:

In [44]: ff = np.array([['a:bc','d:ef'],['g:hi','j:kl']])                            
In [45]: ff                                                                          
Out[45]: 
array([['a:bc', 'd:ef'],
       ['g:hi', 'j:kl']], dtype='<U4')
In [46]: np.char.split(ff,':')                                                       
Out[46]: 
array([[list(['a', 'bc']), list(['d', 'ef'])],
       [list(['g', 'hi']), list(['j', 'kl'])]], dtype=object)
In [47]: np.char.split(ff.astype('S5'),b':')                                         
Out[47]: 
array([[list([b'a', b'bc']), list([b'd', b'ef'])],
       [list([b'g', b'hi']), list([b'j', b'kl'])]], dtype=object)

'U4' is unicode, the default string type in Python 3. 'S4' is a bytestring, the default string type in Python 2. b':' is a bytestring and u':' is unicode.

np.char.split is a bit awkward to use, because the result has object dtype and holds lists of the split strings.

To get 2 separate arrays, I would use frompyfunc to apply unpacking:

In [50]: np.frompyfunc(lambda alist: tuple(alist), 1,2)(_46)                         
Out[50]: 
(array([['a', 'd'],
        ['g', 'j']], dtype=object), array([['bc', 'ef'],
        ['hi', 'kl']], dtype=object))
In [51]: np.frompyfunc(lambda alist: tuple(alist), 1,2)(_47)                         
Out[51]: 
(array([[b'a', b'd'],
        [b'g', b'j']], dtype=object), array([[b'bc', b'ef'],
        [b'hi', b'kl']], dtype=object))

To get string dtype arrays, though, I still have to use astype:

In [52]: _50[0].astype('U4')                                                         
Out[52]: 
array([['a', 'd'],
       ['g', 'j']], dtype='<U4')
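
Put together as a script rather than an IPython session, the unpack-then-convert steps might look like the sketch below. The names unpack, heads and tails are mine, and it assumes every string splits into exactly two pieces.

```python
import numpy as np

ff = np.array([['a:bc', 'd:ef'], ['g:hi', 'j:kl']])
parts = np.char.split(ff, ':')   # object array of 2-element lists

# One input, two outputs: each list is unpacked into a pair of values.
unpack = np.frompyfunc(lambda alist: tuple(alist), 1, 2)
heads_obj, tails_obj = unpack(parts)   # two object-dtype arrays

# astype(str) converts to a unicode dtype sized to the longest element.
heads = heads_obj.astype(str)
tails = tails_obj.astype(str)
print(heads)
print(tails)
```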

I can combine the unpacking and the astype step with np.vectorize by supplying otypes (even a mix of dtypes!):

In [53]: np.vectorize(lambda alist:tuple(alist), otypes=['U4','S4'])(_46)            
Out[53]: 
(array([['a', 'd'],
        ['g', 'j']], dtype='<U1'), array([[b'bc', b'ef'],
        [b'hi', b'kl']], dtype='|S2'))

Generally, frompyfunc is faster than vectorize.
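
If you want to check that on your own machine, a rough timing sketch follows. The array size and repeat count are arbitrary choices of mine, and the numbers will vary with the NumPy version and hardware, so treat it as a way to measure rather than as a benchmark result.

```python
import timeit
import numpy as np

ff = np.array([['a:bc', 'd:ef'], ['g:hi', 'j:kl']])
# Tile the small example into a 200 x 200 array so the timing is measurable.
parts = np.char.split(np.tile(ff, (100, 100)), ':')

unpack_fpf = np.frompyfunc(lambda alist: tuple(alist), 1, 2)
unpack_vec = np.vectorize(lambda alist: tuple(alist), otypes=[object, object])

# Time five calls of each variant on the same input.
t_fpf = timeit.timeit(lambda: unpack_fpf(parts), number=5)
t_vec = timeit.timeit(lambda: unpack_vec(parts), number=5)
print(f"frompyfunc: {t_fpf:.4f}s  vectorize: {t_vec:.4f}s")
```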

This unpacking will not work if the split produces lists of differing lengths:

In [54]: ff = np.array([['a:bc','d:ef'],['g:hi','j:kl:xyz']])                        
In [55]: np.char.split(ff,':')                                                       
Out[55]: 
array([[list(['a', 'bc']), list(['d', 'ef'])],
       [list(['g', 'hi']), list(['j', 'kl', 'xyz'])]], dtype=object)
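
A different technique that sidesteps the ragged lists, if splitting at the first separator only is acceptable, is np.char.partition. This is my suggestion rather than something from the answer above; the sketch assumes you only need the text before and after the first ':'.

```python
import numpy as np

ff = np.array([['a:bc', 'd:ef'], ['g:hi', 'j:kl:xyz']])

# partition splits at the first separator only and adds a trailing axis of
# length 3 holding (text before, separator, text after).
parts = np.char.partition(ff, ':')

heads = parts[..., 0]   # text before the first ':'
tails = parts[..., 2]   # everything after the first ':'
print(parts.shape)
print(heads)
print(tails)
```

The result already has a string dtype, so no astype step is needed, but note that 'j:kl:xyz' yields 'kl:xyz' as its tail instead of two further pieces.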

===

With a chararray, all of these np.char functions are available as methods.

In [59]: np.char.asarray(ff)                                                         
Out[59]: 
chararray([['a:bc', 'd:ef'],
           ['g:hi', 'j:kl:xyz']], dtype='<U8')
In [60]: np.char.asarray(ff).split(':')                                              
Out[60]: 
array([[list(['a', 'bc']), list(['d', 'ef'])],
       [list(['g', 'hi']), list(['j', 'kl', 'xyz'])]], dtype=object)

See the note in the np.char documentation:

The chararray class exists for backwards compatibility with Numarray, it is not recommended for new development. Starting from numpy 1.4, if one needs arrays of strings, it is recommended to use arrays of dtype object_, string_ or unicode_, and use the free functions in the numpy.char module for fast vectorized string operations.
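
In line with that note, here is a short sketch of a few other free functions applied to an ordinary string array, with no chararray involved. The selection of functions is mine, chosen only to illustrate the pattern.

```python
import numpy as np

ff = np.array([['a:bc', 'd:ef'], ['g:hi', 'j:kl']])

# Each free function works elementwise on a plain ndarray of strings.
upper = np.char.upper(ff)                # uppercase every element
colon_pos = np.char.find(ff, ':')        # index of the first ':' per element
dashed = np.char.replace(ff, ':', '-')   # replace ':' with '-'

print(upper)
print(colon_pos)
print(dashed)
```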
