Replace values in a pandas series from list of tuples

Question

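I want to replace the values in a pandas Series using a list of tuples: whenever a Series value starts with the first element of a tuple, it should be replaced with the second element of that tuple. Here is an example: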

import pandas as pd

a = pd.Series(['New Delhi', 'Old Bombay', None, 'Banaras'])

b = [('New','Old'), ('Old','New'),('Banaras','Varanasi'), ('abc','xyz')]

Required output:
0         Old
1         New
2        None
3    Varanasi
dtype: object

I tried the approach below. It works, but it is slow because apply loses the benefit of vectorization.

def test(x):
    if x is not None:
        for i in b:
            if x.startswith(i[0]):
                return i[1]
        return x
    return x


a.apply(test)

I also tried a list comprehension, which works but is still slow.

pd.Series([test(x) for x in a])

Is there a better way to achieve this without losing the benefit of vectorization?

Answer 1

I am not sure whether this is faster, but here is an alternative:

a.str.partition(' ').iloc[:,0].replace(*zip(*b))

Result:

0         Old
1         New
2        None
3    Varanasi
Name: 0, dtype: object
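
Note that the one-liner above keeps only the first word of any value that matches no prefix (for example, 'Other Town' would become 'Other'). Below is a minimal sketch of a variant that falls back to the original value instead. It assumes the prefixes are unique and are always the whole first word. The names `mapping` and `first_word` and the extra 'Other Town' row are illustrative and not part of the answer.

```python
import pandas as pd

a = pd.Series(['New Delhi', 'Old Bombay', None, 'Banaras', 'Other Town'])
b = [('New', 'Old'), ('Old', 'New'), ('Banaras', 'Varanasi'), ('abc', 'xyz')]

# Assumption: each prefix appears once, so a plain dict is a faithful lookup table.
mapping = dict(b)

# Take everything before the first space (missing values stay missing).
first_word = a.str.partition(' ')[0]

# Look up the first word; where there is no match, fall back to the original value.
result = first_word.map(mapping).fillna(a)
print(result)
```

The missing entry stays missing, and values without a matching prefix are returned unchanged.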

Update, for prefixes that contain spaces:

to_replace,value = zip(*b)
to_replace = [f'^{x}.*$' for x in to_replace]
a.replace(to_replace, value, regex=True)

Example:

a = pd.Series(['New Delhi', 'Old Bombay', None, 'Banaras', 'Greater city'])
b = [('New','Old'), ('Old','New'),('Banaras','Varanasi'), ('abc','xyz'), ('Greater city', 'Great' )]

Result:

0         Old
1         New
2        None
3    Varanasi
4       Great
dtype: object
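
The regex version above interpolates each prefix into a pattern as-is, so a prefix containing a regex metacharacter such as `.` or `(` would be interpreted as regex syntax. Below is a hedged sketch that escapes the prefixes with `re.escape` first. The 'St.' prefix and the sample values are made up for illustration.

```python
import re
import pandas as pd

a = pd.Series(['New Delhi', 'St. Louis', 'Stx Louis', None])
b = [('New', 'Old'), ('St.', 'Saint')]  # hypothetical data; 'St.' contains a metacharacter

# Escape each prefix so that '.' means a literal dot, then anchor it at the start.
to_replace = [f'^{re.escape(prefix)}.*$' for prefix, _ in b]
value = [new for _, new in b]

out = a.replace(to_replace, value, regex=True)
print(out)
```

Without the escaping, the pattern `^St..*$` would also match 'Stx Louis'. With it, only 'St. Louis' is replaced.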

Answer 2

After a lot of effort and research online, I learned that np.select is a better approach and is vectorized. The following solution performs well when the Series (a) is large enough:

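import numpy as np

a = pd.Series(np.random.choice(['New Delhi', 'Old Bombay', None, 'Banaras'], replace=True, size=1000000))

# One boolean mask per prefix; the comparison is case-sensitive so that 'New' matches 'New Delhi'
case = [a.str.startswith(i[0], na=False) for i in b]

# Replacement values, in the same order as the masks
replace = [i[1] for i in b]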

%timeit -n10 -r10 pd.Series(np.select(case, replace, default=a))
10 loops, best of 10: 75.3 ms per loop
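
As a sanity check, the same np.select idea can be run on the small example from the question. This is a minimal sketch. The names `conditions` and `choices` are illustrative, and the comparison is case-sensitive on the assumption that the prefixes in b are spelled exactly as they appear in the data.

```python
import numpy as np
import pandas as pd

a = pd.Series(['New Delhi', 'Old Bombay', None, 'Banaras'])
b = [('New', 'Old'), ('Old', 'New'), ('Banaras', 'Varanasi'), ('abc', 'xyz')]

# One boolean mask per prefix; na=False treats missing values as "no match".
conditions = [a.str.startswith(prefix, na=False) for prefix, _ in b]
choices = [new for _, new in b]

# np.select picks the choice of the first condition that is True for each row
# and falls back to the original value when no condition matches.
result = pd.Series(np.select(conditions, choices, default=a), index=a.index)
print(result)
```

Because np.select uses the first matching condition, the order of the tuples in b matters in the same way as in the loop-based test function.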
