Find the first value in a larger column greater than or equal to each value in a shorter search column

Question

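I have been trying to find a vectorized way to get, for each value in a shorter column (~9k rows), the index of the first value in a large column (> 500k rows) that is greater than or equal to it.

Currently I loop over every value in the shorter column and compare it against the entire larger column, so the number of iterations equals the length of the shorter column.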

import numpy as np
import pandas as pd

np.random.seed(2)

veclong = np.random.randint(0, 1000, 100000)
vecshort = np.random.randint(0, 1000, 500)
dfShort = pd.DataFrame(data=vecshort/10000, columns=['Short'])
dfLong = pd.DataFrame(data=veclong/10000, columns=['Long'])

c1 = len(dfShort)

out2 = []
for n1 in range(c1):
    val = dfShort['Short'].iloc[n1]
    # Keep only the rows of dfLong that are >= val; the first one is the match
    dfAns = dfLong[dfLong >= val].dropna()
    ans = dfAns['Long'].iloc[0]
    idx = dfAns.index[0]
    out2.append([ans, idx])

# Column 0 holds the matched value, column 1 its index in dfLong
out2 = np.asarray(out2)
dfShort['Location'] = out2[:, 1]
dfShort['Value'] = out2[:, 0]

Answer 1

You could consider something like the following:

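def myfunc(x):
    # Index label of the first row in dfLong whose value is >= x,
    # or None when no row qualifies
    try:
        return dfLong[dfLong.Long >= x].index[0]
    except IndexError:
        return None

dfShort['Location'] = dfShort.Short.apply(myfunc)
# Look the value up by label; int() assumes dfLong keeps its default integer index
dfShort['Value'] = dfShort.Location.apply(lambda x: dfLong.loc[int(x), 'Long'] if pd.notna(x) else None)
print(dfShort.head())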

Output

+----+---------+-----------+--------+
|    | Short   | Location  | Value  |
+----+---------+-----------+--------+
| 0  | 0.0636  |       10  | 0.0674 |
| 1  | 0.0876  |       27  | 0.0938 |
| 2  | 0.0799  |       16  | 0.0831 |
| 3  | 0.0977  |       95  | 0.0997 |
| 4  | 0.0602  |       10  | 0.0674 |
+----+---------+-----------+--------+
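
The apply-based approach above still scans dfLong once per search value. As a possible fully vectorized alternative (not part of the original answer), one could take the running maximum of the long column, which is non-decreasing, and binary-search it with np.searchsorted: the first position where the running maximum reaches a value is also the first position where the column itself does. The sketch below is a minimal illustration under a few assumptions: the helper name first_ge is hypothetical, the long column contains no NaN, the returned locations are positions (they match index labels only for a default integer index), and -1 / NaN mark search values that have no match.

```python
import numpy as np
import pandas as pd


def first_ge(long_values, short_values):
    """For each search value, find the first element of long_values that is >= it.

    Hypothetical helper: returns (positions, values), with position -1 and
    value NaN where no element of long_values qualifies.
    """
    long_arr = np.asarray(long_values, dtype=float)
    short_arr = np.asarray(short_values, dtype=float)
    # The running maximum is non-decreasing, so it can be binary-searched.
    running_max = np.maximum.accumulate(long_arr)
    # side="left" gives the first position whose running max is >= the value.
    pos = np.searchsorted(running_max, short_arr, side="left")
    found = pos < len(long_arr)
    values = np.full(short_arr.shape, np.nan)
    values[found] = long_arr[pos[found]]
    return np.where(found, pos, -1), values


# Same data-generation order as in the question
np.random.seed(2)
dfLong = pd.DataFrame({"Long": np.random.randint(0, 1000, 100000) / 10000})
dfShort = pd.DataFrame({"Short": np.random.randint(0, 1000, 500) / 10000})

loc, val = first_ge(dfLong["Long"].to_numpy(), dfShort["Short"].to_numpy())
dfShort["Location"] = loc
dfShort["Value"] = val
print(dfShort.head())
```

This does one pass over the long column plus one binary search per search value, instead of a full comparison per search value, so it should scale better to the sizes mentioned in the question.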
