常用词高效查找

Question

我有一个分为单词的名称（字符串）列表。有 800 万个名字，每个名字最多由 20 个单词（token）组成。唯一代币数量为 220 万。我需要一种有效的方法来从查询中查找包含至少一个单词的所有名称（可能还包含最多 20 个单词，但通常只有几个）。

我目前的方法使用 Python Pandas 并且看起来像这样（后来称为 original）：

>>> df = pd.DataFrame([['foo', 'bar', 'joe'], 
                       ['foo'], 
                       ['bar', 'joe'], 
                       ['zoo']], 
                      index=['id1', 'id2', 'id3', 'id4'])
>>> df.index.rename('id', inplace=True)  # btw, is there a way to include this into prev line?
>>> print df

       0     1     2
id                  
id1  foo   bar   joe
id2  foo  None  None
id3  bar   joe  None
id4  zoo  None  None

def filter_by_tokens(df, tokens):
    # search within each column and then concatenate and dedup results    
    results = [df.loc[lambda df: df[i].isin(tokens)] for i in range(df.shape[1])]
    return pd.concat(results).reset_index().drop_duplicates().set_index(df.index.name)

>>> print filter_by_tokens(df, ['foo', 'zoo'])

       0     1     2
id                  
id1  foo   bar   joe
id2  foo  None  None
id4  zoo  None  None

目前这种查找（在完整数据集上）在我的（相当强大的）机器上需要 5.75 秒。我想至少加速 10 倍。

通过将所有列压缩为一个列并对其执行查找（后来称为 original, squeezed），我能够达到 5.29s：

>>> df = pd.Series([{'foo', 'bar', 'joe'}, 
                    {'foo'}, 
                    {'bar', 'joe'}, 
                    {'zoo'}], 
                    index=['id1', 'id2', 'id3', 'id4'])
>>> df.index.rename('id', inplace=True)
>>> print df

id
id1    {foo, bar, joe}
id2              {foo}
id3         {bar, joe}
id4              {zoo}
dtype: object

def filter_by_tokens(df, tokens):
    return df[df.map(lambda x: bool(x & set(tokens)))]

>>> print filter_by_tokens(df, ['foo', 'zoo'])

id
id1    {foo, bar, joe}
id2              {foo}
id4              {zoo}
dtype: object

但这仍然不够快。

另一个似乎很容易实现的解决方案是使用 Python 多处理（由于 GIL，线程在这里应该没有帮助而且没有 I/O，对吧？）。但它的问题是大数据帧需要复制到每个进程，占用所有内存。另一个问题是我需要在一个循环中多次调用 filter_by_tokens，所以它会在每次调用时复制数据帧，这是低效的。

请注意，单词可能会在名称中出现多次（例如，最流行的单词在名称中出现 60 万次），因此反向索引会很大。

有什么好的方法可以有效地编写它？ Python 首选解决方案，但我也对其他语言和技术（例如数据库）持开放态度。

更新： 我已经测量了我的两个解决方案和@piRSquared 在他的中建议的 5 个解决方案的执行时间。以下是结果（tl;dr 最好的是 2 倍的改进）：

+--------------------+----------------+
|       method       | best of 3, sec |
+--------------------+----------------+
| original           | 5.75           |
| original, squeezed | 5.29           |
| zip                | 2.54           |
| merge              | 8.87           |
| mul+any            | MemoryError    |
| isin               | IndexingError  |
| query              | 3.7            |
+--------------------+----------------+

mul+any 在 d1 = pd.get_dummies(df.stack()).groupby(level=0).sum()（在 128Gb RAM 机器上）上给出 MemoryError。

isin 在 s[d1.isin({'zoo', 'foo'}).unstack().any(1)] 上给出 IndexingError: Unalignable boolean Series key provided，显然是因为 df.stack().isin(set(tokens)).unstack() 的形状略小于原始数据帧的形状（8.39M 对 8.41M 行），不确定为什么以及如何解决这个问题。

请注意，我使用的机器有 12 个内核（尽管我在上面提到了并行化的一些问题）。所有解决方案都使用单核。

结论（截至目前）：zip（2.54 秒）与 original squeezed 解决方案（5.29 秒）相比提高了 2.1 倍。这很好，但如果可能的话，我的目标是至少提高 10 倍。所以我暂时不接受（仍然很棒的）@piRSquared 的回答，欢迎更多的建议。

Answer 1

想法0
zip

def pir(s, token):
    return s[[bool(p & token) for p in s]]

pir(s, {'foo', 'zoo'})

想法 1
merge

token = pd.DataFrame(dict(v=['foo', 'zoo']))
d1 = df.stack().reset_index('id', name='v')
s.ix[d1.merge(token).id.unique()]

想法2
mul + any

d1 = pd.get_dummies(df.stack()).groupby(level=0).sum()
token = pd.Series(1, ['foo', 'zoo'])
s[d1.mul(token).any(1)]

想法3
isin

d1 = df.stack()
s[d1.isin({'zoo', 'foo'}).unstack().any(1)]

想法4
query

token = ('foo', 'zoo')
d1 = df.stack().to_frame('s')
s.ix[d1.query('s in @token').index.get_level_values(0).unique()]

Answer 2

我用以下工具做过类似的事情

Hbase - Key can have Multiple columns (Very Fast)
ElasticSearch - Nice easy to scale. You just need to import your data as JSON

Apache Lucene - 非常适合 800 万条记录

Answer 3

你可以用反向索引来做； pypy 中运行下面的代码在 57 秒内建立了索引，查询或 20 个单词需要 0.00018 秒并使用了大约 3.2Gb 内存。 Python 2.7 在 158 秒内建立索引并在 0.0013 秒内使用大约 3.41Gb 内存进行查询。

最快的方法是使用位图反向索引，压缩以保存 space。

"""
8m records with between 1 and 20 words each, selected at random from 100k words
Build dictionary of sets, keyed by word number, set contains nos of all records
with that word
query merges the sets for all query words
"""
import random
import time   records = 8000000
words = 100000
wordlists = {}
print "build wordlists"
starttime = time.time()
wordlimit = words - 1
total_words = 0
for recno in range(records):
    for x in range(random.randint(1,20)):
        wordno = random.randint(0,wordlimit)
        try:
           wordlists[wordno].add(recno)
        except:
           wordlists[wordno] = set([recno])
        total_words += 1
print "build time", time.time() - starttime, "total_words", total_words
querylist = set()
query = set()
for x in range(20):
    while 1:
        wordno = (random.randint(0,words))
        if  wordno in wordlists: # only query words that were used
            if  not wordno in query:
                query.add(wordno)
                break
print "query", query
starttime = time.time()
for wordno in query:
    querylist.union(wordlists[wordno])
print "query time", time.time() - starttime
print "count = ", len(querylist)
for recno in querylist:
    print "record", recno, "matches"

Answer 4

也许我的第一个回答有点抽象；在没有真实数据的情况下，它生成的随机数据约为。音量要求了解查询时间。这段代码很实用

data =[['foo', 'bar', 'joe'],
       ['foo'],
       ['bar', 'joe'],
       ['zoo']]

wordlists = {}
print "build wordlists"
for x, d in enumerate(data):
    for word in d:
        try:
           wordlists[word].add(x)
        except:
           wordlists[word] = set([x])
print "query"
query = [ "foo", "zoo" ]
results = set()
for q in query:
    wordlist = wordlists.get(q)
    if  wordlist:
        results = results.union(wordlist)
l = list(results)
l.sort()
for x in l:
    print data[x]

时间和内存的成本是建立词表（倒排索引）；查询几乎是免费的。你有 12 核机器，所以大概它有足够的内存。为了可重复性，构建单词列表，pickle 每个单词列表并写入 sqlite 或任何 key/value 数据库，将单词作为键并将 pickle 集作为二进制 blob。那么你只需要：

initialise_database()
query = [ "foo", "zoo" ]
results = set()                             
for q in query:                             
    wordlist = get_wordlist_from_database(q) # get binary blob and unpickle
    if  wordlist:                        
        results = results.union(wordlist)
l = list(results)
l.sort()   
for x in l:      
    print data[x]

或者，使用数组，内存效率更高，并且可能更快地建立索引。 pypy 比 2.7

快 10 倍以上

import array

data =[['foo', 'bar', 'joe'],
       ['foo'],
       ['bar', 'joe'],
       ['zoo']]

wordlists = {}
print "build wordlists"
for x, d in enumerate(data):
    for word in d:
        try:
           wordlists[word].append(x)
        except:
           wordlists[word] = array.array("i",[x])
print "query"
query = [ "foo", "zoo" ]
results = set()
for q in query:
    wordlist = wordlists.get(q)
    if  wordlist:
        for i in wordlist:
            results.add(i)
l = list(results)
l.sort()
for x in l:
    print data[x]

Answer 5

如果您知道您将看到的唯一标记的数量相对较少，您可以很容易地构建一个有效的位掩码来查询匹配项。

天真的方法（最初的 post）将允许最多 64 个不同的标记。

下面的改进代码使用像布隆过滤器一样的位掩码（设置位环绕 64 的模块化算法）。如果超过64个唯一令牌，会出现一些误报，下面的代码会自动验证（使用原始代码）。

现在，如果唯一令牌的数量（远）大于 64，或者如果您特别不走运，最坏情况下的性能将会下降。哈希可以缓解这种情况。

就性能而言，使用下面的基准数据集，我得到：

原码：4.67秒

位掩码：0.30 秒

但是，当唯一标记的数量增加时，位掩码代码仍然有效，而原始代码的速度会大大降低。有了大约 70 个独特的标记，我得到类似的东西：

原码：~15秒

位掩码：0.80 秒

注意：对于后一种情况，从提供的列表构建位掩码数组所花费的时间与构建数据帧所花费的时间差不多。构建数据框可能没有真正的理由；保留它主要是为了便于与原始代码进行比较。

class WordLookerUpper(object):
    def __init__(self, token_lists):
        tic = time.time()
        self.df = pd.DataFrame(token_lists,
                    index=pd.Index(
                        data=['id%d' % i for i in range(len(token_lists))],
                        name='index'))
        print('took %d seconds to build dataframe' % (time.time() - tic))
        tic = time.time()
        dii = {}
        iid = 0
        self.bits = np.zeros(len(token_lists), np.int64)
        for i in range(len(token_lists)):
            for t in token_lists[i]:
                if t not in dii:
                    dii[t] = iid
                    iid += 1
                # set the bit; note that b = dii[t] % 64
                # this 'wrap around' behavior lets us use this
                # bitmask as a probabilistic filter
                b = dii[t]
                self.bits[i] |= (1 << b)
        self.string_to_iid = dii
        print('took %d seconds to build bitmask' % (time.time() - tic))

    def filter_by_tokens(self, tokens, df=None):
        if df is None:
            df = self.df
        tic = time.time()
        # search within each column and then concatenate and dedup results    
        results = [df.loc[lambda df: df[i].isin(tokens)] for i in range(df.shape[1])]
        results = pd.concat(results).reset_index().drop_duplicates().set_index('index')
        print('took %0.2f seconds to find %d matches using original code' % (
                time.time()-tic, len(results)))
        return results

    def filter_by_tokens_with_bitmask(self, search_tokens):
        tic = time.time()
        bitmask = np.zeros(len(self.bits), np.int64)
        verify = np.zeros(len(self.bits), np.int64)
        verification_needed = False
        for t in search_tokens:
            bitmask |= (self.bits & (1<<self.string_to_iid[t]))
            if self.string_to_iid[t] > 64:
                verification_needed = True
                verify |= (self.bits & (1<<self.string_to_iid[t]))
        if verification_needed:
            results = self.df[(bitmask > 0 & ~verify.astype(bool))]
            results = pd.concat([results,
                                 self.filter_by_tokens(search_tokens,
                                    self.df[(bitmask > 0 & verify.astype(bool))])])
        else:
            results = self.df[bitmask > 0]
        print('took %0.2f seconds to find %d matches using bitmask code' % (
                time.time()-tic, len(results)))
        return results

做一些测试数据

unique_token_lists = [
    ['foo', 'bar', 'joe'], 
    ['foo'], 
    ['bar', 'joe'], 
    ['zoo'],
    ['ziz','zaz','zuz'],
    ['joe'],
    ['joey','joe'],
    ['joey','joe','joe','shabadoo']
]
token_lists = []
for n in range(1000000):
    token_lists.extend(unique_token_lists)

运行原码和位掩码

>>> wlook = WordLookerUpper(token_lists)
took 5 seconds to build dataframe
took 10 seconds to build bitmask

>>> wlook.filter_by_tokens(['foo','zoo']).tail(n=1)
took 4.67 seconds to find 3000000 matches using original code
id7999995   zoo None    None    None

>>> wlook.filter_by_tokens_with_bitmask(['foo','zoo']).tail(n=1)
took 0.30 seconds to find 3000000 matches using bitmask code
id7999995   zoo None    None    None

常用词高效查找

Efficient lookup by common words

python

lookup

performance

hashtable

pandas