查找最频繁项目的最有效数据结构

Question

我想从 Google N-Grams dataset 中提取最常用的单词，它的未压缩形式约为 20 GB。我不想使用整个数据集，只使用最常见的 5000 个。但是如果我写

take 5000 $ sortBy (flip $ comparing snd) dataset
-- dataset :: IO [(word::String, frequency::Int)]

这将是一个无尽的等待。但是我应该怎么做呢？

我知道有Data.Array.MArray package available for in-place array computation, but I cannot see any function for items modification on its documentation page. There is also Data.HashTable.IO，但它是无序数据结构。

我想使用简单的 Data.IntMap.Strict（具有方便的 lookupLE 功能），但我认为它不会非常有效，因为它会在每次更改时生成新地图。 ST monad 可以改进吗？

UPD: 我还在 CoreReview.SX 上发布了程序的最终版本。

Answer 1

怎么样

使用splitAt将数据集划分为前 5000 项和其余项。
按频率（升序）对前 5000 项进行排序
完成其余部分
- 如果某个项目的频率高于已排序项目中的最低频率
- 从已排序的项目中删除频率最低的项目
- 将新项目插入已排序项目中的适当位置

该过程随后变为有效线性，但如果您对已排序的 5000 个元素使用具有次线性最小删除和插入的数据结构，则系数会提高。

例如，使用Data.Heap from the heap package:

import Data.List (foldl')
import Data.Maybe (fromJust)
import Data.Heap hiding (splitAt)

mostFreq :: Int -> [(String, Int)] -> [(String, Int)]
mostFreq n dataset = final
  where
    -- change our pairs from (String,Int) to (Int,String)
    pairs = map swap dataset
    -- get the first `n` pairs in one list, and the rest of the pairs in another
    (first, rest) = splitAt n pairs
    -- put all the first `n` pairs into a MinHeap
    start = fromList first :: MinHeap (Int, String)
    -- then run through the rest of the pairs
    stop = foldl' step start rest
    -- modifying the heap to replace its least frequent pair
    -- with the new pair if the new pair is more frequent
    step heap pair = if viewHead heap < Just pair
                       then insert pair (fromJust $ viewTail heap)
                       else heap
    -- turn our heap of (Int, String) pairs into a list of (String,Int) pairs
    final = map swap (toList stop)
    swap ~(a,b) = (b,a)

Answer 2

你试过这个还是你只是猜测？因为许多 Haskell 排序函数尊重惰性，当您只要求前 5000 个时，它们会很乐意避免对其余元素进行排序。

同样，对"it produces a new map on each alteration"要非常小心。大多数插入操作在这种数据结构上都是 O(log n)，n 限制为 5000：因此您可能会在每次更改时在堆中分配 ~30 个新单元，但这当然不是特别大的成本没有 5000 那么大。

如果 Data.List.sort 效果不佳，您想要的是：

import Data.List (foldl')
import Data.IntMap.Strict (IntMap)
import qualified Data.IntMap.Strict as IM

type Freq = Int
type Count = Int
data Summarizer x = Summ {tracking :: !IntMap [x], least :: !Freq, 
                        size :: !Count, size_of_least :: !Count }

inserting :: x -> Maybe [x] -> Maybe [x]
inserting x Nothing = Just [x]
inserting x (Just xs) = Just (x:xs)

sizeLimit :: Summarizer x -> Summarizer x
sizeLimit skip@(Summ strs f_l tot lst) 
    | tot - lst < 5000 = skip
    | otherwise        = Summ strs' f_l' tot' lst'
        where (discarded, strs') = IM.deleteFindMin strs
              (f_l', new_least) = IM.findMin dps'
              tot' = tot - length discarded
              lst' = length new_least

addEl :: (x, Freq) -> Summarizer x -> Summarizer x
addEl (str, f) skip@(Summ strs f_l tot lst)
    | i < f_l && tot >= 5000 = skip
    | otherwise              = sizeLimit $ Summ strs' f_l' tot' lst'
        where strs' = IM.alter (inserting str) f strs
              tot' = tot + 1
              f_l' = min f_l f
              lst' = case compare f_l f of LT -> lst; EQ -> lst + 1; GT -> 1

请注意，我们存储字符串列表以处理重复频率；我们大多跳过更新，当我们 do 更新时，它是一个 O(log n) 操作来放入新元素，有时（再次取决于复制）一个 O(log n) 操作来剪掉最小的元素，然后进行 O(log n) 操作以找到新的最小元素。

查找最频繁项目的最有效数据结构

Most efficient data structure for finding most frequent items

sorting

haskell

data-structures

word-frequency