Running out of memory while counting characters in a large file

I want to count how many times each character occurs in a large file. I know the counting has to be done strictly in Haskell (I tried to do that with foldl'), but I still run out of memory. For scale: the file is about 2 GB, while the machine has 100 GB of RAM. There are not many distinct characters in the file, maybe 20. What am I doing wrong?

import System.Environment (getArgs)
import Data.List (foldl')

-- increment the count for a character in an association list of (Char, count) pairs
ins :: [(Char,Int)] -> Char -> [(Char,Int)]
ins [] c = [(c,1)]
ins ((c,i):cs) d
    | c == d = (c,i+1):cs
    | otherwise = (c,i) : ins cs d

main = do
    [file] <- getArgs
    txt <- readFile file
    print $ foldl' ins [] txt

Your ins function is building up a lot of thunks, and that is what leaks memory. foldl' only evaluates the accumulator to weak head normal form, which is not enough here; you would need deepseq from Control.DeepSeq to force it all the way to normal form.
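For illustration, here is a minimal sketch of that approach: it reuses the ins function from the question and forces the accumulator to normal form on every step with force (this assumes the deepseq package is available):

import Control.DeepSeq (force)
import Data.List (foldl')
import System.Environment (getArgs)

-- ins exactly as in the question
ins :: [(Char,Int)] -> Char -> [(Char,Int)]
ins [] c = [(c,1)]
ins ((c,i):cs) d
    | c == d = (c,i+1):cs
    | otherwise = (c,i) : ins cs d

main :: IO ()
main = do
    [file] <- getArgs
    txt <- readFile file
    -- force rebuilds the whole association list, so no chain of (+1) thunks survives
    print $ foldl' (\acc c -> force (ins acc c)) [] txt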

Alternatively, use Data.Map.Strict for the counting instead of an association list. Also, if your I/O is on the order of 2 GB, you are better off with a lazy ByteString than with a plain String.

The following code should run in constant memory space regardless of the input size:

import System.Environment (getArgs)
import Data.Map.Strict (empty, alter)
import qualified Data.ByteString.Lazy.Char8 as B

main :: IO ()
main = getArgs >>= B.readFile . head >>= print . B.foldl' go empty
  where
  -- alter takes the key before the map; flip it so the Map comes first,
  -- matching the accumulator-first shape that B.foldl' expects
  go = flip $ alter inc
  -- start at 1 for an unseen character, otherwise bump the existing count
  inc :: Maybe Int -> Maybe Int
  inc Nothing  = Just 1
  inc (Just i) = Just $ i + 1
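As a side note, an equivalent formulation (just a sketch) uses insertWith from Data.Map.Strict, which also evaluates the combined value before storing it, and avoids the flip/alter plumbing:

import System.Environment (getArgs)
import qualified Data.Map.Strict as M
import qualified Data.ByteString.Lazy.Char8 as B

main :: IO ()
main = do
    [file] <- getArgs
    bytes <- B.readFile file
    -- insertWith (+) forces the updated count, so the map never holds thunks
    print $ B.foldl' (\m c -> M.insertWith (+) c (1 :: Int) m) M.empty bytes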