Running out of memory while counting characters in a large file

I want to count how many times each character occurs in a large file. I know the counting has to be done strictly in Haskell (I tried to do that with foldl'), but I still run out of memory. For scale: the file is about 2 GB, while the machine has 100 GB of RAM. There are not many distinct characters in the file, maybe 20. What am I doing wrong?

import System.Environment (getArgs)
import Data.List (foldl')

-- increment the count for a character in an association list of (Char, count) pairs
ins :: [(Char,Int)] -> Char -> [(Char,Int)]
ins [] c = [(c,1)]
ins ((c,i):cs) d
    | c == d = (c,i+1):cs
    | otherwise = (c,i) : ins cs d

main = do
    [file] <- getArgs
    txt <- readFile file
    print $ foldl' ins [] txt

Your ins function is building up a lot of thunks, and that is what leaks memory. foldl' only evaluates the accumulator to weak head normal form, which is not enough here; you would need deepseq from Control.DeepSeq to force it all the way to normal form.
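For illustration, here is a minimal sketch of that approach: it reuses the ins function from the question and forces the accumulator to normal form on every step with force (this assumes the deepseq package is available):

import Control.DeepSeq (force)
import Data.List (foldl')
import System.Environment (getArgs)

-- ins exactly as in the question
ins :: [(Char,Int)] -> Char -> [(Char,Int)]
ins [] c = [(c,1)]
ins ((c,i):cs) d
    | c == d = (c,i+1):cs
    | otherwise = (c,i) : ins cs d

main :: IO ()
main = do
    [file] <- getArgs
    txt <- readFile file
    -- force rebuilds the whole association list, so no chain of (+1) thunks survives
    print $ foldl' (\acc c -> force (ins acc c)) [] txt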

Alternatively, use Data.Map.Strict for the counting instead of an association list. Also, if your I/O is on the order of 2 GB, you are better off with a lazy ByteString than with a plain String.

The following code should run in constant memory space regardless of the input size:

import System.Environment (getArgs)
import Data.Map.Strict (empty, alter)
import qualified Data.ByteString.Lazy.Char8 as B

main :: IO ()
main = getArgs >>= B.readFile . head >>= print . B.foldl' go empty
  where
  -- alter takes the key before the map; flip it so the Map comes first,
  -- matching the accumulator-first shape that B.foldl' expects
  go = flip $ alter inc
  -- start at 1 for an unseen character, otherwise bump the existing count
  inc :: Maybe Int -> Maybe Int
  inc Nothing  = Just 1
  inc (Just i) = Just $ i + 1
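As a side note, an equivalent formulation (just a sketch) uses insertWith from Data.Map.Strict, which also evaluates the combined value before storing it, and avoids the flip/alter plumbing:

import System.Environment (getArgs)
import qualified Data.Map.Strict as M
import qualified Data.ByteString.Lazy.Char8 as B

main :: IO ()
main = do
    [file] <- getArgs
    bytes <- B.readFile file
    -- insertWith (+) forces the updated count, so the map never holds thunks
    print $ B.foldl' (\m c -> M.insertWith (+) c (1 :: Int) m) M.empty bytes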