Reduce memory usage of a Haskell program

I have the following program in Haskell:

processDate :: String -> IO ()
processDate date = do
    ...
    let newFlattenedPropertiesWithPrice = filter (notYetInserted date existingProperties) flattenedPropertiesWithPrice
    geocodedProperties <- propertiesWithGeocoding newFlattenedPropertiesWithPrice

propertiesWithGeocoding :: [ParsedProperty] -> IO [(ParsedProperty, Maybe LatLng)]
propertiesWithGeocoding properties = do
    let addresses = fmap location properties
    let batchAddresses = chunksOf 100 addresses
    batchGeocodedLocations <- mapM geocodeAddresses batchAddresses
    let geocodedLocations = fromJust $ concat <$> sequence batchGeocodedLocations
    return (zip properties geocodedLocations)

geocodeAddresses :: [String] -> IO (Maybe [Maybe LatLng])
geocodeAddresses addresses = do
    mapQuestKey <- getEnv "MAP_QUEST_KEY"
    geocodeResponse <- openURL $ mapQuestUrl mapQuestKey addresses
    return $ geocodeResponseToResults geocodeResponse

geocodeResponseToResults :: String -> Maybe [Maybe LatLng]
geocodeResponseToResults inputResponse =
    latLangs
    where
        decodedResponse :: Maybe GeocodingResponse
        decodedResponse = decodeGeocodingResponse inputResponse

        latLangs = fmap (fmap geocodingResultToLatLng . results) decodedResponse

decodeGeocodingResponse :: String -> Maybe GeocodingResponse
decodeGeocodingResponse inputResponse = Data.Aeson.decode (fromString inputResponse) :: Maybe GeocodingResponse  

It reads a list of properties (houses and apartments) from html files, parses them, geocodes the addresses and saves the results into an sqlite database.
Everything works fine except for very high memory usage (around 800M).
By commenting out parts of the code I have established that the problem is in the geocoding step.
I send 100 addresses at a time to the MapQuest api (https://developer.mapquest.com/documentation/geocoding-api/batch/get/).
The response for 100 addresses is quite massive, so it might be one of the culprits, but 800M? I feel like it holds on to all the results until the end, which drives memory usage this high.

After commenting out the geocoding part of the program, memory usage is around 30M, which is fine.

You can get a full version which reproduces the issue here: https://github.com/Leonti/haskell-memory-so

I'm new to Haskell, so I'm not sure how to optimize this.
Any ideas?

Cheers!
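That suspicion matches the types: `mapM geocodeAddresses` cannot return until every batch has been fetched, and the later `sequence` must inspect the entire list before `fromJust` can see a result. A minimal model of the pipeline (with a hypothetical `fetch` standing in for `geocodeAddresses`) makes the shape visible:

```haskell
import Data.Maybe (fromJust)

-- Hypothetical stand-in for geocodeAddresses: a whole batch may fail.
fetch :: [Int] -> IO (Maybe [Maybe Int])
fetch batch = pure (Just (map Just batch))

collectAll :: [[Int]] -> IO [Maybe Int]
collectAll batches = do
    -- mapM keeps every batch's result alive until the last one returns ...
    results <- mapM fetch batches              -- :: [Maybe [Maybe Int]]
    -- ... because sequence must see the whole list before it can decide
    -- between Just and Nothing, so nothing is consumed (or freed) early.
    pure (fromJust (concat <$> sequence results))

main :: IO ()
main = collectAll [[1, 2], [3, 4]] >>= print   -- [Just 1,Just 2,Just 3,Just 4]
```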

It might be worth recording that this turned out to be a simple streaming problem arising from the use of `mapM` and `sequence`, which, like `replicateM`, `traverse` and the other ways of "extracting a list from IO", always raise accumulation worries. So a little detour via a streaming library was needed. In the repo it was just necessary to replace

processDate :: String -> IO ()
processDate date = do
    allFiles <- listFiles date
    allProperties <- mapM fileToProperties allFiles
    let flattenedPropertiesWithPrice = filter hasPrice $ concat allProperties
    geocodedProperties <- propertiesWithGeocoding flattenedPropertiesWithPrice
    print geocodedProperties

propertiesWithGeocoding :: [ParsedProperty] -> IO [(ParsedProperty, Maybe LatLng)]
propertiesWithGeocoding properties = do
    let batchProperties = chunksOf 100 properties
    batchGeocodedLocations <- mapM geocodeAddresses batchProperties
    let geocodedLocations = fromJust $ concat <$> sequence batchGeocodedLocations
    return geocodedLocations

with something like this:

import Streaming
import qualified Streaming.Prelude as S

processDate :: String -> IO ()
processDate date = do
    allFiles <- listFiles date   -- we accept an unstreamed list
    S.print $ propertiesWithGeocoding -- this was the main pain point see below
            $ S.filter hasPrice 
            $ S.concat 
            $ S.mapM fileToProperties -- this mapM doesn't accumulate
            $ S.each allFiles    -- the list is converted to a stream

propertiesWithGeocoding
  :: Stream (Of ParsedProperty) IO r
     -> Stream (Of (ParsedProperty, Maybe LatLng)) IO r
propertiesWithGeocoding properties =  
    S.concat $ S.concat 
             $ S.mapM geocodeAddresses -- this mapM doesn't accumulate results from mapquest
             $ S.mapped S.toList       -- convert segments to haskell lists
             $ chunksOf 100 properties -- this is the streaming `chunksOf`
    -- concat here flattens a stream of lists of as into a stream of as
    -- and a stream of maybe as into a stream of as

Then memory use looks like this, with each spike corresponding to a trip to MapQuest, followed by a little processing and printing, whereupon ghc forgets all about it and moves on:

Of course this could be done with `pipes` or `conduit`, but here we just need a little simple avoidance of `mapM` / `sequence` / `traverse` / `replicateM`, and `streaming` is perhaps the simplest way to make this kind of quick local refactoring. Note that the list here is quite short, so the thought "but surely mapM/traverse/etc. are fine for short lists!" can be quite spectacularly wrong. Why not just get rid of them?

Whenever you are about to write `mapM f` over a list, it is a good idea to consider `S.mapM f . S.each` (or the conduit or pipes equivalent). You will then have a stream and can recover a list with `S.toList` or an equivalent, but most likely, as in this case, you will find you don't need a reified accumulated list at all, and can instead use some streaming process such as printing to a file or stdout, or writing to a database, after whatever list-like manipulations are needed (here we use, e.g., the streaming `filter`, and also `concat` both to flatten the streamed lists and as a kind of `catMaybes`).
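The essence of the refactoring can be seen even without the `streaming` package: replace "build the whole result list, then consume it" with "consume each result as soon as it is produced". A sketch with a hypothetical `geocode` action standing in for `geocodeAddresses` (any `IO` consumer, such as a database write, works the same way):

```haskell
import Control.Monad ((>=>))

-- Hypothetical per-batch action standing in for geocodeAddresses.
geocode :: [String] -> IO [String]
geocode batch = pure (map ("geo:" ++) batch)

-- Accumulating version: the complete result list is alive at once,
-- the same shape that caused the 800M residency above.
accumulating :: [[String]] -> IO ()
accumulating batches = mapM geocode batches >>= mapM_ print

-- Streaming-style version: each batch is fetched, printed and dropped
-- before the next request is made -- the same shape as
-- S.mapM geocode . S.each, written with base alone.
streamed :: [[String]] -> IO ()
streamed = mapM_ (geocode >=> print)

main :: IO ()
main = streamed [["a", "b"], ["c"]]
```

Both versions print the same lines; they differ only in how much of the result is resident at any one moment, which is exactly the difference the memory profile shows.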