Reduce memory usage of a Haskell program

I have the following program in Haskell:

processDate :: String -> IO ()
processDate date = do
    ...
    let newFlattenedPropertiesWithPrice = filter (notYetInserted date existingProperties) flattenedPropertiesWithPrice
    geocodedProperties <- propertiesWithGeocoding newFlattenedPropertiesWithPrice

propertiesWithGeocoding :: [ParsedProperty] -> IO [(ParsedProperty, Maybe LatLng)]
propertiesWithGeocoding properties = do
    let addresses = fmap location properties
    let batchAddresses = chunksOf 100 addresses
    batchGeocodedLocations <- mapM geocodeAddresses batchAddresses
    let geocodedLocations = fromJust $ concat <$> sequence batchGeocodedLocations
    return (zip properties geocodedLocations)

geocodeAddresses :: [String] -> IO (Maybe [Maybe LatLng])
geocodeAddresses addresses = do
    mapQuestKey <- getEnv "MAP_QUEST_KEY"
    geocodeResponse <- openURL $ mapQuestUrl mapQuestKey addresses
    return $ geocodeResponseToResults geocodeResponse

geocodeResponseToResults :: String -> Maybe [Maybe LatLng]
geocodeResponseToResults inputResponse =
    latLangs
    where
        decodedResponse :: Maybe GeocodingResponse
        decodedResponse = decodeGeocodingResponse inputResponse

        latLangs = fmap (fmap geocodingResultToLatLng . results) decodedResponse

decodeGeocodingResponse :: String -> Maybe GeocodingResponse
decodeGeocodingResponse inputResponse = Data.Aeson.decode (fromString inputResponse) :: Maybe GeocodingResponse  

It reads a list of properties (houses and apartments) from html files, parses them, geocodes the addresses and saves the results into an sqlite database.
Everything works fine except for very high memory usage (around 800M).
By commenting out parts of the code I have established that the problem is in the geocoding step.
I send 100 addresses at a time to the MapQuest api (https://developer.mapquest.com/documentation/geocoding-api/batch/get/).
The response for 100 addresses is quite massive, so it might be one of the culprits, but 800M? I feel like it holds on to all the results until the end, which drives memory usage this high.

After commenting out the geocoding part of the program, memory usage is around 30M, which is fine.

You can get a full version which reproduces the issue here: https://github.com/Leonti/haskell-memory-so

I'm new to Haskell, so I'm not sure how to optimize this.
Any ideas?

Cheers!
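That suspicion matches the types: `mapM geocodeAddresses` cannot return until every batch has been fetched, and the later `sequence` must inspect the entire list before `fromJust` can see a result. A minimal model of the pipeline (with a hypothetical `fetch` standing in for `geocodeAddresses`) makes the shape visible:

```haskell
import Data.Maybe (fromJust)

-- Hypothetical stand-in for geocodeAddresses: a whole batch may fail.
fetch :: [Int] -> IO (Maybe [Maybe Int])
fetch batch = pure (Just (map Just batch))

collectAll :: [[Int]] -> IO [Maybe Int]
collectAll batches = do
    -- mapM keeps every batch's result alive until the last one returns ...
    results <- mapM fetch batches              -- :: [Maybe [Maybe Int]]
    -- ... because sequence must see the whole list before it can decide
    -- between Just and Nothing, so nothing is consumed (or freed) early.
    pure (fromJust (concat <$> sequence results))

main :: IO ()
main = collectAll [[1, 2], [3, 4]] >>= print   -- [Just 1,Just 2,Just 3,Just 4]
```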

It might be worth recording that this turned out to be a simple streaming problem arising from the use of `mapM` and `sequence`, which, like `replicateM`, `traverse` and the other ways of "extracting a list from IO", always raise accumulation worries. So a little detour via a streaming library was needed. In the repo it was just necessary to replace

processDate :: String -> IO ()
processDate date = do
    allFiles <- listFiles date
    allProperties <- mapM fileToProperties allFiles
    let flattenedPropertiesWithPrice = filter hasPrice $ concat allProperties
    geocodedProperties <- propertiesWithGeocoding flattenedPropertiesWithPrice
    print geocodedProperties

propertiesWithGeocoding :: [ParsedProperty] -> IO [(ParsedProperty, Maybe LatLng)]
propertiesWithGeocoding properties = do
    let batchProperties = chunksOf 100 properties
    batchGeocodedLocations <- mapM geocodeAddresses batchProperties
    let geocodedLocations = fromJust $ concat <$> sequence batchGeocodedLocations
    return geocodedLocations

with something like this:

import Streaming
import qualified Streaming.Prelude as S

processDate :: String -> IO ()
processDate date = do
    allFiles <- listFiles date   -- we accept an unstreamed list
    S.print $ propertiesWithGeocoding -- this was the main pain point see below
            $ S.filter hasPrice 
            $ S.concat 
            $ S.mapM fileToProperties -- this mapM doesn't accumulate
            $ S.each allFiles    -- the list is converted to a stream

propertiesWithGeocoding
  :: Stream (Of ParsedProperty) IO r
     -> Stream (Of (ParsedProperty, Maybe LatLng)) IO r
propertiesWithGeocoding properties =  
    S.concat $ S.concat 
             $ S.mapM geocodeAddresses -- this mapM doesn't accumulate results from mapquest
             $ S.mapped S.toList       -- convert segments to haskell lists
             $ chunksOf 100 properties -- this is the streaming `chunksOf`
    -- concat here flattens a stream of lists of as into a stream of as
    -- and a stream of maybe as into a stream of as

Then memory use looks like this, with each spike corresponding to a trip to MapQuest, followed by a little processing and printing, whereupon ghc forgets all about it and moves on:

Of course this could be done with `pipes` or `conduit`, but here we just need a little simple avoidance of `mapM` / `sequence` / `traverse` / `replicateM`, and `streaming` is perhaps the simplest way to make this kind of quick local refactoring. Note that the list here is quite short, so the thought "but surely mapM/traverse/etc. are fine for short lists!" can be quite spectacularly wrong. Why not just get rid of them?

Whenever you are about to write `mapM f` over a list, it is a good idea to consider `S.mapM f . S.each` (or the conduit or pipes equivalent). You will then have a stream and can recover a list with `S.toList` or an equivalent, but most likely, as in this case, you will find you don't need a reified accumulated list at all, and can instead use some streaming process such as printing to a file or stdout, or writing to a database, after whatever list-like manipulations are needed (here we use, e.g., the streaming `filter`, and also `concat` both to flatten the streamed lists and as a kind of `catMaybes`).
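The essence of the refactoring can be seen even without the `streaming` package: replace "build the whole result list, then consume it" with "consume each result as soon as it is produced". A sketch with a hypothetical `geocode` action standing in for `geocodeAddresses` (any `IO` consumer, such as a database write, works the same way):

```haskell
import Control.Monad ((>=>))

-- Hypothetical per-batch action standing in for geocodeAddresses.
geocode :: [String] -> IO [String]
geocode batch = pure (map ("geo:" ++) batch)

-- Accumulating version: the complete result list is alive at once,
-- the same shape that caused the 800M residency above.
accumulating :: [[String]] -> IO ()
accumulating batches = mapM geocode batches >>= mapM_ print

-- Streaming-style version: each batch is fetched, printed and dropped
-- before the next request is made -- the same shape as
-- S.mapM geocode . S.each, written with base alone.
streamed :: [[String]] -> IO ()
streamed = mapM_ (geocode >=> print)

main :: IO ()
main = streamed [["a", "b"], ["c"]]
```

Both versions print the same lines; they differ only in how much of the result is resident at any one moment, which is exactly the difference the memory profile shows.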