Predictions in SageMaker: writing a function to split a big data frame into batches for predictions

Question

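I am using Amazon SageMaker for model training and prediction. However, I have run into a problem with InvokeEndpoint: each request is still limited to 5MB.

I have more than 1 million rows as distinct inputs, and I know I should either send a separate request for each input, or split the inputs into batches small enough to fit within the limit and send each batch as a separate request (possibly in parallel against the same endpoint).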

    ### Making predictions based on 1 dataframe of 500 rows
    ### approximately 500 rows are ~5MB

    num_predict_rows <- 500
    test_sample <- as.matrix(gender_test[1:num_predict_rows, ])
    dimnames(test_sample)[[2]] <- NULL

    library(stringr)
    dimnames(data_tbl_test)[[2]] <- NULL
    predictions <- model_endpoint$predict(data_tbl_test)
    predictions <- str_split(predictions, pattern = ',', simplify = TRUE)
    predictions <- as.numeric(predictions)

    data_tbl_pred <- cbind(predicted_sample = predictions, data_tbl_test[1:num_predict_rows, ])

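My question is:

How can I write a function that

splits the input data frame into batches of at most 500 rows (<5MB),
so that I have n batches of data,
then makes predictions on every batch using the code above,
and finally returns one combined data frame containing the results of all n batches?

Thanks in advance.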

Answer 1

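You may need to adjust this to structure the output the way you need, but if I understand your code correctly, this should make predictions on each batch and store the results in all_preds.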

    library(stringr)

    # some initialization
    N <- NROW(data_tbl_test)
    num_predict_rows <- 500
    n <- ceiling(N / num_predict_rows)
    k <- 1   # This should be the number of columns in model_endpoint$predict(...)
    all_preds <- matrix(0, N, k)   # where the predictions will be stored

    # get batch indices (the last batch may be shorter than num_predict_rows)
    ind <- vector("list", n)
    for (i in seq_len(n))
        ind[[i]] <- seq((i - 1) * num_predict_rows + 1, min(i * num_predict_rows, N))

    # predict on each batch
    for (i in seq_len(n)) {
        # plain matrix without column names, mirroring test_sample in the question
        batch <- as.matrix(data_tbl_test[ind[[i]], , drop = FALSE])
        dimnames(batch)[[2]] <- NULL
        predictions <- model_endpoint$predict(batch)
        predictions <- str_split(predictions, pattern = ',', simplify = TRUE)
        predictions <- as.numeric(predictions)
        all_preds[ind[[i]], ] <- predictions
    }
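
Since the question asks for a function, the loop above could be wrapped roughly as follows. This is only a sketch under some assumptions: `predict_in_batches` and `fake_predict` are hypothetical names, the endpoint is assumed to return one comma-separated string with one prediction per row, and base R's `strsplit` is used in place of `str_split` so the example runs without extra packages. `fake_predict` stands in for `model_endpoint$predict` so the batching logic can be tried locally.

```r
# Hypothetical helper: split `data` into row batches, predict on each batch,
# and bind the predictions back onto the original rows.
predict_in_batches <- function(data, predict_fn, batch_size = 500) {
  n_rows <- NROW(data)
  # Assign every row to a batch number: 1, 1, ..., 2, 2, ...
  batch_id <- ceiling(seq_len(n_rows) / batch_size)
  row_groups <- split(seq_len(n_rows), batch_id)

  preds <- lapply(row_groups, function(rows) {
    # Plain matrix without column names, as in the question's code
    batch <- as.matrix(data[rows, , drop = FALSE])
    colnames(batch) <- NULL
    raw <- predict_fn(batch)
    # Assumes one comma-separated string with one value per row
    values <- as.numeric(strsplit(raw, ",", fixed = TRUE)[[1]])
    stopifnot(length(values) == length(rows))
    values
  })

  # Batches are processed in row order, so the predictions line up with `data`
  cbind(predicted_sample = unlist(preds, use.names = FALSE), data)
}

# Stand-in for model_endpoint$predict (an assumption, for local testing only)
fake_predict <- function(batch) paste(rowSums(batch), collapse = ",")

demo <- data.frame(a = 1:1203, b = rep(c(0.5, 1.5, 2.5), 401))
result <- predict_in_batches(demo, fake_predict, batch_size = 500)
```

With the real endpoint, the call would presumably look like `predict_in_batches(data_tbl_test, model_endpoint$predict, 500)`. The batch size still has to be chosen so that each serialized batch stays under the 5MB limit.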

Answer 2

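Have you considered using SageMaker Batch Transform for this use case instead? It takes care of streaming the data from S3 to the inference container and supports several ways of splitting the data.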

See https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-batch.html for an overview. Also see https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-batch-code.html for the details if you are bringing your own inference container.
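
As a rough illustration, a Batch Transform job might be started from R along these lines. This is a sketch, not a tested recipe: it assumes `estimator` is a trained SageMaker Python SDK estimator reached through reticulate (the same style as `model_endpoint$predict` in the question), that the input is a CSV file in S3 with one record per line, and that the S3 URIs are placeholders you supply. `run_batch_transform` is a hypothetical wrapper name.

```r
# Sketch only: `estimator` is assumed to be a trained SageMaker Python SDK
# estimator accessed through reticulate; the S3 URIs are placeholders.
run_batch_transform <- function(estimator, input_s3, output_s3,
                                instance_type = "ml.m4.xlarge") {
  transformer <- estimator$transformer(
    instance_count = 1L,
    instance_type  = instance_type,
    strategy       = "MultiRecord",  # pack several records into each request
    assemble_with  = "Line",         # write one prediction per output line
    output_path    = output_s3
  )
  transformer$transform(
    data         = input_s3,
    content_type = "text/csv",
    split_type   = "Line"            # split the input file on newlines
  )
  transformer$wait()                 # block until the transform job finishes
  invisible(transformer)
}
```

The predictions are written to the output location in S3, so they would still need to be downloaded and joined back to the input rows afterwards.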

Some example notebooks:

If you have detailed questions or need support with a specific transform job, please visit the AWS forum: https://forums.aws.amazon.com/forum.jspa?forumID=285&start=0

Tags: r, function, machine-learning, batch-processing