R 3.4.1 - Intelligent use of while loop for RSiteCatalyst enqueued reports
Current situation
I have been using the RSiteCatalyst package for some time now. For those unfamiliar with it, it makes it much easier to get data from Adobe Analytics through the API.
Until now, the workflow was as follows:
- Make a request, for example:
key_metrics <- QueueOvertime(clientId, dateFrom4, dateTo,
metrics = c("pageviews"), date.granularity = "month",
max.attempts = 500, interval.seconds = 20)
Wait for the response, which will be saved as a data.frame (sample structure):
> View(head(key_metrics,1))
datetime name year month day pageviews
1 2015-07-01 July 2015 2015 7 1 45825
Perform some data transformations (for example):
key_metrics$datetime <- as.Date(key_metrics$datetime)
The problem with this workflow is that sometimes (because of the complexity of the request) we can wait a long time until the response finally arrives. If the R script contains 40-50 equally complex API requests, we wait 40-50 times, each time until the data arrives before the next request can be made. This obviously creates a bottleneck in my ETL process.
Objective
However, most of the package's functions have a parameter, enqueueOnly, which tells Adobe to process the request in the background while delivering a report ID as the immediate response:
key_metrics <- QueueOvertime(clientId, dateFrom4, dateTo,
metrics = c("pageviews"), date.granularity = "month",
max.attempts = 500, interval.seconds = 20,
enqueueOnly = TRUE)
> key_metrics
[1] 1154642436
I can retrieve the "real" response (with the data) at any time with:
key_metrics <- GetReport(key_metrics)
So in each request I add the parameter enqueueOnly = TRUE while building vectors of report IDs and report names:
queueFromIds <- c(queueFromIds, key_metrics)
queueFromNames <- c(queueFromNames, "key_metrics")
The key difference with this approach is that all my requests are processed by Adobe simultaneously, so the total waiting time is greatly reduced.
Problem
However, I am having trouble retrieving the data efficiently. I am trying a while loop that removes each report ID and name from the vectors above once its data has been obtained:
while (length(queueFromNames)>0)
{
assign(queueFromNames[1], GetReport(queueFromIds[1],
max.attempts = 3,
interval.seconds = 5))
queueFromNames <- queueFromNames[-1]
queueFromIds <- queueFromIds[-1]
}
However, this only works when the requests are simple enough to be processed within a few seconds. When a request is complex enough that 3 attempts at 5-second intervals are not sufficient, the loop stops with the following error:
Error in ApiRequest(body = toJSON(request.body), func.name =
"Report.Get", : ERROR: max attempts exceeded for
https://api3.omniture.com/admin/1.4/rest/?method=Report.Get
Which functions could help me ensure that all API requests are processed correctly and, ideally, that requests needing extra time (the ones that raise this error) are skipped until the end of the loop, when they are requested again?
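One pattern that addresses this directly (a sketch, not tested against the live API) is to wrap GetReport in tryCatch and, when a report is not ready yet, rotate its ID to the back of the queue instead of letting the error stop the loop:

```r
library(RSiteCatalyst)

# Sketch: fault-tolerant retrieval loop. Assumes queueFromIds and
# queueFromNames were built with enqueueOnly = TRUE as shown above.
while (length(queueFromNames) > 0) {
  report <- tryCatch(
    GetReport(queueFromIds[1], max.attempts = 3, interval.seconds = 5),
    error = function(e) NULL  # "max attempts exceeded" becomes NULL
  )
  if (is.null(report)) {
    # Not ready yet: move this request to the back of the queue
    queueFromIds <- c(queueFromIds[-1], queueFromIds[1])
    queueFromNames <- c(queueFromNames[-1], queueFromNames[1])
  } else {
    assign(queueFromNames[1], report)
    queueFromIds <- queueFromIds[-1]
    queueFromNames <- queueFromNames[-1]
  }
}
```

Note that if a report can never be delivered this loops forever; a per-ID retry counter, or a Sys.sleep() between full passes over the queue, would be a sensible addition.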
I use a couple of functions to generate and retrieve the report IDs independently. That way, it does not matter how long the reports take to process. I usually come back for them 12 hours after the report IDs were generated; I believe they expire after about 48 hours. The functions rely on RSiteCatalyst, of course. Here they are:
#' Generate report IDs to be retrieved later
#'
#' @description This function works in tandem with other functions to programmatically extract big datasets from Adobe Analytics.
#' @param suite Report suite ID.
#' @param dateBegin Start date in the following format: YYYY-MM-DD.
#' @param dateFinish End date in the following format: YYYY-MM-DD.
#' @param metrics Vector containing up to 30 required metrics IDs.
#' @param elements Vector containing element IDs.
#' @param classification Vector containing classification IDs.
#' @param valueStart Integer value pointing to the row to start the report with.
#' @return A data frame containing all the report IDs per day. They are required to obtain all trended reports during the specified time frame.
#' @examples
#' \dontrun{
#' ReportsIDs <- reportsGenerator(suite, dateBegin, dateFinish, metrics,
#' elements, classification, valueStart)
#'}
#' @export
reportsGenerator <- function(suite,
dateBegin,
dateFinish,
metrics,
elements,
classification,
valueStart) {
#Convert dates to date format.
#Deduct one from dateBegin to
#neutralize the initial +1 in the loop.
dateBegin <- as.Date(dateBegin, "%Y-%m-%d") - 1
dateFinish <- as.Date(dateFinish, "%Y-%m-%d")
timeRange <- as.integer(dateFinish - dateBegin)
#Create data frame to store dates and report IDs
VisitorActivityReports <-
data.frame(matrix(NA, nrow = timeRange, ncol = 2))
names(VisitorActivityReports) <- c("Date", "ReportID")
#Run a loop to retrieve one ReportID for each day in the time period.
for (i in seq_len(timeRange)) {
dailyDate <- as.character(dateBegin + i)
print(i) #Visibility to end user
print(dailyDate) #Visibility to end user
VisitorActivityReports[i, 1] <- dailyDate
VisitorActivityReports[i, 2] <-
RSiteCatalyst::QueueTrended(
reportsuite.id = suite,
date.from = dailyDate,
date.to = dailyDate,
metrics = metrics,
elements = elements,
classification = classification,
top = 50000,
max.attempts = 500,
start = valueStart,
enqueueOnly = TRUE
)
}
return(VisitorActivityReports)
}
Assign the output of the previous function to a variable, then use that variable as the input for the next function. Also assign the result of reportsRetriever to a variable; the output will be a data frame. The function rbinds all the reports together, as long as they share the same structure. Do not try to concatenate reports with different structures.
#' Retrieve all reports stored as output of reportsGenerator function and consolidate them.
#'
#' @param dataFrameReports This is the output from reportsGenerator function. It MUST contain a column titled: ReportID
#' @details It is recommended to break the input data frame in chunks of 50 rows in order to prevent memory issues if the reports are too large. Otherwise the server or local computer might run out of memory.
#' @return A data frame containing all the consolidated reports defined by the reportsGenerator function.
#' @examples
#' \dontrun{
#' visitorActivity <- reportsRetriever(dataFrameReports)
#'}
#'
#' @export
reportsRetriever <- function(dataFrameReports) {
#Wrap each GetReport call so a failed report yields NULL instead of aborting
visitor.activity.list <- lapply(dataFrameReports$ReportID,
function(id) tryCatch(GetReport(id), error = function(e) NULL))
visitor.activity.df <- as.data.frame(do.call(rbind, visitor.activity.list))
#Validate report integrity
if (identical(as.character(unique(visitor.activity.df$datetime)),
dataFrameReports$Date)) {
print("Ok. All reports available")
return(visitor.activity.df)
} else {
print("Some reports may have been missed.")
#Identify the requested dates that are absent from the retrieved data
missingDates <- dataFrameReports$Date[!(dataFrameReports$Date %in%
as.character(unique(visitor.activity.df$datetime)))]
print(missingDates)
return(visitor.activity.df)
}
}
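A minimal usage sketch tying the two functions together, following the chunk-of-50 recommendation in the @details section. The credentials, report suite, metric, and element names below are placeholders, not real IDs:

```r
library(RSiteCatalyst)

# Hypothetical credentials and report definition -- replace with your own.
SCAuth("API_KEY", "API_SECRET")
reportIDs <- reportsGenerator(
  suite = "myreportsuite",
  dateBegin = "2017-07-01", dateFinish = "2017-07-31",
  metrics = c("pageviews"), elements = c("page"),
  classification = c(""), valueStart = 1
)

# ...come back later (e.g. 12 hours), then retrieve in chunks of 50 rows
# to keep memory usage down.
chunks <- split(reportIDs, ceiling(seq_len(nrow(reportIDs)) / 50))
visitorActivity <- do.call(rbind, lapply(chunks, reportsRetriever))
```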