Firebase 导出到 BigQuery:保留群组查询

Firebase exported to BigQuery: retention cohorts query

Firebase 通过 Firebase 远程配置提供拆分测试功能,但无法过滤具有用户属性(实际上是 属性 的同类群组部分中的保留)。

为了寻找这个问题的解决方案,我正在寻找 BigQuery,因为 Firebase Analytics 提供了将数据导出到该服务的可用方法。

但我遇到了很多问题,google 没有答案或示例可以指出正确的方向。

一般问题:

作为第一步,我需要聚合代表相同数据的 firebase 队列所做的数据,因此我可以确定我的计算是正确的:

下一步应该只是对查询应用约束,以便它们匹配自定义用户属性。

这是我目前得到的:

主要问题 – 用户计算差异很大。有时是大约 100 个用户,但有时接近 1000。

这是我使用的方法:

# 1

# Count users with `user_dim.first_open_timestamp_micros` 
# in specified period (w0 – week 1)
# this is the way firebase group users to cohorts 
# (who started app on the same day or during the same week) 
# https://support.google.com/firebase/answer/6317510

SELECT
  COUNT(DISTINCT user_dim.app_info.app_instance_id) as count
FROM
  (
   TABLE_DATE_RANGE
    (
     [admob-app-id-xx:xx_IOS.app_events_], 
     TIMESTAMP('2016-11-20'), 
     TIMESTAMP('2016-11-26')
    )
  )
WHERE
  STRFTIME_UTC_USEC(user_dim.first_open_timestamp_micros, '%Y-%m-%d')
  BETWEEN '2016-11-20' AND '2016-11-26'

# 2

# For each next period count events with 
# same first_open_timestamp
# Here is example for one of the weeks. 
# week 0 is Nov20-Nov26, week 1 is Nov27-Dec03

SELECT
  COUNT(DISTINCT user_dim.app_info.app_instance_id) as count
FROM
  (
   TABLE_DATE_RANGE
    (
     [admob-app-id-xx:xx_IOS.app_events_], 
     TIMESTAMP('2016-11-27'), 
     TIMESTAMP('2016-12-03')
    )
  )
WHERE
  STRFTIME_UTC_USEC(user_dim.first_open_timestamp_micros, '%Y-%m-%d')
  BETWEEN '2016-11-20' AND '2016-11-26'

# 3

# Now we have users for each week w1, w2, ... w5
# Calculate retention for each of them
# retention week 1 = w1 / w0 * 100 = 25.72181359
# rw2 = w2 / w1 * 100
# ...
# rw5 = w5 / w1 * 100

# 4 

# Shift week 0 by one and repeat from step 1

BigQuery 查询提示请求

非常感谢有关构建复杂查询的任何提示和指导,这些查询可以在一个步骤中聚合和计算此任务所需的所有数据。

Here is BigQuery Export schema if needed

附带问题:

PS

最好是 Firebase 队友的人回答这个问题。 关于通过过滤或显示大查询示例扩展同类群组功能,但事情没有进展。 Firebase Analytics 是他们说的方式,Google Analytics 已被弃用,他们说。 现在我花了第二天学习 bigquery 并在现有分析工具上构建我自己的解决方案。我不,堆栈溢出不是这个评论的地方,但是你们在想什么?拆分测试可能会在语法上影响我的应用程序的保留。我的应用程序没有售出任何东西,渠道和事件在很多情况下都不是有价值的指标。

Any tips and directions to go about building complex query which may aggregate and calculate all data required for this task in one step is very appreciated.

yes, generic bigquery will work fine

以下不是最通用的版本,但可以给你一个思路
在这个例子中我使用 Stack Overflow Data available in Google BigQuery Public Datasets

第一个子select – 活动 – 在大多数情况下,这是您唯一需要重写以反映数据细节的内容。
它的作用是:
一种。定义要为分析设置的时间段。
在下面的示例中 - 它是一个月 - FORMAT_DATE('%Y-%m', ...
但是可以分别用year, week, day or anything else-
• 按年份 - FORMAT_DATE('%Y', DATE(answers.creation_date)) AS period
• 按周 - FORMAT_DATE('%Y-%W', DATE(answers.creation_date)) AS period
• 按天 - FORMAT_DATE('%Y-%m-%d', DATE(answers.creation_date)) AS period
• …
b.此外,它仅“过滤”您需要分析的 events/activity 类型
例如,`WHERE CONCAT('|', questions.tags, '|') LIKE '%|google-bigquery|%' 寻找 google-bigquery 标记问题的答案

其余的子查询比较通用,大部分都可以按原样使用

#standardSQL
WITH activities AS (
  SELECT answers.owner_user_id AS id,
    FORMAT_DATE('%Y-%m', DATE(answers.creation_date)) AS period
  FROM `bigquery-public-data.Whosebug.posts_answers` AS answers
  JOIN `bigquery-public-data.Whosebug.posts_questions` AS questions
  ON questions.id = answers.parent_id
  WHERE CONCAT('|', questions.tags, '|') LIKE '%|google-bigquery|%' 
  GROUP BY id, period
), cohorts AS (
  SELECT id, MIN(period) AS cohort FROM activities GROUP BY id
), periods AS (
  SELECT period, ROW_NUMBER() OVER(ORDER BY period) AS num
  FROM (SELECT DISTINCT cohort AS period FROM cohorts)
), cohorts_size AS (
  SELECT cohort, periods.num AS num, COUNT(DISTINCT activities.id) AS ids 
  FROM cohorts JOIN activities ON activities.period = cohorts.cohort AND cohorts.id = activities.id
  JOIN periods ON periods.period = cohorts.cohort
  GROUP BY cohort, num
), retention AS (
  SELECT cohort, activities.period AS period, periods.num AS num, COUNT(DISTINCT cohorts.id) AS ids
  FROM periods JOIN activities ON activities.period = periods.period
  JOIN cohorts ON cohorts.id = activities.id 
  GROUP BY cohort, period, num 
)
SELECT 
  CONCAT(cohorts_size.cohort, ' - ',  FORMAT("%'d", cohorts_size.ids), ' users') AS cohort, 
  retention.num - cohorts_size.num AS period_lag, 
  retention.period as period_label,
  ROUND(retention.ids / cohorts_size.ids * 100, 2) AS retention , retention.ids AS rids
FROM retention
JOIN cohorts_size ON cohorts_size.cohort = retention.cohort
WHERE cohorts_size.cohort >= FORMAT_DATE('%Y-%m', DATE('2015-01-01'))
ORDER BY cohort, period_lag, period_label  

您可以使用您选择的工具可视化上述查询的结果
注意:您可以使用 period_lag 或 period_label
在下面的示例中查看它们的用法差异

与 period_lag

与 period_label