Lambda function for filtering an RDD in Spark (Python) - check that an element is not an empty string

Question

I have the following RDD:

2019-09-24,Debt collection,transworld systems inc. is trying to collect a debt that is not mine not owed and is inaccurate.

2019-09-19,Credit reporting credit repair services or other personal consumer reports,

Each record has 3 comma-separated elements: a date, a label, and a comment.

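I need to apply a filter transformation that keeps only the records that start with '201' (the date) and contain a comment, i.e. the third element has a value and is not an empty string.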

I use the following code to count how many records remain after each filter transformation:

countA = rdd.count()

countB = rdd.filter(lambda x: x.startswith('201')).count()

countC = rdd.filter(lambda x: x.startswith('201') & (x.split(",")[2] != None) & (len(x.split(",")[2]) > 0)).count()

My code crashes when computing countC. The filtering itself seems to work in my later computations, but I get further errors there as well...

Answer 1

You are getting the error:

IndexError: list index out of range

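because you are trying to access index 2 of the list returned by split. That index may not exist if some lines in your dataset contain only a date, or only a date and a label, or are empty, or have some other formatting problem.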
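
To see which kinds of lines trigger this, here is a minimal plain-Python sketch (no Spark needed). The first two sample lines come from the question; the last two are made-up examples of malformed input, and the helper name third_field_status is hypothetical.

```python
# Sample lines: the first two are from the question, the last two are
# hypothetical examples of malformed input.
lines = [
    "2019-09-24,Debt collection,transworld systems inc. is trying to collect a debt that is not mine not owed and is inaccurate.",
    "2019-09-19,Credit reporting credit repair services or other personal consumer reports,",
    "2019-09-20,Mortgage",  # hypothetical: date and label only
    "",                     # hypothetical: empty line
]


def third_field_status(line):
    """Describe what happens when index 2 of the split result is accessed."""
    parts = line.split(",")
    try:
        return repr(parts[2])
    except IndexError:
        return "IndexError ({} field(s))".format(len(parts))


for line in lines:
    print(third_field_status(line))
```

Note that a trailing comma, as in the second line, yields an empty third field rather than an error, so the check that the field is non-empty is still needed on top of the length check.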

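In your lambda function you can take advantage of short-circuit evaluation in Python: first check that there are at least 3 elements (i.e. that index 2 exists) before trying to access that index, using len(x.split(",")) >= 3 instead of (x.split(",")[2] != None). Note that short-circuiting only happens with the and operator; the & used in your code is a bitwise operator that always evaluates both operands.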
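
The difference between and and & can be checked in isolation. The following is only an illustrative sketch: the helper third_field_nonempty and the sample line are hypothetical and not part of the original code.

```python
def third_field_nonempty(line):
    # Raises IndexError when the line has fewer than 3 comma-separated fields.
    return len(line.split(",")[2]) > 0


line = "not a record"  # hypothetical line with no commas

# `and` short-circuits: startswith is already False, so the right-hand
# side (and its indexing) is never evaluated.
safe = line.startswith("201") and third_field_nonempty(line)
print(safe)

# `&` evaluates both operands before combining them, so the indexing
# still runs and raises.
try:
    line.startswith("201") & third_field_nonempty(line)
    raised = False
except IndexError:
    raised = True
print(raised)
```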

This can be written as:

countC = rdd.filter(lambda x: x.startswith('201') and (len(x.split(",")) >= 3) and (len(x.split(",")[2]) > 0)).count()
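
If the lambda becomes hard to read, the same checks could be moved into a named function that splits each line only once. This is a sketch under assumptions: the name keep_record is hypothetical, and it uses split(",", 2) so that commas inside the comment stay in the third field, which differs slightly from the one-liner above for comments that contain commas.

```python
def keep_record(line):
    """Return True for lines that start with '201' and have a non-empty comment."""
    if not line.startswith("201"):
        return False
    # maxsplit=2 keeps any commas inside the comment in the third field.
    parts = line.split(",", 2)
    return len(parts) == 3 and len(parts[2]) > 0


# Plain-Python check; the first two lines are from the question, the
# last two are hypothetical malformed lines.
sample = [
    "2019-09-24,Debt collection,transworld systems inc. is trying to collect a debt that is not mine not owed and is inaccurate.",
    "2019-09-19,Credit reporting credit repair services or other personal consumer reports,",
    "2019-09-20,Mortgage",
    "",
]
kept = [line for line in sample if keep_record(line)]
print(len(kept))
```

In Spark, the function would be passed directly to the filter, e.g. countC = rdd.filter(keep_record).count().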

Let me know if this works for you.
