How to select an exact number of random rows from a DataFrame

Question

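How can I efficiently select an exact number of random rows from a DataFrame? The data contains an index column that can be used. If I have to use the maximum size, which is more efficient on the index column: count() or max()?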

Answer 1

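One possible approach is to count the rows with .count(), then use sample() from Python's random library to draw a random sequence of the required length from that range. Finally, use the resulting list of numbers, vals, to subset the index column.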

import random

def sampler(df, col, records):

  # Count the rows; this assumes the values in `col` run from 1 to count() with no gaps
  colmax = df.count()

  # Draw `records` distinct values from 1..colmax (range excludes its end, hence the + 1)
  vals = random.sample(range(1, colmax + 1), records)

  # Keep only the rows whose index value was drawn
  return df.filter(df[col].isin(vals))

Example:

df = sc.parallelize([(1,1),(2,1),
                     (3,1),(4,0),
                     (5,0),(6,1),
                     (7,1),(8,0),
                     (9,0),(10,1)]).toDF(["a","b"])

sampler(df,"a",3).show()
+---+---+
|  a|  b|
+---+---+
|  3|  1|
|  4|  0|
|  6|  1|
+---+---+
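
The index-drawing step of this answer can be tried without a Spark session. Below is a minimal sketch, assuming the index values run from 1 to the row count with no gaps; the helper name sample_index_values and the seed parameter are illustrative and not part of the answer. Note that range(1, n) stops at n - 1, so the upper bound has to be n + 1 for the last row to be eligible.

```python
import random

def sample_index_values(n_rows, k, seed=None):
    """Draw k distinct index values from 1..n_rows (inclusive)."""
    # Local generator, so passing a seed makes the draw repeatable
    rng = random.Random(seed)
    # range(1, n_rows + 1) covers 1..n_rows; range(1, n_rows) would miss n_rows
    return rng.sample(range(1, n_rows + 1), k)

# Three distinct index values out of ten rows
vals = sample_index_values(10, 3, seed=0)
print(sorted(vals))

# Asking for every row returns the whole index, including the last value
print(sorted(sample_index_values(10, 10, seed=0)))
```

If the index column has gaps or duplicates, isin(vals) can return fewer or more rows than requested, so the exact count only holds under the contiguity assumption.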

Answer 2

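Here is an alternative that uses the pandas DataFrame.sample method. It relies on Spark's applyInPandas method, available from Spark 3.0.0, to distribute the groups. This lets you select an exact number of rows per group.

I have added args and kwargs to the function so that you can pass the other arguments of DataFrame.sample.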

def sample_n_per_group(n, *args, **kwargs):
    def sample_per_group(pdf):
        return pdf.sample(n, *args, **kwargs)
    return sample_per_group

df = spark.createDataFrame(
    [
        (1, 1.0), 
        (1, 2.0), 
        (2, 3.0), 
        (2, 5.0), 
        (2, 10.0)
    ],
    ("id", "v")
)

(df.groupBy("id")
   .applyInPandas(
        sample_n_per_group(2, random_state=2), 
        schema=df.schema
   )
)
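
The per-group logic above can be illustrated without Spark or pandas. The following is a rough sketch in plain Python, not the answer's implementation: sample_n_per_key, key_index and seed are hypothetical names, and the clamp to the group size is an added assumption. As far as I know, pandas' DataFrame.sample raises a ValueError when n exceeds the group size unless replace=True is passed, so a group smaller than n needs some handling of this kind.

```python
import random
from collections import defaultdict

def sample_n_per_key(rows, key_index, n, seed=None):
    """Pick up to n random rows from each group of rows sharing a key."""
    rng = random.Random(seed)
    # Collect rows per key, which is what groupBy does conceptually
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    sampled = []
    for key in sorted(groups):
        members = groups[key]
        # Clamp so that a group smaller than n is returned whole (an assumption)
        sampled.extend(rng.sample(members, min(n, len(members))))
    return sampled

rows = [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)]
picked = sample_n_per_key(rows, key_index=0, n=2, seed=2)
print(picked)
```

With n=2 this yields two rows for each id; with n=3 the group for id 1 would contribute only its two rows.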

Regarding the limitations with very large groups, from the documentation:

This function requires a full shuffle. All the data of a group will be loaded into memory, so the user should be aware of the potential OOM risk if data is skewed and certain groups are too large to fit in memory.
