monotonically_increasing_id larger than total records in dataframe
I have the pyspark code below. I am trying to create a new match_id that increases sequentially. The old match_id values are random unique values. A match_id represents a pair of matched products in the data. So I sort the original data by id. There can be multiple ids per match_id, and I take the first id for each match_id. Then I sort the smaller dataframe of first ids and match_ids by id, and create a new monotonically increasing id on it. Finally I join back to the original dataframe on match_id, so that I get the original dataframe with the new monotonically increasing match_id. The problem I am running into is that the new match_id values are larger than the number of records in either dataframe. How is that possible? Shouldn't new_match_id equal the row number of id_to_match_id plus 1, so that the largest new_match_id is 377729? Can someone explain what is causing these large new_match_id values and suggest how to create new_match_id correctly?
Code:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, first, monotonically_increasing_id

# partition by match_id and get first id
w2 = Window().partitionBy("match_id").orderBy(new_matched_df.id.asc())
id_to_match_id = new_matched_df\
    .select(col("match_id"), first("id", True).over(w2).alias('id')).distinct()
# creating new match_id
id_to_match_id = id_to_match_id.sort('id', ascending=True)\
    .withColumn('new_match_id', monotonically_increasing_id() + 1)
new_matched_df2=new_matched_df
# replacing old match_id with new match_id
new_matched_df2=new_matched_df2.alias('a')\
.join(id_to_match_id.alias('b'),
(col('a.match_id')==col('b.match_id')),
how='inner'
)\
.select(col('a.storeid'),
col('a.product_id'),
col('a.productname'),
col('a.productbrand'),
col('a.producttype'),
col('a.productsubtype'),
col('a.classification'),
col('a.weight'),
col('a.unitofmeasure'),
col('a.id'),
col('b.new_match_id').alias('match_id'))
id_to_match_id.sort('new_match_id',ascending=False).show()
print(new_matched_df2.count())
print(id_to_match_id.count())
Output:
+------------+------+-------------+
| match_id| id| new_match_id|
+------------+------+-------------+
|412316878198|864316|1709396985719|
|412316878188|864306|1709396985718|
|412316878183|864301|1709396985717|
|412316878182|864300|1709396985716|
|412316878181|864299|1709396985715|
|412316878178|864296|1709396985714|
|412316878177|864295|1709396985713|
|412316878175|864293|1709396985712|
|412316878174|864292|1709396985711|
|412316878169|864287|1709396985710|
|412316878160|864278|1709396985709|
|412316878156|864274|1709396985708|
|412316878154|864272|1709396985707|
|412316878149|864267|1709396985706|
|412316878148|864266|1709396985705|
|412316878146|864264|1709396985704|
|412316878145|864263|1709396985703|
|412316878143|864261|1709396985702|
|412316878136|864254|1709396985701|
|412316878135|864253|1709396985700|
+------------+------+-------------+
864302
377728
Hi, please see this for more information.
From the documentation for monotonically_increasing_id:
The generated ID is guaranteed to be monotonically increasing and
unique, but not consecutive. The current implementation puts the
partition ID in the upper 31 bits, and the record number within each
partition in the lower 33 bits. The assumption is that the data frame
has less than 1 billion partitions, and each partition has less than 8
billion records.
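That bit layout explains the numbers in the question directly. As a quick check, we can decode one of the oversized new_match_id values from the question's output in plain Python (the - 1 undoes the + 1 that the question's code added on top of the raw id):

```python
# Decode one of the large new_match_id values shown in the question,
# using the documented bit layout of monotonically_increasing_id.
raw_id = 1709396985719 - 1                 # undo the +1 added in the question's code
partition_id = raw_id >> 33                # upper 31 bits: Spark partition ID
record_number = raw_id & ((1 << 33) - 1)   # lower 33 bits: row offset within that partition
print(partition_id, record_number)         # 199 1910
```

So the value is huge not because there are that many rows, but because that row landed in partition 199: the generated IDs are unique and increasing, never consecutive.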
What you need is something like this:
from pyspark.sql.window import Window
from pyspark.sql import functions as fun

# first tag each row with a unique (but non-consecutive) id
df2 = df1.withColumn("id", fun.monotonically_increasing_id())
# then overwrite it with a consecutive row number ordered by that id
windowSpec = Window.orderBy("id")
df2.withColumn("id", fun.row_number().over(windowSpec)).show()
+-----+---+
|value|id |
+-----+---+
| 1| 1|
| 3| 2|
| 9| 3|
| 13| 4|
+-----+---+
This will create the kind of IDs you need.