How to append a constant to all columns of the header in Spark Scala
For example, this is my existing header:
DataPartition|^|TimeStamp|^|Source.organizationId|^|Source.sourceId|^|FilingDateTime|^|SourceTypeCode|^|DocumentId|^|Dcn|^|DocFormat|^|StatementDate|^|IsFilingDateTimeEstimated|^|ContainsPreliminaryData|^|CapitalChangeAdjustmentDate|^|CumulativeAdjustmentFactor|^|ContainsRestatement|^|FilingDateTimeUTCOffset|^|ThirdPartySourceCode|^|ThirdPartySourcePriority|^|SourceTypeId|^|ThirdPartySourceCodeId|^|FFAction|!|
I want to create a header like this:
DataPartition_1|^|TimeStamp|^|Source.organizationId|^|Source.sourceId|^|FilingDateTime_1|^|SourceTypeCode_1|^|DocumentId_1|^|Dcn_1|^|DocFormat_1|^|StatementDate_1|^|IsFilingDateTimeEstimated_1|^|ContainsPreliminaryData_1|^|CapitalChangeAdjustmentDate_1|^|CumulativeAdjustmentFactor_1|^|ContainsRestatement_1|^|FilingDateTimeUTCOffset_1|^|ThirdPartySourceCode_1|^|ThirdPartySourcePriority_1|^|SourceTypeId_1|^|ThirdPartySourceCodeId_1|^|FFAction_1
I want to append _1 to every column in the header except TimeStamp|^|Source.organizationId|^|Source.sourceId.
I have already done this with withColumn, but I have to call it for every column.
Is there a simpler way, such as using foldLeft?
First, you need to define the list of columns to skip:
val columnsToAvoid = List("TimeStamp","Source.organizationId","Source.sourceId")
Then you can foldLeft over the DataFrame's column list (given by df.columns), renaming every column that is not in columnsToAvoid and returning the DataFrame unchanged otherwise.
df.columns.foldLeft(df)((acc, elem) =>
if (columnsToAvoid.contains(elem)) acc
else acc.withColumnRenamed(elem, elem+"_1"))
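As a possible alternative to the fold, the new names can be computed first as a plain sequence and then applied in one step. The sketch below is not part of the original answer: `RenameColumns` and `appendSuffix` are hypothetical names, and the helper deliberately avoids Spark so it can be run without a session. The `toDF` call appears only as a comment and assumes a DataFrame `df` already exists.

```scala
object RenameColumns {
  // Hypothetical helper: appends `suffix` to every name that is not in `skip`.
  def appendSuffix(columns: Seq[String], skip: Set[String], suffix: String): Seq[String] =
    columns.map(c => if (skip.contains(c)) c else c + suffix)

  def main(args: Array[String]): Unit = {
    val columnsToAvoid = Set("TimeStamp", "Source.organizationId", "Source.sourceId")
    // A shortened column list, used only for illustration.
    val sample = Seq("DataPartition", "TimeStamp", "Source.organizationId", "FFAction")

    println(appendSuffix(sample, columnsToAvoid, "_1").mkString(", "))
    // prints: DataPartition_1, TimeStamp, Source.organizationId, FFAction_1

    // With a DataFrame `df` in scope (not constructed here), the names could be
    // applied in a single call through Dataset.toDF(colNames: String*):
    //   val renamedDf = df.toDF(appendSuffix(df.columns.toSeq, columnsToAvoid, "_1"): _*)
  }
}
```

Since every withColumnRenamed call adds another step to the query plan, renaming everything through a single toDF call may keep the plan smaller when there are many columns. The result should be the same either way.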
Here is a simple example.
Original DF:
+-----+------+-----------+
| word| value|  TimeStamp|
+-----+------+-----------+
|wordA|valueA|45435345435|
|wordB|valueB|  454244345|
|wordC|valueC|32425425435|
+-----+------+-----------+
Operation:
df.columns.foldLeft(df)((acc, elem) => if (columnsToAvoid.contains(elem)) acc else acc.withColumnRenamed(elem, elem+"_1")).show
Result:
+------+-------+-----------+
|word_1|value_1|  TimeStamp|
+------+-------+-----------+
| wordA| valueA|45435345435|
| wordB| valueB|  454244345|
| wordC| valueC|32425425435|
+------+-------+-----------+
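The same rule can also be checked against a header in the format shown in the question, before any DataFrame is involved. This is only a sketch under assumptions: it treats the header as a single line whose fields are separated by `|^|` and terminated by `|!|`, and `HeaderSuffixCheck` and `renameHeader` are hypothetical names, not from the original post.

```scala
import java.util.regex.Pattern

object HeaderSuffixCheck {
  val columnsToAvoid = Set("TimeStamp", "Source.organizationId", "Source.sourceId")

  // Assumption: fields are separated by "|^|" and the line ends with "|!|".
  def renameHeader(header: String): String = {
    // Pattern.quote makes split treat "|^|" literally instead of as a regex.
    val fields = header.stripSuffix("|!|").split(Pattern.quote("|^|"))
    fields.map(f => if (columnsToAvoid.contains(f)) f else f + "_1").mkString("|^|")
  }

  def main(args: Array[String]): Unit = {
    // Shortened header for illustration; the full header is handled the same way.
    val header = "DataPartition|^|TimeStamp|^|Source.organizationId|^|Source.sourceId|^|FFAction|!|"
    println(renameHeader(header))
    // prints: DataPartition_1|^|TimeStamp|^|Source.organizationId|^|Source.sourceId|^|FFAction_1
  }
}
```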