How to create single jar in scala with sbt assembly on AWS EMR? Running into deduplicate: different file contents found in the following: errors
I'm on an AWS EMR cluster I just stood up, with a Scala file compiled that I'd like to build into an assembly. However, when I issue sbt assembly I run into deduplicate errors.
Per https://medium.com/@tedherman/compile-scala-on-emr-cb77610559f0 I originally had a symlink from lib to the Spark jars directory:
ln -s /usr/lib/spark/jars lib
though I noticed my code passes sbt compile with or without it. Still, I'm confused about why the sbt assembly dedupe errors occur and how to resolve them. I'll also note that, per the article, I installed sbt in a bootstrap action.
With the symlink in place
Some of the dedupes look like outright duplicates; example:
[error] deduplicate: different file contents found in the following:
[error] /home/hadoop/.ivy2/cache/org.apache.parquet/parquet-jackson/jars/parquet-jackson-1.10.1.jar:shaded/parquet/org/codehaus/jackson/util/CharTypes.class
[error] /usr/lib/spark/jars/parquet-jackson-1.10.1-spark-amzn-1.jar:shaded/parquet/org/codehaus/jackson/util/CharTypes.class
Others look like competing versions:
[error] deduplicate: different file contents found in the following:
[error] /home/hadoop/.ivy2/cache/org.apache.spark/spark-core_2.11/jars/spark-core_2.11-2.4.3.jar:org/spark_project/jetty/util/MultiPartOutputStream.class
[error] /usr/lib/spark/jars/spark-core_2.11-2.4.5-amzn-0.jar:org/spark_project/jetty/util/MultiPartOutputStream.class
I don't understand why there are competing versions, whether they are present by default, or whether I did something that introduced them.
Without the symlink
I figured removing it would give me fewer problems, and I do still get dupes, just fewer of them:
[error] deduplicate: different file contents found in the following:
[error] /home/hadoop/.ivy2/cache/org.apache.hadoop/hadoop-yarn-api/jars/hadoop-yarn-api-2.6.5.jar:org/apache/hadoop/yarn/factory/providers/package-info.class
[error] /home/hadoop/.ivy2/cache/org.apache.hadoop/hadoop-yarn-common/jars/hadoop-yarn-common-2.6.5.jar:org/apache/hadoop/yarn/factory/providers/package-info.class
Consider that one is hadoop-yarn-api-2.6.5.jar and the other is hadoop-yarn-common-2.6.5.jar. Different names, so why a dupe?
Others look like version conflicts:
[error] deduplicate: different file contents found in the following:
[error] /home/hadoop/.ivy2/cache/javax.inject/javax.inject/jars/javax.inject-1.jar:javax/inject/Named.class
[error] /home/hadoop/.ivy2/cache/org.glassfish.hk2.external/javax.inject/jars/javax.inject-2.4.0-b34.jar:javax/inject/Named.class
Some have the same file name but live in different paths/jars...
[error] deduplicate: different file contents found in the following:
[error] /home/hadoop/.ivy2/cache/org.apache.arrow/arrow-format/jars/arrow-format-0.10.0.jar:git.properties
[error] /home/hadoop/.ivy2/cache/org.apache.arrow/arrow-memory/jars/arrow-memory-0.10.0.jar:git.properties
[error] /home/hadoop/.ivy2/cache/org.apache.arrow/arrow-vector/jars/arrow-vector-0.10.0.jar:git.properties
Same with these...
[error] deduplicate: different file contents found in the following:
[error] /home/hadoop/.ivy2/cache/org.apache.spark/spark-catalyst_2.11/jars/spark-catalyst_2.11-2.4.3.jar:org/apache/spark/unused/UnusedStubClass.class
[error] /home/hadoop/.ivy2/cache/org.apache.spark/spark-core_2.11/jars/spark-core_2.11-2.4.3.jar:org/apache/spark/unused/UnusedStubClass.class
[error] /home/hadoop/.ivy2/cache/org.apache.spark/spark-graphx_2.11/jars/spark-graphx_2.11-2.4.3.jar:org/apache/spark/unused/UnusedStubClass.class
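As a side note on why two differently named jars can collide: sbt-assembly keys on the entry path inside each jar, not on the jar name, so any two jars that both ship an entry at the same path (like git.properties above) conflict. A throwaway sketch that demonstrates this with java.util.zip (the jar names and contents here are made up, not the real EMR artifacts):

```scala
import java.io.{File, FileOutputStream}
import java.util.zip.{ZipEntry, ZipFile, ZipOutputStream}
import scala.io.Source

// Build a tiny "jar" (a jar is just a zip archive) with one entry.
def makeJar(path: File, contents: String): Unit = {
  val out = new ZipOutputStream(new FileOutputStream(path))
  out.putNextEntry(new ZipEntry("git.properties"))
  out.write(contents.getBytes("UTF-8"))
  out.closeEntry(); out.close()
}

// Read a named entry back out of a jar as a string.
def readEntry(path: File, name: String): String = {
  val zip = new ZipFile(path)
  val src = Source.fromInputStream(zip.getInputStream(zip.getEntry(name)))
  try src.mkString finally { src.close(); zip.close() }
}

val a = File.createTempFile("format-demo", ".jar")
val b = File.createTempFile("memory-demo", ".jar")
makeJar(a, "commit=aaa\n")
makeJar(b, "commit=bbb\n")
// Same entry path, different bytes: exactly what "different file
// contents found" is complaining about.
println(readEntry(a, "git.properties") == readEntry(b, "git.properties"))
```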
For reference, some other info.
Imports in my Scala object:
import org.apache.spark.sql.SparkSession
import java.time.LocalDateTime
import com.amazonaws.regions.Regions
import com.amazonaws.services.secretsmanager.AWSSecretsManagerClientBuilder
import com.amazonaws.services.secretsmanager.model.GetSecretValueRequest
import org.json4s.{DefaultFormats, MappingException}
import org.json4s.jackson.JsonMethods._
import com.datarobot.prediction.spark.Predictors.{getPredictorFromServer, getPredictor}
My build.sbt:
libraryDependencies ++= Seq(
"net.snowflake" % "snowflake-jdbc" % "3.12.5",
"net.snowflake" % "spark-snowflake_2.11" % "2.7.1-spark_2.4",
"com.datarobot" % "scoring-code-spark-api_2.4.3" % "0.0.19",
"com.datarobot" % "datarobot-prediction" % "2.1.4",
"com.amazonaws" % "aws-java-sdk-secretsmanager" % "1.11.789",
"software.amazon.awssdk" % "regions" % "2.13.23"
)
Ideas? Please advise.
You will need an assemblyMergeStrategy setting (docs).
Random example:
assemblyMergeStrategy in assembly := {
  // match META-INF and everything under it (xs @ _* matches any depth;
  // a bare _ would only match one path segment)
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  // git.properties is a top-level entry, so match the bare file name
  case "git.properties"   => MergeStrategy.discard
  case "application.conf" => MergeStrategy.concat
  case "reference.conf"   => MergeStrategy.concat
  case _                  => MergeStrategy.first
}
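Separately, and this is an assumption about your setup rather than something the error output proves: since EMR already provides the Spark and Hadoop jars on the cluster, a common way to shrink the dupe list is to declare those dependencies explicitly with the provided scope (instead of relying on the symlink), so sbt can compile against them but sbt-assembly leaves them out of the fat jar. A sketch for build.sbt, with illustrative versions:

```scala
// Spark comes from the EMR cluster at runtime; compile against it,
// but keep it out of the assembly.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.3" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.4.3" % "provided"
)
```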