Why does Google Dataproc not pull the CoreNLP jars even though they are included in the POM file?

Question

My application is a Java Maven project that uses Spark. Here is the part of my pom that adds the Stanford CoreNLP dependencies:

        <dependency>
            <groupId>edu.stanford.nlp</groupId>
            <artifactId>stanford-corenlp</artifactId>
            <version>3.6.0</version>
        </dependency>
        <dependency>
            <groupId>edu.stanford.nlp</groupId>
            <artifactId>stanford-corenlp</artifactId>
            <version>3.6.0</version>
            <classifier>models</classifier>
        </dependency>

I get the following error:

java.lang.NoClassDefFoundError: edu/stanford/nlp/pipeline/StanfordCoreNLP

There are other dependencies, for example Spark, and Dataproc handles them fine. Now that I have added CoreNLP, it runs fine on my laptop but fails on Google Dataproc.

Answer 1

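Spark classes are "provided" in the Dataproc environment because they are considered part of the base distribution, along with other Hadoop-related packages such as hadoop-client. Dataproc never reads your POM at run time; it only sees the jar you submit, so any class that is neither bundled in that jar nor already on the cluster's classpath produces a NoClassDefFoundError. Other libraries that aren't part of the base distribution should be packaged into your "fatjar" using the Maven Shade plugin.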
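
As a rough sketch of what that could look like in the pom (not taken from the asker's project): the Spark artifact ID and all version numbers below are assumptions and should be matched to the cluster's image; only the CoreNLP coordinates come from the question. The Spark dependency is marked "provided" so it stays out of the fat jar, while the CoreNLP dependencies keep the default compile scope and get bundled.

        <!-- Assumed Spark coordinates; "provided" means the cluster supplies these classes -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.0.0</version>
            <scope>provided</scope>
        </dependency>

        <!-- Goes inside <project>; builds the fat jar during "mvn package" -->
        <build>
            <plugins>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-shade-plugin</artifactId>
                    <version>3.2.4</version>
                    <executions>
                        <execution>
                            <phase>package</phase>
                            <goals>
                                <goal>shade</goal>
                            </goals>
                            <configuration>
                                <filters>
                                    <!-- Drop signature files from signed dependencies; stale signatures can make the merged jar fail verification -->
                                    <filter>
                                        <artifact>*:*</artifact>
                                        <excludes>
                                            <exclude>META-INF/*.SF</exclude>
                                            <exclude>META-INF/*.DSA</exclude>
                                            <exclude>META-INF/*.RSA</exclude>
                                        </excludes>
                                    </filter>
                                </filters>
                            </configuration>
                        </execution>
                    </executions>
                </plugin>
            </plugins>
        </build>

With a configuration along these lines, running mvn package should leave a single jar under target/ that contains the CoreNLP classes and models but not Spark, and that jar is the one to submit to Dataproc.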

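In general this is a best practice, because the "provided" environment should stay as free of version dependencies as possible. Bundling the library yourself lets you bring whichever CoreNLP version you need, or even your own fork of the CoreNLP library, without worrying about version conflicts in the Dataproc environment.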

java

stanford-nlp

maven

google-cloud-platform

google-cloud-dataproc