How to run Mapreduce from within a pig script
I want to understand how to invoke a MapReduce job from within a Pig script.
I referred to this link:
https://wiki.apache.org/pig/NativeMapReduce
but I'm not sure how to go about it, because it isn't clear how Pig would understand my mapper or reducer code. The explanation there is not very clear.
An example would be of great help.
Thanks in advance,
Cheers :)
An example from the Pig documentation:
A = LOAD 'WordcountInput.txt';
B = MAPREDUCE 'wordcount.jar' STORE A INTO 'inputDir' LOAD 'outputDir'
AS (word:chararray, count: int) `org.myorg.WordCount inputDir outputDir`;
In the example above, Pig stores the input data from A into inputDir and loads the job's output data from outputDir.
There must also be a jar named wordcount.jar in HDFS, containing a class org.myorg.WordCount whose main class takes care of setting up the mapper and reducer, the input and output, and so on.
You could equally invoke the MapReduce job directly with hadoop jar mymr.jar org.myorg.WordCount inputDir outputDir.
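For reference, the main class inside wordcount.jar could look like the classic Hadoop WordCount example below. This is only a sketch: the package and class name follow the example above, the mapper/reducer bodies are the standard tutorial versions (assumptions, not anything from the question), and it needs the Hadoop client libraries on the classpath to compile.

```java
package org.myorg;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Emits (word, 1) for every token in the input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Sums the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // This main method is what the MAPREDUCE operator (and `hadoop jar`) invokes;
  // args[0] and args[1] are the inputDir and outputDir passed in the backticks.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Because the job sets up its own input/output formats in main, Pig only needs to know where to STORE the data before the job and where to LOAD it afterwards.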
By default, Pig expects a map/reduce program. However, Hadoop ships with default mapper/reducer implementations, and Pig falls back on those when the map/reduce classes are not recognized.
Furthermore, Pig uses Hadoop's properties along with its own Pig-specific ones. Try setting the properties below inside the Pig script; they should be picked up by Pig as well.
SET mapred.mapper.class="<fully qualified classname for mapper>"
SET mapred.reducer.class="<fully qualified classname for reducer>"
They can also be set with the -Dmapred.mapper.class option. The comprehensive list is here.
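One way to pass these -D flags on the command line is via PIG_OPTS, which the stock bin/pig launcher forwards to the JVM (the class names below are placeholders for your own mapper and reducer, not anything from the question):

```shell
# Assumption: your pig launcher honours PIG_OPTS, as the stock bin/pig does.
export PIG_OPTS="-Dmapred.mapper.class=org.myorg.MyMapper \
  -Dmapred.reducer.class=org.myorg.MyReducer"
pig myscript.pig
```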
Depending on your Hadoop installation, the properties may instead be:
mapreduce.map.class
mapreduce.reduce.class
FYI...
The Hadoop mapred API has been deprecated. Versions before 0.20.1 used mapred; versions after that use mapreduce.
Also, Pig has its own set of properties, which can be viewed with the command pig -help properties.
For example, in my Pig installation these are the properties:
The following properties are supported:
Logging:
verbose=true|false; default is false. This property is the same as -v switch
brief=true|false; default is false. This property is the same as -b switch
debug=OFF|ERROR|WARN|INFO|DEBUG; default is INFO. This property is the same as -d switch
aggregate.warning=true|false; default is true. If true, prints count of warnings
of each type rather than logging each warning.
Performance tuning:
pig.cachedbag.memusage=<mem fraction>; default is 0.2 (20% of all memory).
Note that this memory is shared across all large bags used by the application.
pig.skewedjoin.reduce.memusagea=<mem fraction>; default is 0.3 (30% of all memory).
Specifies the fraction of heap available for the reducer to perform the join.
pig.exec.nocombiner=true|false; default is false.
Only disable combiner as a temporary workaround for problems.
opt.multiquery=true|false; multiquery is on by default.
Only disable multiquery as a temporary workaround for problems.
opt.fetch=true|false; fetch is on by default.
Scripts containing Filter, Foreach, Limit, Stream, and Union can be dumped without MR jobs.
pig.tmpfilecompression=true|false; compression is off by default.
Determines whether output of intermediate jobs is compressed.
pig.tmpfilecompression.codec=lzo|gzip; default is gzip.
Used in conjunction with pig.tmpfilecompression. Defines compression type.
pig.noSplitCombination=true|false. Split combination is on by default.
Determines if multiple small files are combined into a single map.
pig.exec.mapPartAgg=true|false. Default is false.
Determines if partial aggregation is done within map phase,
before records are sent to combiner.
pig.exec.mapPartAgg.minReduction=<min aggregation factor>. Default is 10.
If the in-map partial aggregation does not reduce the output num records
by this factor, it gets disabled.
Miscellaneous:
exectype=mapreduce|local; default is mapreduce. This property is the same as -x switch
pig.additional.jars.uris=<comma seperated list of jars>. Used in place of register command.
udf.import.list=<comma seperated list of imports>. Used to avoid package names in UDF.
stop.on.failure=true|false; default is false. Set to true to terminate on the first error.
pig.datetime.default.tz=<UTC time offset>. e.g. +08:00. Default is the default timezone of the host.
Determines the timezone used to handle datetime datatype and UDFs. Additionally, any Hadoop property can be specified.