Nutch fetch command not fetching data
I have a cluster set up with the following software stack:
nutch-branch-2.3.1
gora-hbase 0.6.1
Hadoop 2.5.2
hbase-0.98.8-hadoop2
So the crawl cycle commands are: inject, generate, fetch, parse, updatedb.
The first two, inject and generate, work fine, but the fetch command (even though it completes successfully) does not fetch any data, and because the fetch step effectively fails, all the subsequent steps fail as well.
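For reference, the cycle I run looks roughly like the sketch below (the urls/ seed directory, crawlId 1, and the topN value are placeholders; exact flags can vary between Nutch 2.x builds):

# Rough sketch of one Nutch 2.x crawl cycle (illustrative, not my exact invocation):
bin/nutch inject urls/ -crawlId 1              # seed the webpage table
bin/nutch generate -topN 50000 -crawlId 1      # mark a batch of URLs for fetching
bin/nutch fetch -all -crawlId 1 -threads 50    # fetch the marked batch (-all or a batch id)
bin/nutch parse -all -crawlId 1                # parse the fetched content
bin/nutch updatedb -all -crawlId 1             # write parse results back to the db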
Please find the counter logs for each job below:
Inject job:
2016-01-08 14:12:45,649 INFO [main] mapreduce.Job: Counters: 31
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=114853
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=836443
HDFS: Number of bytes written=0
HDFS: Number of read operations=2
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Launched map tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=179217
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=59739
Total vcore-seconds taken by all map tasks=59739
Total megabyte-seconds taken by all map tasks=183518208
Map-Reduce Framework
Map input records=29973
Map output records=29973
Input split bytes=94
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=318
CPU time spent (ms)=24980
Physical memory (bytes) snapshot=427704320
Virtual memory (bytes) snapshot=5077356544
Total committed heap usage (bytes)=328728576
injector
urls_injected=29973
File Input Format Counters
Bytes Read=836349
File Output Format Counters
Bytes Written=0
Generate job:
2016-01-08 14:14:38,257 INFO [main] mapreduce.Job: Counters: 50
File System Counters
FILE: Number of bytes read=137140
FILE: Number of bytes written=623942
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=937
HDFS: Number of bytes written=0
HDFS: Number of read operations=1
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Launched map tasks=1
Launched reduce tasks=2
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=43788
Total time spent by all reduces in occupied slots (ms)=305690
Total time spent by all map tasks (ms)=14596
Total time spent by all reduce tasks (ms)=61138
Total vcore-seconds taken by all map tasks=14596
Total vcore-seconds taken by all reduce tasks=61138
Total megabyte-seconds taken by all map tasks=44838912
Total megabyte-seconds taken by all reduce tasks=313026560
Map-Reduce Framework
Map input records=14345
Map output records=14342
Map output bytes=1261921
Map output materialized bytes=137124
Input split bytes=937
Combine input records=0
Combine output records=0
Reduce input groups=14342
Reduce shuffle bytes=137124
Reduce input records=14342
Reduce output records=14342
Spilled Records=28684
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=1299
CPU time spent (ms)=39600
Physical memory (bytes) snapshot=2060779520
Virtual memory (bytes) snapshot=15215738880
Total committed heap usage (bytes)=1864892416
Generator
GENERATE_MARK=14342
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
2016-01-08 14:14:38,429 INFO [main] crawl.GeneratorJob: GeneratorJob: finished at 2016-01-08 14:14:38, time elapsed: 00:01:47
2016-01-08 14:14:38,431 INFO [main] crawl.GeneratorJob: GeneratorJob: generated batch id: 1452242570-1295749106 containing 14342 URLs
Fetch job:
../nutch fetch -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D fetcher.timelimit.mins=180 1452242566-14060 -crawlId 1 -threads 50
2016-01-08 14:14:43,142 INFO [main] fetcher.FetcherJob: FetcherJob: starting at 2016-01-08 14:14:43
2016-01-08 14:14:43,145 INFO [main] fetcher.FetcherJob: FetcherJob: batchId: 1452242566-14060
2016-01-08 14:15:53,837 INFO [main] mapreduce.Job: Job job_1452239500353_0024 completed successfully
2016-01-08 14:15:54,286 INFO [main] mapreduce.Job: Counters: 50
File System Counters
FILE: Number of bytes read=44
FILE: Number of bytes written=349279
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1087
HDFS: Number of bytes written=0
HDFS: Number of read operations=1
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Launched map tasks=1
Launched reduce tasks=2
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=30528
Total time spent by all reduces in occupied slots (ms)=136535
Total time spent by all map tasks (ms)=10176
Total time spent by all reduce tasks (ms)=27307
Total vcore-seconds taken by all map tasks=10176
Total vcore-seconds taken by all reduce tasks=27307
Total megabyte-seconds taken by all map tasks=31260672
Total megabyte-seconds taken by all reduce tasks=139811840
Map-Reduce Framework
Map input records=0
Map output records=0
Map output bytes=0
Map output materialized bytes=28
Input split bytes=1087
Combine input records=0
Combine output records=0
Reduce input groups=0
Reduce shuffle bytes=28
Reduce input records=0
Reduce output records=0
Spilled Records=0
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=426
CPU time spent (ms)=11140
Physical memory (bytes) snapshot=1884893184
Virtual memory (bytes) snapshot=15245959168
Total committed heap usage (bytes)=1751646208
FetcherStatus
HitByTimeLimit-QueueFeeder=0
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
2016-01-08 14:15:54,314 INFO [main] fetcher.FetcherJob: FetcherJob: finished at 2016-01-08 14:15:54, time elapsed: 00:01:11
Please advise.
It has been a while since I used Nutch, but from memory fetched pages have a time-to-live. For example, if you crawl http://helloworld.com today and then issue the fetch command again today, it will probably just finish without fetching anything, because the time-to-live on the URL http://helloworld.com postpones re-fetching for x days (I forget the default time-to-live).
I think you can fix this by clearing the crawl_db and retrying - or there may now be a command that sets the time-to-live to 0.
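If the re-fetch interval is what is holding URLs back, a minimal way to test that theory (assuming the standard db.fetch.interval.default property, whose default is on the order of 30 days, i.e. 2592000 seconds) would be to shrink it for a throwaway run, for example:

# Hypothetical test: lower the re-fetch interval (in seconds) so already-fetched
# URLs become eligible for generate/fetch again almost immediately. Normally this
# lives in conf/nutch-site.xml; here it is passed as a one-off -D override.
bin/nutch generate -D db.fetch.interval.default=60 -topN 1000 -crawlId 1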
Finally, after a few hours of digging, I found that the problem is caused by a bug in Nutch: "The batch id passed to GeneratorJob by option/argument -batchId <id>
is ignored and a generated batch id is used to mark the current batch." It is tracked as https://issues.apache.org/jira/browse/NUTCH-2143. This matches the logs above: the fetch was run with batchId 1452242566-14060, while GeneratorJob actually marked the URLs with batch id 1452242570-1295749106, so the fetcher's map phase found nothing to fetch (Map input records=0).
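Until the fix lands, a workaround sketch (assuming the crawlId and thread count from the run above; not verified on every build) is to stop relying on the batch id passed to generate and instead fetch what the generator actually marked:

# Option 1: reuse the batch id GeneratorJob printed when it finished
#           (1452242570-1295749106 in the log above)
bin/nutch fetch 1452242570-1295749106 -crawlId 1 -threads 50
# Option 2: fetch every generated-but-not-yet-fetched batch
bin/nutch fetch -all -crawlId 1 -threads 50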
Special thanks to andrew-butkus :)