Using Nutch 2.3, all my seed URLs are being rejected
My dmoz/urls file contains 84 URLs.
When I run the command bin/nutch inject dmoz
I get the following output:
[ec2-user@ip-172-31-47-66 local]$ bin/nutch inject dmoz/
InjectorJob: starting at 2015-07-03 02:33:41
InjectorJob: Injecting urlDir: dmoz
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 84
InjectorJob: total number of urls injected after normalization and filtering: 0
Injector: finished at 2015-07-03 02:33:44, elapsed: 00:00:03
Every URL is rejected. Here is a snippet of my nutch/conf/regex-url.xml:
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+/[^/]+/
# accept anything else
+.
Here is the hadoop.log output for this run:
2015-07-03 02:33:41,095 INFO crawl.InjectorJob - InjectorJob: starting at 2015-07-03 02:33:41
2015-07-03 02:33:41,096 INFO crawl.InjectorJob - InjectorJob: Injecting urlDir: dmoz
2015-07-03 02:33:43,301 INFO crawl.InjectorJob - InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
2015-07-03 02:33:43,329 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-07-03 02:33:43,389 WARN snappy.LoadSnappy - Snappy native library not loaded
2015-07-03 02:33:44,278 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2015-07-03 02:33:44,430 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2015-07-03 02:33:44,768 INFO crawl.InjectorJob - InjectorJob: total number of urls rejected by filters: 84
2015-07-03 02:33:44,768 INFO crawl.InjectorJob - InjectorJob: total number of urls injected after normalization and filtering: 0
2015-07-03 02:33:44,769 INFO crawl.InjectorJob - Injector: finished at 2015-07-03 02:33:44, elapsed: 00:00:03
I would be grateful if anyone could help me with this. Basically all of my URLs are being rejected, and I am not sure why.
Thanks,
-Hadi
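OK, after spending a lot of time trying to figure this out: because I had changed conf/regex-urlfilter.txt, I had to rebuild Nutch with "ant runtime", and after that it finally worked. So my conclusion and lesson from the past 2 days is to always rebuild Nutch after changing anything in conf.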
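If you are using the /local runtime environment, you do not need to recompile for every change to the files in conf/.
After you build the Nutch runtime (with ant runtime), the build creates the /local environment under $NUTCH_HOME/runtime/local. Beneath it there is a conf/ directory, which is essentially a copy of $NUTCH_HOME/conf.
However, you can (and should) edit the files there to change the /local configuration.
So, for example, if you want to change the name of your crawler, edit $NUTCH_HOME/runtime/local/conf/nutch-site.xml and add or edit the http.agent.name property, setting it to whatever name you want.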
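Since runtime/local/conf is only a copy made at build time, an edit to $NUTCH_HOME/conf does not reach the local runtime until the next build. As a rough way to check whether the two directories have drifted apart, something like the following shell function could be used. This is a minimal sketch, not part of Nutch: the function name conf_drift is hypothetical, and it assumes the directory layout described above and a diff that supports -r and -q (GNU and BSD diff both do).

```shell
#!/bin/sh
# conf_drift (hypothetical helper, not shipped with Nutch):
# report conf files that differ between the source tree and the
# local runtime. $1 is the Nutch source root, i.e. $NUTCH_HOME.
# Assumes runtime/local/conf already exists (created by the build).
conf_drift() {
    src="$1/conf"
    dst="$1/runtime/local/conf"
    if [ ! -d "$src" ] || [ ! -d "$dst" ]; then
        echo "missing $src or $dst" >&2
        return 2
    fi
    # -r: compare directories recursively
    # -q: print only the names of differing files, not their contents
    # diff exits with 0 when nothing differs and 1 when something does.
    diff -rq "$src" "$dst"
}
```

Running conf_drift "$NUTCH_HOME" after editing a file such as conf/regex-urlfilter.txt should list that file if the copy under runtime/local/conf is still the old one. In that case either rebuild, or make the same edit directly under runtime/local/conf.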