How to crawl a particular website using Apache Nutch?
I have followed the tutorial at the URL below and got through it successfully up to the "Step-by-Step: Invertlinks" part:
https://wiki.apache.org/nutch/NutchTutorial#Crawl_your_first_website
However, I am not getting any data out of it.
I am new to this technology, so if anyone has done this successfully, please share the steps/demo/site/example,
and
please do not give rough steps.
First, install Nutch.
Then, under the configuration in nutch-site.xml, paste:
<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
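For reference, this property must sit inside the top-level <configuration> element; a minimal nutch-site.xml sketch (the agent name is just an example value) looks like:

<?xml version="1.0"?>
<configuration>
  <!-- Required: identifies your crawler to the web servers it visits -->
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
</configuration>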
Then, in your nutch-default.xml, add:
<property>
<name>http.robot.rules.whitelist</name>
<value>nihilent.com</value>
<description>Comma separated list of hostnames or IP addresses to ignore
robot rules parsing for. Use with care and only if you are explicitly
allowed by the site owner to ignore the site's robots.txt!
</description>
</property>
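Note that editing nutch-default.xml directly works, but the usual Nutch convention is to leave that file untouched and put overrides in nutch-site.xml instead; a sketch of the same whitelist entry as an override (the hostname is the example site from above):

<property>
  <name>http.robot.rules.whitelist</name>
  <value>nihilent.com</value>
</property>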
In regex-urlfilter.txt, replace the catch-all rule at the bottom of the file:

# accept anything else
+.

with a rule that limits the crawl to your site:

+^http://([a-z0-9]*\.)*nihilent.com/
And comment out the rule that skips URLs containing query characters:

# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]
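The inject step below expects a seed directory containing a list of start URLs. If you have not created one yet, a minimal setup (the directory name urls and the seed URL are assumptions matching the commands below) is:

mkdir -p urls
echo "http://nihilent.com/" > urls/seed.txt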
Then run the commands below:
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
bin/nutch fetch $s1
bin/nutch parse $s1
bin/nutch updatedb crawl/crawldb $s1
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
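The sequence above performs a single crawl round (one level of links). To follow links discovered during that round, the generate/fetch/parse/updatedb cycle has to be repeated; a rough bash sketch, where the round count of 3 is an arbitrary assumption:

# repeat the crawl cycle; each round fetches newly discovered URLs
for i in 1 2 3; do
  bin/nutch generate crawl/crawldb crawl/segments
  s1=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $s1
  bin/nutch parse $s1
  bin/nutch updatedb crawl/crawldb $s1
done
# rebuild the linkdb from all segments once the rounds finish
bin/nutch invertlinks crawl/linkdb -dir crawl/segments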
Now check the crawl/crawldb folder and the other crawl folders to confirm that data was written successfully.
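A quick way to verify is the crawldb statistics report; if the fetch worked, the status counts (e.g. db_fetched) should be greater than zero:

bin/nutch readdb crawl/crawldb -stats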
Below are some commands that can help you use Nutch in various ways:
- They cover crawling directly from the console, reading and dumping data into big-data storage (HDFS), etc.
- I am listing all the commands I have used; modify them to suit your requirements.
Nutch commands:
bin/nutch inject crawl/crawldb dmoz
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
bin/nutch fetch $s1
bin/nutch parse $s1
bin/nutch updatedb crawl/crawldb $s1
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch commoncrawldump -outputDir hdfs://localhost:9000/dfs -segment /home/lokesh_Kumar/soft/apache-nutch-1.11/crawl/segments/ -jsonArray -reverseKey -SimpleDateFormat -epochFilename
bin/nutch readseg -dump /home/lokesh_Kumar/soft/apache-nutch-1.11/crawl/segments/ /home/lokesh_Kumar/soft/apache-nutch-1.11/ndeploy/1
bin/nutch readseg -get /home/lokesh_Kumar/soft/apache-nutch-1.11/crawl/segments http://1465212304000.html -nofetch -nogenerate -noparse -noparsedata -noparsetext
bin/nutch parsechecker -dumpText http://nihilent.com/
bin/nutch readlinkdb /home/lokesh_Kumar/soft/apache-nutch-1.11/crawl/linkdb -dump /home/lokesh_Kumar/soft/apache-nutch-1.11/ndeploy/Data/Team-A/fileLinkedIn/3
bin/nutch readdb crawl/crawldb -dump /home/lokesh_Kumar/soft/apache-nutch-1.11/ndeploy/Data/Team-A/fileLinkedIn
bin/nutch readdb crawl/crawldb -dump hdfs://localhost:9000/dfs
hadoop fs -copyFromLocal <local-src> <hdfs-dst>
hadoop fs -copyFromLocal /home/lokesh_Kumar/soft/apache-nutch-1.11/ndeploy/data/commoncrawl/com hdfs://localhost:9000/dfs
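As an alternative to running each step by hand, Nutch 1.x also ships a bin/crawl wrapper script that drives the whole inject/generate/fetch/parse/updatedb loop for you; a sketch, assuming the urls seed directory and crawl output directory used above and an arbitrary round count of 3:

bin/crawl urls crawl 3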