How to crawl a particular website using Apache Nutch?
I have followed the tutorial at the URL below and got through it successfully up to the "Step-by-Step: Invertlinks" part:
https://wiki.apache.org/nutch/NutchTutorial#Crawl_your_first_website
However, I am not getting any data out of it.
I am new to this technology, so if anyone has done this successfully, please share the steps/demo/site/example,
and
please do not give rough steps.
First, install Nutch.
Then, under the configuration in nutch-site.xml, paste:
<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
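For reference, this property must sit inside the top-level <configuration> element; a minimal nutch-site.xml sketch (the agent name is just an example value) looks like:

<?xml version="1.0"?>
<configuration>
  <!-- Required: identifies your crawler to the web servers it visits -->
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
</configuration>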
Then, in your nutch-default.xml, add:
<property>
<name>http.robot.rules.whitelist</name>
<value>nihilent.com</value>
<description>Comma separated list of hostnames or IP addresses to ignore
robot rules parsing for. Use with care and only if you are explicitly
allowed by the site owner to ignore the site's robots.txt!
</description>
</property>
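Note that editing nutch-default.xml directly works, but the usual Nutch convention is to leave that file untouched and put overrides in nutch-site.xml instead; a sketch of the same whitelist entry as an override (the hostname is the example site from above):

<property>
  <name>http.robot.rules.whitelist</name>
  <value>nihilent.com</value>
</property>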
In regex-urlfilter.txt, replace the catch-all rule at the bottom of the file:

# accept anything else
+.

with a rule that limits the crawl to your site:

+^http://([a-z0-9]*\.)*nihilent.com/
And comment out the rule that skips URLs containing query characters:

# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]
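The inject step below expects a seed directory containing a list of start URLs. If you have not created one yet, a minimal setup (the directory name urls and the seed URL are assumptions matching the commands below) is:

mkdir -p urls
echo "http://nihilent.com/" > urls/seed.txt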
Then run the commands below:
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
bin/nutch fetch $s1
bin/nutch parse $s1
bin/nutch updatedb crawl/crawldb $s1
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
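The sequence above performs a single crawl round (one level of links). To follow links discovered during that round, the generate/fetch/parse/updatedb cycle has to be repeated; a rough bash sketch, where the round count of 3 is an arbitrary assumption:

# repeat the crawl cycle; each round fetches newly discovered URLs
for i in 1 2 3; do
  bin/nutch generate crawl/crawldb crawl/segments
  s1=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $s1
  bin/nutch parse $s1
  bin/nutch updatedb crawl/crawldb $s1
done
# rebuild the linkdb from all segments once the rounds finish
bin/nutch invertlinks crawl/linkdb -dir crawl/segments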
Now check the crawl/crawldb folder and the other crawl folders to confirm that data was written successfully.
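A quick way to verify is the crawldb statistics report; if the fetch worked, the status counts (e.g. db_fetched) should be greater than zero:

bin/nutch readdb crawl/crawldb -stats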
Below are some commands that can help you use Nutch in various ways:
- They cover crawling directly from the console, reading and dumping data into big-data storage (HDFS), etc.
- I am listing all the commands I have used; modify them to suit your requirements.
Nutch commands:
bin/nutch inject crawl/crawldb dmoz
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
bin/nutch fetch $s1
bin/nutch parse $s1
bin/nutch updatedb crawl/crawldb $s1
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch commoncrawldump -outputDir hdfs://localhost:9000/dfs -segment /home/lokesh_Kumar/soft/apache-nutch-1.11/crawl/segments/ -jsonArray -reverseKey -SimpleDateFormat -epochFilename
bin/nutch readseg -dump /home/lokesh_Kumar/soft/apache-nutch-1.11/crawl/segments/ /home/lokesh_Kumar/soft/apache-nutch-1.11/ndeploy/1
bin/nutch readseg -get /home/lokesh_Kumar/soft/apache-nutch-1.11/crawl/segments http://1465212304000.html -nofetch -nogenerate -noparse -noparsedata -noparsetext
bin/nutch parsechecker -dumpText http://nihilent.com/
bin/nutch readlinkdb /home/lokesh_Kumar/soft/apache-nutch-1.11/crawl/linkdb -dump /home/lokesh_Kumar/soft/apache-nutch-1.11/ndeploy/Data/Team-A/fileLinkedIn/3
bin/nutch readdb crawl/crawldb -dump /home/lokesh_Kumar/soft/apache-nutch-1.11/ndeploy/Data/Team-A/fileLinkedIn
bin/nutch readdb crawl/crawldb -dump hdfs://localhost:9000/dfs
hadoop fs -copyFromLocal <local-src> <hdfs-dst>
hadoop fs -copyFromLocal /home/lokesh_Kumar/soft/apache-nutch-1.11/ndeploy/data/commoncrawl/com hdfs://localhost:9000/dfs
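As an alternative to running each step by hand, Nutch 1.x also ships a bin/crawl wrapper script that drives the whole inject/generate/fetch/parse/updatedb loop for you; a sketch, assuming the urls seed directory and crawl output directory used above and an arbitrary round count of 3:

bin/crawl urls crawl 3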