Extracting all links in a webpage

Question

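I want to extract a list of all Indian government websites for my survey.

The list is available here: http://goidirectory.nic.in/index.php

The problem is that the entries in the list are not ordinary links. Whenever I open one of the websites, it opens a new tab and then redirects from there to the requested website.

As a result, google klipper and other tools that extract links from a website do not work.

I know nothing about JavaScript.

One thing I noticed is that when I hover the mouse pointer over a link, it shows the link of the website, like this:

For example, http://presidentofindia.gov.in appears in the highlight.

I need a list of these website links.

Thanks.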

Answer 1

Hi, please check https://jsfiddle.net/9b0wL9tn/

jQuery

$(document).ready(function () {
    $('a').each(function () {
        console.log($(this).attr('href'));
    });
});

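Note: open the website in Chrome, right-click, choose Inspect, go to the Console tab, paste the code below, and press Enter.

First, run this code in the console to load jQuery: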

var jq = document.createElement('script');
jq.src = "https://ajax.googleapis.com/ajax/libs/jquery/2.1.4/jquery.min.js";
document.getElementsByTagName('head')[0].appendChild(jq);
// ... give time for script to load, then type.
jQuery.noConflict();

Then run this:

// jQuery.noConflict() above releases `$`, so call jQuery by its full name.
jQuery('a').each(function () {
    console.log(jQuery(this).attr('href'));
});

This lists every link on the page; just copy them from the console.
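If loading jQuery into the page is not desirable, the same listing can probably be done with the browser's built-in DOM API. The following is only a sketch: `collectHrefs` is a hypothetical helper name that is not part of the original answer, and it assumes the page's anchors are reachable through `document.querySelectorAll('a')`.

```javascript
// Hypothetical helper (not from the original answer): given an array-like
// collection of anchor elements, return their non-empty href attribute values.
function collectHrefs(anchors) {
  return Array.from(anchors, function (a) {
    return a.getAttribute('href');
  }).filter(function (href) {
    // getAttribute returns null when the attribute is missing.
    return href !== null && href !== '';
  });
}

// Only touch the DOM when running in a browser console.
if (typeof document !== 'undefined') {
  console.log(collectHrefs(document.querySelectorAll('a')).join('\n'));
}
```

Printing everything as one newline-joined string makes the output easier to copy from the console than one log entry per link.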

Update

Following the previous steps, I updated the script. Run the following script in the console:

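// Collect the title attribute of every anchor as a string.
var arr = [];
jQuery('a').each(function (i) {
    arr[i] = jQuery(this).attr('title') + "";
});

// For titles that contain a URL, print the part before the first '-'.
jQuery.each(arr, function (i) {
    if (arr[i].indexOf('http') > -1) {
        console.log(arr[i].substr(0, arr[i].indexOf('-')));
    }
});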

Here is a screenshot: http://www.imageno.com/lj7tuyr9pt2opic.html
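One caveat with cutting the title at the first '-': a URL that itself contains a hyphen would be truncated. A possible alternative, sketched below, is to match the URL with a regular expression instead. This assumes the URL in each title is followed by whitespace (or ends the title), which has not been verified against the actual page; `extractUrl` and `extractUrls` are hypothetical helper names.

```javascript
// Hypothetical helper: return the first http(s) URL found in a title string,
// or null if there is none. Assumes the URL ends at the first whitespace.
function extractUrl(title) {
  if (typeof title !== 'string') return null;
  var match = title.match(/https?:\/\/\S+/);
  return match ? match[0] : null;
}

// Hypothetical helper: extract URLs from many titles, dropping duplicates
// while keeping first-seen order.
function extractUrls(titles) {
  var seen = new Set();
  Array.from(titles).forEach(function (title) {
    var url = extractUrl(title);
    if (url) seen.add(url);
  });
  return Array.from(seen);
}

// Only touch the DOM when running in a browser console.
if (typeof document !== 'undefined') {
  var titles = Array.from(document.querySelectorAll('a[title]'), function (a) {
    return a.getAttribute('title');
  });
  console.log(extractUrls(titles).join('\n'));
}
```

Collecting into a `Set` is a design choice to avoid repeated entries when several anchors point at the same site.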


html

javascript

web-crawler

html-parsing