How to extract a link with Jsoup?
I am using Jsoup to crawl the web and fetch results, and I want to perform a keyword search. For example, I crawl
http://www.business-standard.com/ with the following keywords:
google hyderabad
It should give me the link:
I wrote the code below, but it does not give me the expected results.
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class App {
    public static void main(String[] args) {
        Document doc;
        try {
            doc = Jsoup.connect("http://www.business-standard.com").userAgent("Mozilla").get();
            String title = doc.title();
            System.out.println("title : " + title);

            Elements links = doc.select("a:contains(google)");
            for (Element link : links) {
                System.out.println("\nlink : " + link.attr("href"));
                System.out.println("text : " + link.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
The result is as follows:
title : India News, Latest News Headlines, BSE live, NSE Live, Stock Markets Live, Financial News, Business News & Market Analysis on Indian Economy - Business Standard News
link : /photo-gallery/current-affairs/mumbai-central-turns-into-wi-fi-zone-courtesy-google-power-2574.htm
text : Mumbai Central turns into Wi-Fi zone, courtesy Google power
link : plus.google.com/+businessstandard/posts
text : Google+
Jsoup 1.8.2
Try this URL:
http://www.business-standard.com/search?q=<keyword>
Sample code
import java.io.IOException;
import java.net.URLEncoder;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// ...

Document doc;
try {
    String keyword = "google hyderabad";
    doc = Jsoup //
            .connect("http://www.business-standard.com/search?q=" + URLEncoder.encode(keyword, "UTF-8")) //
            .userAgent("Mozilla") //
            .get();
    String title = doc.title();
    System.out.println("title : " + title);

    Elements links = doc.select("a:contains(google)");
    for (Element link : links) {
        System.out.println("\nlink : " + link.absUrl("href"));
        System.out.println("text : " + link.text());
    }
} catch (IOException e) {
    e.printStackTrace();
}
Output
The link you are looking for is the second one.
title : Search
link : http://www.business-standard.com/article/pti-stories/google-to-invest-more-in-india-set-up-new-campus-115121600841_1.html
text : Google to invest more in India, set up new campus in Hyderabad
link : http://www.business-standard.com/article/companies/google-to-get-7-2-acres-in-hyderabad-it-corridor-for-its-campus-115051201238_1.html
text : Google to get 7.2 acres in Hyderabad IT corridor for its campus
link : http://www.business-standard.com/article/technology/swine-flu-closes-google-hyderabad-office-for-2-days-109071500023_1.html
text : Swine flu closes Google Hyderabad office for 2 days
link : http://www.business-standard.com/article/pti-stories/facebook-posts-strong-4q-as-company-closes-gap-with-google-116012800081_1.html
text : Facebook posts strong 4Q as company closes gap with Google
link : http://www.business-standard.com/article/pti-stories/r-day-bsf-camel-contingent-march-on-google-doodle-116012600104_1.html
text : R-Day: BSF camel contingent marches on Google doodle
link : http://www.business-standard.com/article/international/daimler-ceo-says-apple-google-making-progress-on-car-116012501298_1.html
text : Daimler CEO says Apple, Google making progress on car
link : https://plus.google.com/+businessstandard/posts
text : Google+
Discussion
The sample code above only fetches the first page of results. If you need more results, extract the next-page link (#hpcontentbox div.next-colum > a) and crawl it with Jsoup as well.
You will also notice that the search link above accepts some additional parameters:
itemPerPages: self-explanatory (defaults to 19)
page: index of the search results page (defaults to 1 if not provided)
company-code: ?? (can be left empty)
You can try passing a larger value (100 or more) for itemPerPages in the URL. This may reduce your crawling time.
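As a rough sketch, the search URL with those extra parameters can be built like this (the parameter names itemPerPages and page are taken from the observations above, not from any documented API; the helper name buildSearchUrl is hypothetical):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SearchUrlBuilder {
    // Builds the site's search URL with the keyword and the extra
    // query parameters observed in the search link. These parameter
    // names are assumptions based on the page, not a documented API.
    static String buildSearchUrl(String keyword, int itemsPerPage, int page)
            throws UnsupportedEncodingException {
        return "http://www.business-standard.com/search?q="
                + URLEncoder.encode(keyword, "UTF-8")
                + "&itemPerPages=" + itemsPerPage
                + "&page=" + page;
    }

    public static void main(String[] args) throws Exception {
        // URLEncoder encodes the space as '+', which is valid in a query string.
        System.out.println(buildSearchUrl("google hyderabad", 100, 1));
        // → http://www.business-standard.com/search?q=google+hyderabad&itemPerPages=100&page=1
    }
}
```

The resulting URL can then be passed straight to Jsoup.connect(...) as in the sample code above.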
The absUrl method is used to obtain the absolute URL rather than the relative one.