How to stop my code from returning UnknownHostException when getting search results of a website?

Question

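I wrote a Java program that uses the Jsoup library to search "freewebnovel.com" and print out the search results. It was working pretty consistently until around a week ago, but now it gives a java.net.UnknownHostException every time I run it. I checked the website to see whether anything had changed, but I couldn't find anything. I have added a UserAgent and that didn't really help. I am also curious whether the slash at the end of the link makes a difference.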

import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.Scanner;

import org.jsoup.Connection;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class testingFreeWebnovelSearchWithJsoup {
    public static void main(String[] args){
        try{
            Scanner scan = new Scanner(System.in);
            System.out.println("Type in what you want to search:");
            String searchTerm = scan.nextLine();
            Response response = Jsoup.connect("https://freewebnovel.com/search/")
                    .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36")
                    .timeout(10000)
                    .method(Connection.Method.POST)
                    .data("searchkey", searchTerm)
                    .followRedirects(true)
                    .execute();

            Document doc = response.parse();
            Map<String, String> mapCookies = response.cookies();


            Elements searchResults = doc.select("img[src$=.jpg]");
            List<String> titles = searchResults.eachAttr("title");
            List<String> images = searchResults.eachAttr("src");

            System.out.println(titles);
        } catch(IOException e){
            System.out.println("You had an error: " + e);
        }
    }
}

Answer 1

It was working around a week ago pretty consistently but now it gives out Java.net.UnknownHostException every time I run it.

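That means the DNS name for the site is not resolving. The DNS entry could be missing or stale, or there could be a problem local to you, either in your DNS resolver (e.g. BIND) configuration or in your upstream DNS servers.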
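One way to tell a DNS problem apart from anything Jsoup does is to resolve the host name directly with the JDK. The following is a minimal sketch; the class name `DnsCheck` and the helper `resolves` are hypothetical names chosen for illustration, not part of the original program.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

class DnsCheck {
    /** Returns true if the host name resolves to at least one IP address. */
    static boolean resolves(String host) {
        try {
            InetAddress.getAllByName(host);
            return true;
        } catch (UnknownHostException e) {
            // Same exception type the question reports; here it can only come from name resolution.
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults to the host from the question; pass another host name as the first argument.
        String host = args.length > 0 ? args[0] : "freewebnovel.com";
        System.out.println(host + " resolves: " + resolves(host));
    }
}
```

If this prints `false` for the site but `true` for other hosts, the problem is in name resolution rather than in the scraping code.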

Ah the error is not UnknownHostException, it is:

You had an error: org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=[freewebnovel.com/search]

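That means the server is telling you that access is Forbidden. Note that this is a 403, not a 401: a 401 would mean that you need to authenticate, whereas a 403 means the server understood the request and refuses to fulfil it.

Clearly, the server doesn't want you to fetch that page like that.

(Have you checked whether the 403 response has a body? It might contain a more informative error message.)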
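To look at the body of a 403 response, the client must not throw on error statuses. With Jsoup, I believe this means calling `.ignoreHttpErrors(true)` on the connection and then reading `response.statusCode()` and `response.body()`. The sketch below shows the same idea using only the JDK (Java 11+) so that it runs without a network: it starts a throwaway local server that always answers 403, which is purely a stand-in for the real site. The class name, method names and the message text are all hypothetical.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class ForbiddenBodyDemo {
    /** Fetches the URL and returns "status: body"; does not throw on 4xx/5xx statuses. */
    static String fetchStatusAndBody(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode() + ": " + response.body();
    }

    /** Starts a local stand-in server that answers every request with 403 and a short message. */
    static HttpServer startForbiddenServer() throws IOException {
        // Port 0 lets the operating system pick a free port.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            byte[] body = "Automated access is not allowed".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(403, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = startForbiddenServer();
        try {
            int port = server.getAddress().getPort();
            // Prints the status code followed by whatever explanation the server sent.
            System.out.println(fetchStatusAndBody("http://127.0.0.1:" + port + "/search/"));
        } finally {
            server.stop(0);
        }
    }
}
```

Pointing `fetchStatusAndBody` at the real URL would show whether the site explains why it is refusing the request.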

I have added a UserAgent and that didn't really help.

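There are various ways a website can detect people doing web scraping. Spoofing the "UserAgent" field to pretend that your client is a browser is one of the easier things to detect.

(I don't think it is a good idea to give you a tutorial on how to scrape a website that doesn't want to be scraped.)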

I am also curious if the slash at the end of the link makes a difference.

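I doubt it. When I access the site from a browser, the trailing slash makes no difference.

Now, I checked the site's Terms and Conditions page, and it doesn't mention web scraping. In addition, the site's robots.txt seems to say that robots are welcome everywhere.

But the T&C page has spelling errors and so on, so my guess is that it was thrown together in a hurry and may not reflect the site owners' current wishes about scraping.

There is a contact email for the site. So my advice is to email them, explain what you are doing (and why!), and ask them how to proceed. If they don't want you to scrape their site, they should tell you. (And then you should stop trying!)

But note that the fact that this used to work and now gives a 403 may mean that they have seen scraping activity by you and others and are trying to block it. (And they have not yet updated the T&Cs to reflect their wishes.)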

java

jsoup