Elixir script using Floki and HTTPotion fails to parse URL
I am trying to write a script that scrapes the text of Wikipedia articles using Floki and HTTPotion. My failing code looks like this:
defmodule Scraper do
  def start do
    base = "https://en.wikipedia.org"
    response = HTTPotion.get(base <> "/wiki/Main_Page")
    html = response.body
    main_bg = Floki.find(html, ".MainPageBG")

    main_bg
    |> Floki.find("table tr li a")
    |> Floki.attribute("href")
    |> Enum.map(fn(addr) -> HTTPotion.get(base <> addr) end)
  end
end
I was going by this example from the Floki README:
html
|> Floki.find(".pages a")
|> Floki.attribute("href")
|> Enum.map(fn(url) -> HTTPoison.get!(url) end)
When I pipe the result into Floki.attribute("href"), I get a nice list of URL paths, such as:
["/wiki/Japanese_aircraft_carrier_Hiry%C5%ABwow",
"/wiki/Boys_Don%27t_Cry_(film)wow", "/wiki/Elias_Abraham_Rosenbergwow",
"/wiki/Japanese_aircraft_carrier_Hiry%C5%ABwow",
"/wiki/Boys_Don%27t_Cry_(film)wow", "/wiki/Elias_Abraham_Rosenbergwow",
"/wiki/Wikipedia:Today%27s_featured_article/November_2015wow",
"https://lists.wikimedia.org/mailman/listinfo/daily-article-lwow",
"/wiki/Wikipedia:Featured_articleswow", "/wiki/Schloss_Krobnitzwow",
"/wiki/Prussiawow", "/wiki/Albrecht_von_Roonwow", "/wiki/Harry_Winerwow",
"/wiki/Rob_Thomas_(writer)wow", "/wiki/Of_Vice_and_Menwow",
"/wiki/Veronica_Marswow", "/wiki/Meithalunwow", "/wiki/Palestinian_peoplewow",
"/wiki/Marj_Sanurwow", "/wiki/Soma_Norodomwow",...]
However, when the line |> Enum.map(fn(addr) -> HTTPotion.get(base <> addr) end) runs, I get this error:
** (HTTPotion.HTTPError) {:url_parsing_failed, {:error, :invalid_uri}}
(httpotion) lib/httpotion.ex:209: HTTPotion.handle_response/1
(elixir) lib/enum.ex:977: anonymous fn/3 in Enum.map/2
(elixir) lib/enum.ex:1261: Enum."-reduce/3-lists^foldl/2-0-"/3
(elixir) lib/enum.ex:977: Enum.map/2
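I can see :url_parsing_failed, but I don't understand why. When I try Enum.map(fn(addr) -> HTTPotion.get(base <> addr) end) with individual URL paths from the list, they all work.
- Is my syntax wrong?
- Am I missing something about how pipes or Enum work?
- Am I on the right track?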
Based on manukall's answer, here is what works:
defmodule Scraper do
  # Paths starting with "/" are relative to the site root, so prepend the base.
  def transform_url(url_or_path = "/" <> _, base), do: base <> url_or_path
  # Anything else is treated as an absolute URL and returned unchanged.
  def transform_url(url, _base), do: url

  def start do
    base = "https://en.wikipedia.org"
    response = HTTPotion.get(base <> "/wiki/Main_Page")
    html = response.body
    main_bg = Floki.find(html, ".MainPageBG")

    main_bg
    |> Floki.find("table tr li a")
    |> Floki.attribute("href")
    |> Enum.map(fn(url) -> transform_url(url, base) end)
    |> Enum.map(fn(url) -> HTTPotion.get(url) end)
  end
end
If you take another close look at the list of URLs, you'll notice that it contains one absolute URL: "https://lists.wikimedia.org/mailman/listinfo/daily-article-lwow". This won't work with HTTPotion.get(base <> addr), because it will end up requesting a URL like "https://en.wikipedia.orghttps://lists.wikimedia.org/mailman/listinfo/daily-article-lwow".
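The effect of blind concatenation can be reproduced without any network access. The following is a minimal sketch, assuming two shortened sample hrefs (one relative path, one absolute URL) rather than the full list from the question; the "more than one ://" check is only an illustrative heuristic, not something HTTPotion does.

```elixir
base = "https://en.wikipedia.org"

# Shortened sample values standing in for the hrefs Floki returned (assumed for illustration).
hrefs = [
  "/wiki/Prussia",
  "https://lists.wikimedia.org/mailman/listinfo/daily-article-l"
]

# Blind concatenation, as in the failing Enum.map call.
urls = Enum.map(hrefs, fn addr -> base <> addr end)

# Heuristic: a second "://" means two URLs were glued together.
suspicious =
  Enum.filter(urls, fn url -> length(String.split(url, "://")) > 2 end)

IO.inspect(urls, label: "concatenated")
IO.inspect(suspicious, label: "suspicious")
```

Only the entry built from the absolute href ends up in the suspicious list, which is why requesting the paths one at a time appeared to work.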
One way to solve this is to write another function, transform_url, that checks whether the value starts with / and only then prepends the base URL:
def transform_url(url_or_path = "/" <> _, base), do: base <> url_or_path
def transform_url(url, _base), do: url
You can then use it like this:
...
|> Enum.map(fn(url) -> HTTPoison.get!(transform_url(url, base)) end)
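To check the two clauses in isolation, here is a small sketch that runs transform_url over sample hrefs without making any HTTP requests. The module name UrlSketch, the helper merge_url, and the sample hrefs are assumptions made for illustration. merge_url shows a possible alternative based on the standard library's URI.merge/2, which resolves a reference against a base and leaves absolute URLs alone; it is not part of the original answer.

```elixir
defmodule UrlSketch do
  # Paths starting with "/" are relative to the site root: prepend the base.
  def transform_url("/" <> _ = path, base), do: base <> path
  # Anything else is assumed to be an absolute URL and is returned unchanged.
  def transform_url(url, _base), do: url

  # Hypothetical alternative: let the standard library resolve the href.
  def merge_url(href, base), do: base |> URI.merge(href) |> URI.to_string()
end

base = "https://en.wikipedia.org"

# Shortened sample values (assumed for illustration).
hrefs = [
  "/wiki/Prussia",
  "https://lists.wikimedia.org/mailman/listinfo/daily-article-l"
]

resolved = Enum.map(hrefs, fn href -> UrlSketch.transform_url(href, base) end)
merged = Enum.map(hrefs, fn href -> UrlSketch.merge_url(href, base) end)

IO.inspect(resolved, label: "transform_url")
IO.inspect(merged, label: "URI.merge")
```

The pattern-matching version is the simpler of the two, but it passes through anything that does not start with "/", including hrefs such as "#top" or "mailto:..." links, so some extra filtering may be needed before issuing requests.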