Sorting strings with regard to Unicode
I have a list that I want to sort alphabetically, but with regard to Unicode:
iex(2)> ["lubelskie", "łódzkie", "mazowieckie", "zachodniopomorskie"] |> Enum.sort
["lubelskie", "mazowieckie", "zachodniopomorskie", "łódzkie"]
# the above is wrong, it should be:
["lubelskie", "łódzkie", "mazowieckie", "zachodniopomorskie"]
How can I achieve this in Elixir? Using a Hex package is acceptable.
So far, since the alphabet in use is well defined, I ended up writing my own sorting function:
defp numeric_for_sort(string) do
  letters = ["a", "ą", "b", "c", "ć", "d", "e", "ę", "f", "g", "h", "i", "j", "k", "l", "ł",
             "m", "n", "ń", "o", "ó", "p", "q", "r", "s", "ś", "t", "u", "w", "y", "z", "ź", "ż"]
  String.graphemes(string)
  |> Enum.map(fn(x) -> Enum.find_index(letters, fn(y) -> x == y end) end)
end
And then:
Enum.sort(["lubelskie", "mazowieckie", "zachodniopomorskie", "łódzkie"], &(numeric_for_sort(&1["name"]) <= numeric_for_sort(&2["name"])))
Far from perfect, but it works.
It doesn't work for me:
my.exs:
defmodule Stuff do
  def numeric_for_sort(string) do
    letters = ["a", "ą", "b", "c", "ć", "d", "e", "ę", "f", "g", "h", "i", "j", "k", "l", "ł",
               "m", "n", "ń", "o", "ó", "p", "q", "r", "s", "ś", "t", "u", "w", "y", "z", "ź", "ż"]
    String.graphemes(string)
    |> Enum.map(fn(x) -> Enum.find_index(letters, fn(y) -> x == y end) end)
  end
end
^C~/elixir_programs$ iex my.exs
Erlang/OTP 20 [erts-9.3] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:10] [hipe] [kernel-poll:false]
Interactive Elixir (1.6.6) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> Enum.sort(["lubelskie", "mazowieckie", "zachodniopomorskie", "łódzkie"], &(Stuff.numeric_for_sort(&1["name"]) <= Stuff.numeric_for_sort(&2["name"])))
** (FunctionClauseError) no function clause matching in Access.get/3
The following arguments were given to Access.get/3:
# 1
"lubelskie"
# 2
"name"
# 3
nil
(elixir) lib/access.ex:306: Access.get/3
(stdlib) erl_eval.erl:670: :erl_eval.do_apply/6
(stdlib) erl_eval.erl:878: :erl_eval.expr_list/6
(stdlib) erl_eval.erl:404: :erl_eval.expr/5
(stdlib) erl_eval.erl:469: :erl_eval.expr/5
(stdlib) lists.erl:969: :lists.sort/2
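The call fails with (FunctionClauseError) no function clause matching in Access.get/3, because &1["name"] applies the Access syntax to a plain string such as "lubelskie" rather than to a map or keyword list.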
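A minimal sketch of two ways to make the comparator and the data agree. It assumes the comparator was originally written for maps with a "name" key, which the question does not actually show; the module name SortFix and the variables below are hypothetical.

```elixir
defmodule SortFix do
  # Alphabet order as listed in the question.
  @letters ~w(a ą b c ć d e ę f g h i j k l ł m n ń o ó p q r s ś t u w y z ź ż)

  # Maps each grapheme to its index in @letters (nil for graphemes not listed).
  def numeric_for_sort(string) do
    string
    |> String.graphemes()
    |> Enum.map(fn g -> Enum.find_index(@letters, &(&1 == g)) end)
  end
end

names = ["lubelskie", "mazowieckie", "zachodniopomorskie", "łódzkie"]

# Option 1: the list holds plain strings, so compare them without ["name"].
sorted_names =
  Enum.sort(names, &(SortFix.numeric_for_sort(&1) <= SortFix.numeric_for_sort(&2)))

# Option 2: the list holds maps with a "name" key, so ["name"] is a valid access.
regions = Enum.map(names, &%{"name" => &1})

sorted_regions =
  Enum.sort(
    regions,
    &(SortFix.numeric_for_sort(&1["name"]) <= SortFix.numeric_for_sort(&2["name"]))
  )
```

Either way, the element type in the list has to match what the comparator dereferences.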
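Also, I don't think you want to use a list of letters, because then you have to keep traversing the list to look up each letter. That is what maps are for. (Edit: well, what do I know: small maps are ordered lists when the map has <= 31 entries.) So, something like this: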
letters = ["a", "ą", "b", "c", "ć", "d", "e", "ę", "f", "g", "h", "i", "j", "k", "l", "ł",
"m", "n", "ń", "o", "ó", "p", "q", "r", "s", "ś", "t", "u", "w", "y", "z", "ź", "ż"]
letter_rank = Map.new Enum.with_index letters
String.graphemes(string)
|> Enum.map(fn(x) -> letter_rank[x] end)
Then:
iex(1)> names = ["lubelskie", "łódzkie", "mazowieckie", "zachodniopomorskie"]
["lubelskie", "łódzkie", "mazowieckie", "zachodniopomorskie"]
iex(2)> Enum.sort_by names, &Stuff.numeric_for_sort/1
["lubelskie", "łódzkie", "mazowieckie", "zachodniopomorskie"]
iex(3)>
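For reference, the map-based lookup could be assembled into a complete module roughly as follows. This is only a sketch: the names PolishSort, rank_key/1 and sort/1 are hypothetical, and building the map once in a module attribute at compile time is my assumption rather than something the answer prescribes.

```elixir
defmodule PolishSort do
  # Alphabet order as listed in the question.
  @letters ~w(a ą b c ć d e ę f g h i j k l ł m n ń o ó p q r s ś t u w y z ź ż)

  # Built once at compile time: %{"a" => 0, "ą" => 1, ...}
  @letter_rank Map.new(Enum.with_index(@letters))

  # Turns a string into a list of ranks; graphemes not in the map become nil.
  def rank_key(string) do
    string
    |> String.graphemes()
    |> Enum.map(&Map.get(@letter_rank, &1))
  end

  # sort_by/2 computes rank_key/1 once per element, then compares the keys.
  def sort(strings), do: Enum.sort_by(strings, &rank_key/1)
end

sorted = PolishSort.sort(["zachodniopomorskie", "łódzkie", "mazowieckie", "lubelskie"])
```

Note that the lookup only matches lowercase, precomposed letters; uppercase or decomposed input would need to be normalized first.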
According to the Enum.sort_by/3 docs:
sort_by/3 differs from sort/2 in that it only calculates the comparison value for each element in the enumerable once instead of
once for each element in each comparison. If the same function is
being called on both elements, it’s also more compact to use
sort_by/3.
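A sort performs many comparisons, so having the sorting algorithm recompute each name's list of numbers on every comparison is clearly not ideal.
Note that even though this line: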
Enum.sort_by names, &Stuff.numeric_for_sort/1
looks like a call to sort_by/2, it actually calls sort_by/3, whose third argument defaults to &<=/2.
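To illustrate the default argument, here is a small sketch that spells the sorter out. String.length/1 is used as the key function purely for illustration. In newer Elixir releases the documented default is the atom :asc, which orders the same way as &<=/2.

```elixir
names = ["lubelskie", "łódzkie", "mazowieckie", "zachodniopomorskie"]

# Two-argument form: the sorter is filled in by default.
implicit = Enum.sort_by(names, &String.length/1)

# Same call with the sorter written explicitly.
explicit = Enum.sort_by(names, &String.length/1, &<=/2)

# Passing &>=/2 instead sorts the keys in descending order.
descending = Enum.sort_by(names, &String.length/1, &>=/2)
```

Both the implicit and the explicit call return the same list, ordered from the shortest name to the longest.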
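The proper way to handle sorting is to bring all characters to the decomposed Unicode form (NFD) and then sort. The problem is that, for some reason, "ł" is not treated as a composed form: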
letters
|> Enum.map(&:unicode.characters_to_nfd_binary/1)
|> Enum.map(&String.codepoints/1)
#⇒ [
# ["a"],
# ["a", "̨"],
# ["b"],
# ["c"],
# ["c", "́"],
# ["d"],
# ["e"],
# ["e", "̨"],
# ["f"],
# ["g"],
# ["h"],
# ["i"],
# ["j"],
# ["k"],
# ["l"],
# ["ł"],
# ["m"],
# ["n"],
# ["n", "́"],
# ["o"],
# ["o", "́"],
# ["p"],
# ["q"],
# ["r"],
# ["s"],
# ["s", "́"],
# ["t"],
# ["u"],
# ["w"],
# ["y"],
# ["z"],
# ["z", "́"],
# ["z", "̇"]
# ]
I don't know why "ł" is not declared as a composed letter, and I think it is a bug in the Consortium's files. Anyway, we can trick the sorter:
["lubelskie", "łódzkie", "mazowieckie", "zachodniopomorskie"]
|> Enum.map(&:unicode.characters_to_nfd_binary/1)
|> Enum.map(&String.replace(&1, "ł", "l\uFFFD"))
|> Enum.sort()
|> Enum.map(&String.replace(&1, "l\uFFFD", "ł"))
#⇒ ["lubelskie", "łódzkie", "mazowieckie", "zachodniopomorskie"]
Now it works with any input, both composed and decomposed.
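The same trick can be sketched as a sort key, so the original strings are returned untouched and no second replacement pass is needed. The module name NfdSort and the choice of U+FFFD as the sentinel are assumptions on my part; any code point that sorts after "z" and after the combining marks should behave the same for this data.

```elixir
defmodule NfdSort do
  # Assumed sentinel: "l" followed by U+FFFD, which sorts after "lz" and before "m".
  @sentinel "l\u{FFFD}"

  # Sort key: NFD form, with "ł" rewritten so it lands right after all "l" words.
  def sort_key(string) do
    string
    |> :unicode.characters_to_nfd_binary()
    |> String.replace("ł", @sentinel)
  end

  # The keys are only used for comparison; the input strings come back unchanged.
  def sort(strings), do: Enum.sort_by(strings, &sort_key/1)
end

sorted = NfdSort.sort(["zachodniopomorskie", "łódzkie", "mazowieckie", "lubelskie"])
```

This still compares bytes after normalization, so it is not a full locale-aware collation; uppercase letters, for instance, sort before all lowercase ones.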