Test failing for unexpected Collections.sort behavior

Question

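Please note: I mention JUnit here and provide an SSCCE code example that uses it, but this is at heart a Java Collections question that anyone with Java experience can answer, whether or not they have used JUnit.

Java 8 here. I'm trying to sort a list of strings, but Collections.sort(myList) is behaving unexpectedly and I'd like to understand what is going on.

Here is my complete unit test: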

@RunWith(MockitoJUnitRunner.class)
public class SorterTest {

    @Test
    public void should_sort_correctly_including_capitalization_rules() {

        // given
        String[] actualNames = new String[] {
            "DCME",
            "CCME",
            "ACME",
            "BCME",
            "AGME",
            "AACME",
            "aCME",
            "Acme",
            "AaCME",
            "aACME",
        };
        List<String> actual = Arrays.asList(actualNames);

        // the order I would *expect* them to sort into...
        String[] expectedNames = new String[] {
                "aACME",
                "aCME",
                "AaCME",
                "AACME",
                "Acme",
                "ACME",
                "AGME",
                "BCME",
                "CCME",
                "DCME"
        };
        List<String> expected = Arrays.asList(expectedNames);

        // when
        Collections.sort(actual);

        // then
        assertTrue(actual.equals(expected));

    }

}

The JUnit assertTrue here fails at runtime because the actual list is sorted as:

0 = "AACME"
1 = "ACME"
2 = "AGME"
3 = "AaCME"
4 = "Acme"
5 = "BCME"
6 = "CCME"
7 = "DCME"
8 = "aACME"
9 = "aCME"

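That ^^^ is debugger output; the numbers are the list index of each element.

So for some reason Collections.sort is saying that the string "BCME" is lexicographically "lower" (appears earlier in the sorted list) than "aCME", which seems crazy to me. :-)

I should mention that I'm only dealing with ASCII characters in UTF-8 here, but my app will perform pre-validation to ensure that all characters in each string/name are within [a-z] or [A-Z].

Either way, the sort rules I'm looking for the Java code to use are:

- When I say "lower" I mean "will appear earlier in the sorted list", and when I say "higher" I mean "will appear later in the sorted list"
  - hence I would say "3 is lower than 43" because in a sorted list of integers, 3 would appear earlier in that list than 43, etc.
- Lowercase letters are lower than uppercase letters; so "a" should come before "A"
  - hence all the letters are ordered as aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ
- Shorter words come before longer words, provided that they are the same (capitalization included) subsets of the longer words
  - "but" is lower than (comes before) "butterfly"
  - "butterfly" is lower than "But" (b < B)
  - "butterfly" is lower than "bUt" (b and b are the same, but u < U)

According to these sort rules, my unit test list should be sorted as: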

Sort Order   Reason why it comes after the last one in the list
================================================================
aACME        
aCME         1st letter is 'a' but 2nd letter is 'C' and A < C
AaCME        1st letter is 'A' and a < A
AACME        1st letter is 'A' and 2nd letter is 'A' and a < A
Acme         1st letter is 'A' but 2nd letter is 'c' and A < c
ACME         1st letter is 'A' but 2nd letter is 'C' and c < C
AGME         1st letter is 'A' but 2nd letter is 'G' and C < G
BCME         1st letter is 'B' and aA < bB
CCME         1st letter is 'C' and bB < cC
DCME         1st letter is 'D' and cC < dD

How can I change my code above so that the unit test passes and the list sorts the way I need it to?

Answer 1

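Write a Comparator that sorts things the way you want. We won't write it for you, but it should be simple to have the comparator map/translate the strings into corresponding sort keys...

For example, assuming only [A-Za-z], map: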

a->0x00
A->0x01 
b->0x02
B->0x03

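and so on.

Keep in mind that the comparator will visit elements multiple times, so if the data is large enough (say, more than 10⁶ strings) and performance is a concern, you may have to cache the sort keys.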
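
For background, String's natural ordering compares UTF-16 code unit values, so every uppercase ASCII letter (65 to 90) sorts before every lowercase one (97 to 122); that is why the default sort puts "BCME" ahead of "aCME". The mapping above could be sketched roughly as follows. This is only a minimal sketch under the assumption that every string contains nothing but [A-Za-z]; the class name SortKeys and the methods sortKey and sortLowerFirst are hypothetical names chosen for illustration. Each string is translated once into a key (a -> 0, A -> 1, b -> 2, ...), the keys are cached in a map, and the list is sorted by comparing the cached keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SortKeys {

    // Translate a string into its sort key: a->0, A->1, b->2, B->3, ...
    // Assumes every character is in [A-Za-z]; this is not validated here.
    static String sortKey(String s) {
        StringBuilder key = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            int letter = Character.toLowerCase(c) - 'a';     // 0..25
            int caseBit = Character.isUpperCase(c) ? 1 : 0;  // lowercase sorts first
            key.append((char) (2 * letter + caseBit));
        }
        return key.toString();
    }

    // Return a sorted copy of the input, computing each key only once.
    static List<String> sortLowerFirst(List<String> names) {
        Map<String, String> keys = new HashMap<>();
        for (String name : names) {
            keys.computeIfAbsent(name, SortKeys::sortKey);
        }
        List<String> sorted = new ArrayList<>(names);
        // Plain String.compareTo on the keys now yields the aAbBcC... order,
        // and a key that is a prefix of another sorts first.
        sorted.sort((x, y) -> keys.get(x).compareTo(keys.get(y)));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
            "DCME", "CCME", "ACME", "BCME", "AGME",
            "AACME", "aCME", "Acme", "AaCME", "aACME");
        // prints [aACME, aCME, AaCME, AACME, Acme, ACME, AGME, BCME, CCME, DCME]
        System.out.println(sortLowerFirst(names));
    }
}
```

Whether the cache pays off depends on the data size; for a handful of strings, computing the key inside the comparator would be simpler.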

Answer 2

Assuming you only have letters, you could define a comparator something like this:

Comparator<String> comparator = (a, b) -> {
    // Compare the characters pairwise.
    for (int i = 0, m = Math.min(a.length(), b.length()); i < m; ++i) {
      char aa = a.charAt(i);
      char bb = b.charAt(i);
      // If the letters differ even when case is ignored, the alphabetically first letter comes first.
      char la = Character.toLowerCase(aa);
      char lb = Character.toLowerCase(bb);
      if (la != lb) {
        return la < lb ? -1 : 1;
      }

      // Same letter but different case: say that the lowercase comes first.
      if (aa != bb) {
        return Character.isLowerCase(aa) ? -1 : 1;
      }
    }
    // If the pair-wise comparison doesn't find a difference, say the shortest one is first; or they are equal if the same length.
    return Integer.compare(a.length(), b.length());
};
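
To plug a comparator of this kind into the test from the question, it can be passed as the second argument to Collections.sort (or to List.sort). The following is a minimal sketch rather than a definitive implementation: the class name LowerFirstDemo, the constant LOWER_FIRST and the method sortedNames are hypothetical, and the comparator assumes the strings contain only letters in [A-Za-z]. It orders by letter first (ignoring case) and, for the same letter, puts lowercase before uppercase, which is the aAbBcC... order the question asks for.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class LowerFirstDemo {

    // Hypothetical name; assumes only [A-Za-z] characters.
    static final Comparator<String> LOWER_FIRST = (a, b) -> {
        int m = Math.min(a.length(), b.length());
        for (int i = 0; i < m; i++) {
            char aa = a.charAt(i);
            char bb = b.charAt(i);
            // Different letters: alphabetical order decides, regardless of case.
            int byLetter = Character.compare(
                Character.toLowerCase(aa), Character.toLowerCase(bb));
            if (byLetter != 0) {
                return byLetter;
            }
            // Same letter, different case: lowercase comes first.
            if (aa != bb) {
                return Character.isLowerCase(aa) ? -1 : 1;
            }
        }
        // One string is a prefix of the other: the shorter one comes first.
        return Integer.compare(a.length(), b.length());
    };

    // Sorts the names from the question with LOWER_FIRST and returns them.
    static List<String> sortedNames() {
        List<String> actual = Arrays.asList(
            "DCME", "CCME", "ACME", "BCME", "AGME",
            "AACME", "aCME", "Acme", "AaCME", "aACME");
        Collections.sort(actual, LOWER_FIRST);  // instead of Collections.sort(actual)
        return actual;
    }

    public static void main(String[] args) {
        // prints [aACME, aCME, AaCME, AACME, Acme, ACME, AGME, BCME, CCME, DCME]
        System.out.println(sortedNames());
    }
}
```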

Answer 3

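Java has the class RuleBasedCollator, which allows customizing the sorting/ordering of characters.

In this case, lowercase letters should come before uppercase letters, so the rules could look like this: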

static RuleBasedCollator lowerFirst() {
    try {
        return new RuleBasedCollator(
            "< a < A < b < B < c < C < d < D < e < E < f < F < g < G < h < H < i < I < j < J < "
            + "k < K < l < L < m < M < n < N < o < O < p < P < q < Q < r < R < s < S < t < T < "
            + "u < U < v < V < w < W < x < X < y < Y < z < Z"
        );
    } catch (ParseException parsex) {
        throw new IllegalArgumentException("Failed to create lowerFirst collator", parsex);
    }
}

Test:

String[] names = new String[] {
    "DCME",  "CCME", "ACME", "BCME",  "AGME",
    "AACME", "aCME", "Acme", "AaCME", "aACME",
};

String[] expected = new String[] {
    "aACME", "aCME", "AaCME", "AACME", "Acme",
    "ACME", "AGME", "BCME", "CCME", "DCME"
};
        
Arrays.sort(names, lowerFirst());

System.out.println("sorted:   " + Arrays.toString(names));
System.out.println("expected: " + Arrays.toString(expected));

Output:

sorted:   [aACME, aCME, AaCME, AACME, Acme, ACME, AGME, BCME, CCME, DCME]
expected: [aACME, aCME, AaCME, AACME, Acme, ACME, AGME, BCME, CCME, DCME]
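
If the list is large, java.text.Collator also offers getCollationKey, which precomputes a comparable key per string so that each string is analyzed only once. The sketch below is an assumption about how that might be combined with rules like the ones above; the class name CollationKeyDemo and the method sortWithKeys are hypothetical, and the rule string is generated in a loop purely to keep the example short.

```java
import java.text.CollationKey;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Arrays;

public class CollationKeyDemo {

    // Builds "< a < A < b < B ... < z < Z": each lowercase letter
    // sorts just before its uppercase form.
    static RuleBasedCollator lowerFirst() {
        StringBuilder rules = new StringBuilder();
        for (char c = 'a'; c <= 'z'; c++) {
            rules.append("< ").append(c)
                 .append(" < ").append(Character.toUpperCase(c)).append(' ');
        }
        try {
            return new RuleBasedCollator(rules.toString().trim());
        } catch (ParseException parsex) {
            throw new IllegalArgumentException("Failed to create lowerFirst collator", parsex);
        }
    }

    // Returns a sorted copy of names, comparing precomputed collation keys.
    static String[] sortWithKeys(String[] names) {
        RuleBasedCollator collator = lowerFirst();
        CollationKey[] keys = new CollationKey[names.length];
        for (int i = 0; i < names.length; i++) {
            keys[i] = collator.getCollationKey(names[i]);
        }
        Arrays.sort(keys);  // CollationKey implements Comparable<CollationKey>
        String[] sorted = new String[names.length];
        for (int i = 0; i < keys.length; i++) {
            sorted[i] = keys[i].getSourceString();
        }
        return sorted;
    }

    public static void main(String[] args) {
        String[] names = {
            "DCME", "CCME", "ACME", "BCME", "AGME",
            "AACME", "aCME", "Acme", "AaCME", "aACME",
        };
        System.out.println("sorted: " + Arrays.toString(sortWithKeys(names)));
    }
}
```

For a short array like the one in the test, passing the collator straight to Arrays.sort is simpler; keys are mainly worth it when the same strings are compared many times.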
