BFS 五字母单词链

Question

我需要一些关于 BFS 字链作业的帮助。单词链基于五个字母的单词，当单词 x 的最后四个字母在单词 y 中时，两个单词连接在一起。例如 climb 和 blimp 是相连的，因为 climb 中的 l、i、m 和 b 在单词 blimp 中。

建议使用 Sedgewick 算法第 4 版中的定向 BFS 或对其进行修改。可以在此处找到代码：https://algs4.cs.princeton.edu/40graphs/ 并使用以下代码读取列表单词的数据文件：

BufferedReader r =
    new BufferedReader(new InputStreamReader(new FileInputStream(fnam)));
ArrayList<String> words = new ArrayList<String>();
while (true) {
    String word = r.readLine();
    if (word == null) { break; }
    assert word.length() == 5;  // inputcheck, if you run with assertions
    words.add(word);
}

以及从文件中读取测试用例的以下代码：

BufferedReader r = 
    new BufferedReader(new InputStreamReader(new FileInputStream(fnam)));
while (true) {
    String line = r.readLine();
    if (line == null) { break; }
    assert line.length() == 11; // inputcheck, if you run with assertions
    String start = line.substring(0, 5);
    String goal = line.substring(6, 11);
    // ... search path from start to goal here
}

数据文件中的单词是：

their
moist
other
blimp
limps
about
there
pismo
abcde
bcdez
zcdea
bcdef
fzcde

当使用测试用例文件时...

other there
other their
their other
blimp moist
limps limps
moist limps
abcde zcdea

...输出应该是每个单词对之间的边数，如果单词之间没有路径则为-1。

我刚接触图形，我不确定如何使用 Sedgewick 的 BFS 并修改它以读取测试用例文件。感谢任何帮助。

Answer 1

我们假设 n 是数据集中的单词数。

首先，我们需要根据给定的条件为上述所有单词建立一个邻接表，即x和y之间存在一条边当且仅当最后四个字母x 出现在 y 中。构建这个邻接表是一个 O(n^2 * w) 操作，其中 w 是数据集中每个单词的平均大小。

其次，我们所要做的就是对测试数据进行传统的 BFS。

这是 main 函数：

    public static void main(String[] args) throws IOException {
        // get words from dataset
        List<String> words = readData();
        // get the word pairs to test
        List<List<String>> testData = getTestData();
        // form an adjacency list
        Map<String, List<String>> adj = getAdjacencyList(words);
        
        // for each test, do a traditional BFS
        for (List<String> test : testData) {
            System.out.println(bfs(adj, test));
        }
    }

下面是根据给定条件构建邻接表的函数：

    public static Map<String, List<String>> getAdjacencyList(List<String> words) {
        Map<String, List<String>> adj = new HashMap<>();
        for (int i = 0; i < words.size(); ++i) {
            String word = words.get(i);
            adj.put(word, adj.getOrDefault(word, new ArrayList<>()));
            for (int j = 0; j < words.size(); ++j) {
                if (i == j) continue;
                int count = 0;
                String other = words.get(j);
                for (int k = 1; k < 5; ++k) {
                    count += other.indexOf(word.charAt(k)) != -1 ? 1 : 0;
                }
                // if the condition is satisfied, there exists an edge from `word` to `other`
                if (count >= 4)
                    adj.get(word).add(other);
            }
        }

        return adj;
    }

这是 BFS：

    public static int bfs(Map<String, List<String>> adj, List<String> test) {
        Queue<String> q = new LinkedList<>();
        Set<String> visited = new HashSet<>(); // to keep track of the visited words, since the graph is not necessarily a DAG
        String start = test.get(0);
        String end = test.get(1);
        // if `start` and `end` words are equal
        if (start.equals(end))
            return 0;

        q.add(start);
        visited.add(start);
        int count = 0;
        while (!q.isEmpty()) {
            count++;
            int size = q.size();
            for (int i = 0; i < size; ++i) {
                String word = q.poll();
                for (String val : adj.get(word)) {
                    if (val.equals(end))
                        return count; // return the number of edges
                    if (!visited.contains(val)) // only add the words which aren't visited yet.
                        q.add(val);
                }
            }
        }
        return -1; // if there isn't any edge
    }

Answer 2

@The Room 提供了一个很好的答案，但我想建议对邻接列表构造部分进行简单修改，因为提供的构建列表的方法复杂度为 O(n^2)，这将导致大型输入文件性能不佳。

简单地说，您可以采用每个单词的 4 个字符的每个可能的 sorted 模式，并将其插入到具有单词 ID（例如索引）的哈希映射中。

C++ 代码示例：

map<string , vector<int> >mappings ;

for(int i = 0 ; i < words.size();  i++){
    string word = words[i].substr(0 , 4) ; 
    sort(word.begin() , word.end()); 
    mappings[word].push_back(i); 
    for(int j = 0 ; j < 4 ; j++){
        word = words[i].substr(0 , 4) ; 
        word[j] = words[i][4]; 
        sort(word.begin() , word.end()); 
        mappings[word].push_back(i);
    }
}

现在你有了一个单词索引向量，你知道它们和任何以向量键的相同 4 个字符结尾的单词之间必须有一条边。

然后您可以简单地构建图形，只需注意不要创建自循环（避免用节点和它本身创建边）。

代码示例：

// Building the graph with complexity of O(n * log(no. of edges))
const int N = 100000; // Just and example 
vector<int>graph[N]; 
for(int i = 0 ; i < words.size(); i++){
    string tmp = words[i].substr(1 , 4); 
    sort(tmp.begin() , tmp.end()); 
    for(int j = 0 ; j < mappings[tmp].size(); j++){
        if (j == mappings[tmp][j])
            continue; 
            
        graph[i].push_back(mappings[tmp][j]);
    }
}

最后你可以遍历你的测试文件，获取开始和目标索引（当读取文件时将每个单词存储为一个键，它的索引值）然后你应用 bfs 函数来计算@The Room

的回答中描述的边缘

我只是想为那些可能需要针对具有大量输入的类似问题的解决方案的人们建议这个答案，这将降低构建图形的复杂性，从 O(N^2) 到 O(N * log(边数)) 其中 N 是单词数。

Answer 3

我的方法略有不同，我将在下面讨论的问题也有细微差别：

首先我们创建一个邻接表：（@Volpe95 对此进行了很好的优化）。在单词是关键的地方使用节点图。

Map<String, Node> nodes = new HashMap<>();

        List<String> words = new DataHelper().loadWords("src/main/wordsInput.dat");
        System.out.println(words);

        for (int i = 0; i < words.size(); i++) {
            String l = words.get(i);
            nodes.put(l, new Node(l));
        }

        for(Map.Entry<String,Node> l: nodes.entrySet()) {
            for(Map.Entry<String, Node> r:nodes.entrySet()) {
                if (l.equals(r)) continue;
                if (isLinkPair(l.getKey(), r.getKey())) {
                    Node t = nodes.get(l.getKey());
                    System.out.println(t);
                    t.addChild(nodes.get(r.getKey()));
                }
            }

        }

IsLinkPair 检查是否可以在可能的 child 单词中找到单词的最后四个字母。

private static boolean isLinkPair(String l, String r) {
        // last 4 chars only
        for (int i = 1; i < l.length(); i++) {
            if(r.indexOf(l.charAt(i)) == -1){
                return false;
            }
        }
        return true;
    }

一个节点存储每个词和children以及edgeTo，这个用来计算每个节点存储它的parent的最短路径。这 child parent 将始终在最短路径上。（Sedgewick 将这些数据存储在单独的数组中，但通常更容易将它们分组在 class 中，因为它使代码更容易理解）

（为清楚起见省略了 Getters Setters 等）

public class Node {
    private Set<Node> children;
    private String word;

    private Node edgeTo;

    private int visited;

    public Node(String word) {
        children = new HashSet<>();
        this.word = word;
        edgeTo = null;
    }
}

基于Sedgewick的BFS算法，依次搜索每个节点，它的直接children和它们的children等等。它每次都在远离原点的地方寻找。请注意，使用了一个队列，这是由 Java.

中的 LinkedList 实现的

private boolean bfs(Map<String,Node> map, Node source, Node target) {
        if(source == null || target == null) return false;
        if(source.equals(target))return true;
        Queue<Node> queue = new LinkedList<>();
        source.setVisited();
        queue.add(source);
        while(!queue.isEmpty()) {
            Node v = queue.poll();
            for (Node c : v.getChildren()) {
                if(c.getVisited()==0){
                    System.out.println("visiting " + c);
                    c.setVisited();
                    c.setEdgeTo(v);
                    if(c.equals(target)) {
                        return true;
                    }
                    queue.add(c);
                }
            }
        }

        return false;
    }

注意v是parent，c是它的children。 setEdgeTo用于设置一个childs parent.

最后我们查看source和target分别为源词和目标词的结果：

BreadthFirstPaths bfs = new BreadthFirstPaths(nodes,source,target);
int shortestPath = bfs.getShortestPath(nodes,source,target);

那么我上面提到的细微差别是什么？最短路径计算是必要的，因为 zcdea 有两个 parents fzcde 和 bcdez，你需要最短路径上的那个。要使用 child 的 edgeTo，找到它的 parent 并重复直到路径如下所示。由于 bfs 从原点向外搜索的方式，child parent 关系将始终在最短路径上。

// get edgeTo on target (the parent) , find this node and get its parent
    // continue until the shortest path is walked or no path is found
    public int getShortestPath(Map<String,Node> map, String source, String target) {
        Node node = map.get(target);
        int pathLength = 0;
        do {
            if(node == null || pathLength > map.size()) return NOPATH;
            if(node.equals(map.get(source))) return pathLength;
            node = map.get(node.getWord()).getEdgeTo();
            pathLength++;
        } while (true);
    }

总有 space 时间复杂度权衡需要考虑和优化。

BFS 五字母单词链

BFS five letter word chain

java

algorithm

graph-algorithm

这是 BFS：