Getting input from text file and storing into array but text file contains more than 20.000 strings

Question

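Getting input from a text file and storing it into an array, but the text file contains more than 20.000 strings. I am trying to read strings from a text file and store them in one huge array. How should I do that?

I cannot use vectors. Is it possible to do this without using a hash table?

Afterwards, I will try to find the most frequently used words by sorting.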

Answer 1

Assuming you are using C-style/raw arrays, you could do something like this:

const size_t number_of_entries = count_entries_in_file();

//Make sure we actually have entries
assert(number_of_entries > 0);

std::string* file_entries = new std::string[number_of_entries];

//fill file_entries with the files entries
//...

//release heap memory again, so we don't create a leak

delete[] file_entries;
file_entries = nullptr;

Answer 2

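You do not need to keep the whole file in memory to count word frequencies. You only need to keep a single entry at a time, plus some data structure that counts the frequencies, for example std::unordered_map<std::string,unsigned>.

Untested: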

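// Requires <fstream>, <string> and <unordered_map>
std::unordered_map<std::string,unsigned> processFileEntries(std::ifstream& file) {
    std::unordered_map<std::string,unsigned> freq;
    std::string word;

    // Read one whitespace-separated word at a time and count it
    while ( file >> word ) {
        ++freq[word];
    }
    return freq;
}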

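For more efficient reading or more elaborate processing, you could also read the file in chunks (e.g. 100 words), process the chunk, and then continue with the next chunk.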

Answer 3

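Your requirement was not to use any standard container, such as std::vector or std::unordered_map.

In this case we need to create a dynamic container ourselves. That is not complicated. We can even use it for storing strings, so I will not even use std::string in my example.

I created a demo for you with about 700 lines of code.

We will first define the term "capacity". This is the number of elements that could be stored in the container. It is the currently available space. It has nothing to do with how many elements are really stored in the container.

But a dynamic container has one more, and most important, feature. It must be able to grow. This is always necessary if we want to store more elements in the container than its capacity allows.

So, if we want to add an additional element at the end of the container, and the number of elements is >= its capacity, then we need to allocate a bigger block of memory and copy all the old elements into the new memory space. For such events, we usually double the capacity. This prevents frequent reallocation and copying.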
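To see why doubling keeps reallocations rare, the following sketch counts how many times the capacity has to be doubled before a given number of elements fits. The function name and the initial capacity are assumptions for illustration.

```cpp
#include <cstddef>

// Counts how many reallocations a doubling strategy needs until the
// capacity is large enough for n elements. An initial capacity of 0
// is treated as 1, because doubling 0 would never grow.
std::size_t reallocationsForDoubling(std::size_t n, std::size_t initialCapacity) {
    std::size_t capacity = initialCapacity > 0 ? initialCapacity : 1;
    std::size_t count = 0;
    while (capacity < n) {
        capacity *= 2;      // each doubling corresponds to one reallocation
        ++count;
    }
    return count;
}
```

For the 20.000 strings from the question and an initial capacity of 1, this gives 15 reallocations (the capacity ends at 32768). Growing by one element at a time would instead need a reallocation on almost every insertion.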

Let me show you an example of a push_back function, which could be implemented like this:

template <typename T>
void DynamicArray<T>::push_back(const T& d) {               // Add a new element at the end
    if (numberOfElements >= capacity) {                     // Check, if capacity of this dynamic array is big enough
        capacity *= 2;                                      // Obviously not, we will double the capacity
        T* temp = new T[capacity];                          // Allocate new and more memory
        for (unsigned int k = 0; k < numberOfElements; ++k)
            temp[k] = data[k];                              // Copy data from old memory to new memory
        delete[] data;                                      // Release old memory
        data = temp;                                        // And assign newly allocated memory to old pointer
    }
    data[numberOfElements++] = d;                           // And finally, store the given data at the end of the container
}
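For context, here is a minimal sketch of a class that this push_back could belong to. The member names match the function above. The initial capacity of 8, the deleted copy operations, and the size and operator[] members are assumptions made for illustration and are not taken from the author's full code.

```cpp
#include <cstddef>

template <typename T>
class DynamicArray {
    static constexpr std::size_t initialCapacity = 8;   // assumed start size
    T* data;
    std::size_t numberOfElements;
    std::size_t capacity;

public:
    DynamicArray()
        : data(new T[initialCapacity]), numberOfElements(0), capacity(initialCapacity) {}
    ~DynamicArray() { delete[] data; }

    // Copying is disabled in this sketch, so the raw pointer is never
    // owned by two objects at once.
    DynamicArray(const DynamicArray&) = delete;
    DynamicArray& operator=(const DynamicArray&) = delete;

    void push_back(const T& d);
    std::size_t size() const { return numberOfElements; }
    T& operator[](std::size_t i) { return data[i]; }
    const T& operator[](std::size_t i) const { return data[i]; }
};

template <typename T>
void DynamicArray<T>::push_back(const T& d) {
    if (numberOfElements >= capacity) {          // container is full
        capacity *= 2;                           // double the capacity
        T* temp = new T[capacity];               // allocate the bigger block
        for (std::size_t k = 0; k < numberOfElements; ++k)
            temp[k] = data[k];                   // copy the old elements
        delete[] data;                           // release the old block
        data = temp;
    }
    data[numberOfElements++] = d;                // store the new element
}
```

Because the capacity starts above zero, doubling always grows the array. A nested type such as DynamicArray<DynamicArray<char>> would additionally need working copy operations, which this sketch leaves out.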

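This is a basic approach. I use templates in order to be able to store any type in the dynamic array.

You could get rid of the templates by deleting all the template stuff and replacing "T" with your intended data type.

But I would not do that. See how easily we can create a "String" class. We just define a dynamic array of char as "String".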

using String = DynamicArray<char>;

will give us basic string functionality. And if we later want a dynamic array of strings, we can write:

using StringArray = DynamicArray<String>;

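This gives us a DynamicArray<DynamicArray<char>>. Cool.

For these special classes we can overload some operators, which makes handling them, and our life, simpler.

Please have a look at the provided code.

And, to be able to use the container in a typical C++ environment, we can add full iterator capability.

This needs some typing work, but it is not complicated, and it makes the container much easier to use.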
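As a rough illustration of what iterator support buys, the following sketch uses plain pointers as iterators. SmallArray and sumOf are hypothetical names used only here. The author's full code may well implement a dedicated iterator class instead.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical fixed-size wrapper, only meant to show the interface.
// Pointers into a contiguous array already behave like random-access
// iterators, so begin()/end() can simply return them.
template <typename T, std::size_t N>
struct SmallArray {
    T values[N];

    T* begin() { return values; }
    T* end() { return values + N; }
    const T* begin() const { return values; }
    const T* end() const { return values + N; }
};

// Works through begin()/end() by means of a range-based for loop.
template <typename T, std::size_t N>
T sumOf(const SmallArray<T, N>& a) {
    T total{};
    for (const T& v : a) total += v;
    return total;
}
```

With begin() and end() in place, standard algorithms such as std::sort(a.begin(), a.end()) and range-based for loops work on the container without further changes.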

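You also wanted to create a hash map for counting the words.

For this we will create a key/value pair. The key is the String that we defined above, and the value is the frequency counter.

We implement a hash function. It should be selected carefully: it has to work with strings, have high entropy, and give good results for the bucket size of the hash map.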
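One commonly used string hash that fits this description is 32-bit FNV-1a. The sketch below is an assumption for illustration. It is not necessarily the hash function used in the author's code, and bucketIndex is a hypothetical helper.

```cpp
#include <cstddef>
#include <cstdint>

// 32-bit FNV-1a hash over a character sequence of the given length.
std::uint32_t fnv1a(const char* s, std::size_t length) {
    std::uint32_t hash = 2166136261u;                 // FNV offset basis
    for (std::size_t i = 0; i < length; ++i) {
        hash ^= static_cast<unsigned char>(s[i]);     // mix in one byte
        hash *= 16777619u;                            // FNV prime
    }
    return hash;
}

// Maps a key to a bucket; bucketCount must be greater than 0.
std::size_t bucketIndex(const char* s, std::size_t length, std::size_t bucketCount) {
    return fnv1a(s, length) % bucketCount;
}
```

The modulo step is where the bucket size matters: a bucket count that is prime, or a hash with well-mixed low bits, helps to spread the keys evenly across the buckets.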

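The hash map itself is a dynamic container. We will also add iterator functionality to it.

For all this I drafted about 700 lines of code for you. You can take it as an example for your further studies.

It can also easily be enhanced with additional functionality.

But caution: I did only some basic tests, and I even used raw pointers for owned memory. This can be done in a school project to learn some dynamic memory management, but not in real-world code.

Additionally: you can replace all this code by simply using std::string, std::vector and std::unordered_map. Nobody would use such code and reinvent the wheel.

But it may give you some ideas on how to implement similar things.

Because Stack Overflow limits the answer size to 32000 characters, I will put the main part on GitHub.

Please click here.

I will show you only main, so that you can see how easily the solution can be used: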

int main() {

    // Open file and check, if it could be opened
    std::ifstream ifs{ "r:\\test.txt" };   // backslash must be escaped, "\t" would be a tab
    if (ifs) {

        // Define a dynamic array for strings
        StringArray stringArray{};

        // Use overloaded extraction operator and read all strings from the file to the dynamic array
        ifs >> stringArray;

        // Create a dynamic hash map
        HashMap hm{};

        // Now count the frequency of words
        for (const String& s : stringArray)
            hm[s]++;

        // Put the resulting key/value pairs into a dynamic array
        DynamicArray<Item> items(hm.begin(), hm.end());

        // Sort in descending order by the frequency
        std::sort(items.begin(), items.end(), [](const Item& i1, const Item& i2) { return i1.count > i2.count; });

        // Show result on screen
        for (const auto& [string, count] : items) 
            std::cout << std::left << std::setw(20) << string << '\t' << count << '\n';
    }
    else std::cerr << "\n\nError: Could not open source file\n\n";
}

Answer 4

You can use std::map to get the frequency of each word in the text file. An example is given below for reference:

#include <iostream>
#include <map>
#include <string>
#include <sstream>
#include <fstream>
int main()
{
    std::ifstream inputFile("input.txt");
    std::map<std::string, unsigned> freqMap;
    std::string line, word; 
    if(inputFile)
    {
        while(std::getline(inputFile, line))//go line by line 
        {
            std::istringstream ss(line);
            
            while(ss >> word)//go word by word 
            {
                ++freqMap[word]; //increment the count value corresponding to the word 
            }
        }
    }
    else 
    {
        std::cout << "input file cannot be opened"<<std::endl;
    }
    
    //print the frequency of each word in the file 
    for(const auto& myPair: freqMap)
    {
        std::cout << myPair.first << ": " << myPair.second << std::endl;
    }
    return 0;
}

The output of the above program can be seen here.
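The program above prints the words in alphabetical order. Since the question asks for the most frequently used word, here is a small sketch of how it could be picked from the finished map with std::max_element. The function name mostFrequent is hypothetical.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>

// Returns the word with the highest count, or an empty string if the
// map is empty. On a tie, the alphabetically first word is returned,
// because std::map iterates its keys in sorted order.
std::string mostFrequent(const std::map<std::string, unsigned>& freqMap) {
    if (freqMap.empty()) return {};
    auto it = std::max_element(
        freqMap.begin(), freqMap.end(),
        [](const std::pair<const std::string, unsigned>& a,
           const std::pair<const std::string, unsigned>& b) {
            return a.second < b.second;     // compare by count only
        });
    return it->first;
}
```

Calling mostFrequent(freqMap) after the reading loop yields the top word in a single linear pass over the map, without any additional sorting.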


c++

arrays

sorting

dynamic-arrays

data-structures