字数统计项目的多个哈希表

Question

我已经写了一个工作项目，但我的问题是，它比我最初的目标慢得多，所以我有一些关于如何改进它的想法，但我不知道如何实施这些想法还是我应该首先实际实施这些想法？

我项目的主题是，读取一个充满推文的 CSV (Excel) 文件并计算其中的每个单词，然后显示最常用的单词。
（Excel的每一行都有关于推文和推文本身的信息，我应该只关心推文）

我不会分享整个代码，而是简单地写下我到目前为止所做的，并且只询问我正在努力的部分。
首先，我想道歉，因为这将是一个很长的问题。

重要说明：我唯一应该关注的是速度，存储或大小都不是问题。

All the steps:

Read a new line from Excel file.

Find the "tweet" part from the whole line and store it as a string.

Split the tweet string into words and store it in the array.

For every word stored in an array, calculate the ASCII value of the word.

（为了找到单词的 ascii 值，我简单地将它具有的每个字母的 ascii 值相加）

Put the word in Hash Table with the key of ASCII value.

（例子：Word "hello" 的ascii值是104+101+108+108+111 = 532，所以它在hasttable中用key 532存储）

In Hash Table only the word "as a string" and the key value "as an int" is stored and count of the words (how much the same word is used) is stored in a separated array.

我将分享 Insert 函数（用于向 Hashtable 中插入一些内容），因为我相信如果我试图在没有代码的情况下解释这部分会造成混淆。

void Insert(int key, string value) //Key (where we want to add), Value (what we want to add)
{

    if (key < 0) key = 0; //If key is somehow less than 0, for not taking any error key become 0.

    if (table[key] != NULL) //If there is already something in hast table 
    {       

        if (table[key]->value == value) //If existing value is same as the value we want to add
        {               
            countArray[key][0]++;
        }
        else //If value is different,
        {           
            Insert(key + 100, value);  //Call this function again with the key 100 more than before.
        }
    }
    else //There is nothing saved in this place so save this value
    {           
        table[key] = new HashEntry(key, value); 
        countArray[key][1] = key;
        countArray[key][0]++;           
    }

}

So "Insert" function has three-part.

Add the value to hash table if hast table with the given key is empty.

If hast table with the given key is not empty that means we already put a word with this ascii value.

因为不同的单词可以有完全相同的 ascii 值。

The program first checks if this is the same word.

If it is, count increase (In the count array).

If not, Insert function is again called with the key value of (same key value + 100) until empty space or same value is found.

读取整行并将每个单词存储在哈希中后table ->

Sort the Count array
Print the first 10 element

程序到此结束，问题是什么？

现在我最大的问题是我正在读取一个包含数千行的非常大的 CSV 文件，所以每一个不必要的东西都会显着增加时间。

我的第二个问题是有很多值具有相同的 ASCII 值，我的方法比正常的 ascii 值方法检查一百多的方法有效，但是？为了找到空的 space 或相同的词，Insert 函数每个词调用自身一百次。
（这引起了最多的问题）。

所以我想到了使用多个Hashtable。
例如，我可以检查单词的第一个字母是否为
在A和E之间，存储在第一个散列table
F和J之间，存入第二个hash table
...
在 V 和 Z 之间，存储在最后一个哈希 table 中。

再次重要说明：我唯一应该关注的是速度，存储或大小都不是问题。

所以冲突应该主要通过这种方式减少。
我什至可以创建一个荒谬数量的散列 tables 并且对于每个不同的字母，我可以使用不同的散列 table.
但我不确定这样做是否合乎逻辑，或者我是否可以使用更简单的方法。

如果可以使用多个散列 tables，而不是一个接一个地创建散列 tables，是否可以创建一个存储散列 table 的数组] 在每一个地方？（与 Array of Arrays 相同，但这次数组存储 Hashtables）如果可行且合乎逻辑，有人可以展示如何去做吗？

这是散列 table 我有：

class HashEntry
{
public:
    int key;
    string value;
    HashEntry(int key, string value)
    {
        this->key = key;
        this->value = value;
    }
};

class HashMap
{
private:
    HashEntry * *table;
public:
    HashMap()
    {
        table = new HashEntry *[TABLE_SIZE];
        for (int i = 0; i < TABLE_SIZE; i++)
        {
            table[i] = NULL;
        }
    }

//Functions

}

非常抱歉我问了这么长的问题，如果我不能把每一部分都解释清楚，我再次非常抱歉，英语不是我的母语。

最后一点，我正在为一个学校项目做这件事，所以我不应该使用任何第三方软件或包含任何不同的库，因为这是不允许的。

Answer 1

您使用了一个非常糟糕的散列函数（添加所有字符），这就是为什么您会遇到如此多的冲突，并且您的 Insert 方法因此多次调用自身。

有关不同哈希函数的详细概述，请参阅 this 问题的答案。我建议您尝试 DJB2 或 FNV-1a（用于 std::unordered_map 的某些实现）。

您还应该使用更本地化的 "probes" 来改进 cache-locality 并在您的 Insert 方法中使用循环而不是递归。

但首先我建议您稍微调整一下 HashEntry：

class HashEntry
{
public:
    string key; // the word is actually a key, no need to store hash value
    size_t value; // the word count is the value.
    HashEntry(string key)
        : key(std::move(key)), value(1) // move the string to avoid unnecessary copying
    { }
};

那我们尝试使用更好的散列函数：

// DJB2 hash-function
size_t Hash(const string &key)
{
    size_t hash = 5381;
    for (auto &&c : key)
        hash = ((hash << 5) + hash) + c;
    return hash;
}

然后重写Insert函数：

void Insert(string key)
{
    size_t index = Hash(key) % TABLE_SIZE;

    while (table[index] != nullptr) {       
        if (table[index]->key == key) {               
            ++table[index]->value;
            return;
        }
        ++index;
        if (index == TABLE_SIZE) // "wrap around" if we've reached the end of the hash table
            index = 0;
    }           

    table[index] = new HashEntry(std::move(key));
}

要通过键查找散列 table 条目，您可以使用类似的方法：

HashEntry *Find(const string &key)
{
    size_t index = Hash(key) % TABLE_SIZE;

    while (table[index] != nullptr) {       
        if (table[index]->key == key) {               
            return table[index];
        }
        ++index;
        if (index == TABLE_SIZE)
            index = 0;
    }           

    return nullptr;
}

字数统计项目的多个哈希表

Multiple Hash Tables for the Word Count Project

c++

hashtable

count

word