How do I reduce the time complexity of a file IO program?

Question

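I wrote this C code to count how many times a word occurs in a file. The code works, but it takes far too long: counting a word in a 650MB file takes 151.1 seconds. I want to process the file at about 80MB/s. How can I improve the time complexity? Thanks a lot.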

#include <stdio.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>
int main(){
    FILE *fptr;
    int l,i=0,count=0,total=0;
    char name[100],n,word[25],k;
    printf("\nEnter the word to be found:");
    scanf("%s",word);
    l=strlen(word);
    printf("\nEnter the file name:");
    scanf("%s",name);
    fptr=fopen(name,"r");
    if(fptr==NULL){
        printf("\nProblem with opening the file");
        exit(1);
    }
    n=fgetc(fptr);
    while((feof(fptr)==0)){
        if(n==toupper(word[i])||n==tolower(word[i])){
            count++;
            i++;
        }
        else if(n!=word[i]){
            if(count>1){
                fseek(fptr, -count, SEEK_CUR);
            }
            count=0;
            i=0;
        }
        if(count==l){
            total++;
            count=0;
            i=0;
        }
        n=fgetc(fptr);
    }
    if(total==0){
        printf("\nThe word %s does not exist in the file",word);
    }
    printf("\nThe word %s occurred %d time(s) in the file",word,total);
}

Answer 1

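Read a bigger buffer at a time. fgetc() reads one byte per call, which is the smallest amount you can read, so you are maximizing the number of steps it takes to read the file. Every read operation has some overhead. (Not every fgetc() call necessarily results in an actual read from disk; some caching and read-ahead happens behind the scenes.) So the fewer calls you make, the less overhead your program incurs to process the same amount of data.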
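
As a minimal sketch of the idea, the function below reads the stream with fread() in large chunks instead of one fgetc() call per byte. The name count_byte_chunked and the 1 MiB CHUNK_SIZE are assumptions made for illustration, and counting a single byte value stands in for the real word search.

```c
#include <stdio.h>
#include <stdlib.h>

#define CHUNK_SIZE (1 << 20)  /* 1 MiB per fread() call; an assumed, tunable value */

/* Count how many bytes equal to `target` the stream contains,
   reading up to CHUNK_SIZE bytes per call instead of one byte per call.
   Returns -1 if the buffer cannot be allocated. */
long count_byte_chunked(FILE *fp, unsigned char target)
{
    unsigned char *buf = malloc(CHUNK_SIZE);
    long total = 0;
    size_t got;

    if (buf == NULL)
        return -1;

    /* fread() returns the number of bytes actually read; 0 means EOF or error */
    while ((got = fread(buf, 1, CHUNK_SIZE, fp)) > 0) {
        for (size_t i = 0; i < got; i++) {
            if (buf[i] == target)
                total++;
        }
    }

    free(buf);
    return total;
}
```

The inner loop still touches every byte, so the work stays linear in the file size; what shrinks is the number of library calls per byte.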

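Technically, reading in bigger batches does not reduce the "time complexity". The running time is still roughly linear in the file size, so it stays in the same complexity class. It will just be a lot faster, which is what you actually care about.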

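Also, I know you are only showing short sample code for the sake of the question, but you are using unsafe scanf() calls to read into the fixed-size buffers "word" and "name". Since word is only 25 bytes long and one byte is needed for the terminating '\0', a user who enters a word of 25 or more characters can crash or exploit your program.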
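
One possible way to bound the input, sketched under the assumption that each answer is typed on its own line, is to read with fgets() and strip the newline. The helper name read_line_bounded is hypothetical. A width-limited conversion such as scanf("%24s", word) is another option for a 25-byte buffer.

```c
#include <stdio.h>
#include <string.h>

/* Read one line from `in` into buf (at most size-1 characters) and strip
   the trailing newline. Returns 1 on success, 0 on EOF or error.
   Input longer than the buffer is truncated rather than overflowing it;
   the rest of an over-long line stays in the stream. */
int read_line_bounded(FILE *in, char *buf, size_t size)
{
    if (fgets(buf, (int)size, in) == NULL)
        return 0;
    buf[strcspn(buf, "\n")] = '\0';
    return 1;
}
```

In the program from the question this would be called as read_line_bounded(stdin, word, sizeof word), and likewise for name.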

Answer 2

Your program is also likely suffering from a form of I/O amplification: it re-reads the same data over and over.

This is your main file-reading loop:

n=fgetc(fptr);
while((feof(fptr)==0)){
    if(n==toupper(word[i])||n==tolower(word[i])){
        count++;
        i++;
    }
    else if(n!=word[i]){
        if(count>1){
            fseek(fptr, -count, SEEK_CUR);
        }
        count=0;
        i=0;
    }
    if(count==l){
        total++;
        count=0;
        i=0;
    }
    n=fgetc(fptr);
}

Reduced to just the I/O calls:

n=fgetc(fptr);
while((feof(fptr)==0)){
    if(n!=word[i]){
        if(count>1){
            fseek(fptr, -count, SEEK_CUR);
        }
        count=0;
        i=0;
    }

    n=fgetc(fptr);
}

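What happens:

1. You open the file in read-only mode.
2. Because the file is buffered, your first call to fgetc() actually reads from the file's current offset and fills the stream's buffer. That means your program may read several kB at once (typically 4kB or 8kB, depending on your system).
3. Your program calls fgetc() in a loop, and each call returns one char value (held in an int) to your code. Most of the time, that char is simply copied out of the buffer associated with fptr.
4. Your program calls fseek(). That call invalidates the buffered data.
5. On your next call to fgetc(), your program fills its buffer again, most of the time re-reading data it has already read.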

Depending on how often your program calls fseek(), it may read hundreds to thousands of times more data than the file actually contains.

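It is not as bad as it looks, because most of those reads will hopefully be satisfied by the system page cache instead of going all the way to disk. But each fseek() call still causes an extraneous context switch, and together with all the extra calls needed to read one char at a time with fgetc(), that is likely slowing your program down.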

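Simply reading large chunks of data with something like fread() would work, but because you "back up" in the data stream (your fseek() calls), you have to account for the possibility of "backing up" into the previous chunk of data.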

Doing that reliably is somewhat difficult and tedious.
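
One way to handle the chunk boundary without fseek(), as a rough sketch and not the only approach: keep the unexamined tail of each chunk (always fewer bytes than the word length) and prepend it to the next chunk. The function name count_word_chunked and the 64 KiB CHUNK_SIZE are assumptions. Like the code in the question, it matches the word anywhere (substrings included), ignores case, and counts non-overlapping occurrences.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE (1 << 16)  /* 64 KiB per fread() call; an assumed value */

/* Case-insensitive comparison of n bytes; returns 1 if equal */
static int equal_ci(const unsigned char *a, const unsigned char *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (tolower(a[i]) != tolower(b[i]))
            return 0;
    return 1;
}

/* Count non-overlapping, case-insensitive occurrences of `word` in the
   stream. Returns -1 for an empty word or on allocation failure. */
long count_word_chunked(FILE *fp, const char *word)
{
    size_t len = strlen(word);
    if (len == 0)
        return -1;

    /* Room for one chunk plus a carried-over tail of at most len-1 bytes */
    unsigned char *buf = malloc(CHUNK_SIZE + len);
    if (buf == NULL)
        return -1;

    size_t have = 0;  /* bytes carried over from the previous chunk */
    size_t got;
    long total = 0;

    while ((got = fread(buf + have, 1, CHUNK_SIZE, fp)) > 0) {
        size_t n = have + got;
        size_t i = 0;

        while (i + len <= n) {
            if (equal_ci(buf + i, (const unsigned char *)word, len)) {
                total++;
                i += len;  /* skip past the match: non-overlapping count */
            } else {
                i++;
            }
        }

        /* Bytes i..n-1 could still start a match that ends in the next
           chunk, so move them to the front instead of seeking backwards. */
        have = n - i;
        memmove(buf, buf + i, have);
    }

    free(buf);
    return total;
}
```

Each byte of the file is read from the stream exactly once, so the re-reading caused by fseek() disappears; the comparison itself is still the naive one.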

If words never span lines, the simplest solution is to read line by line with fgets() (or getline() on POSIX systems):

for (;;)
{
    // define MAX_LINE_LENGTH to a suitable value
    char line[ MAX_LINE_LENGTH ];

    char *result = fgets( line, sizeof( line ), fp );

    // EOF (or error - either way there's no more data to be read)
    if ( result == NULL )
    {
        break;
    }

    // remove newline (if you want)
    line[ strcspn( line, "\n" ) ] = '\0';

    // now process a line of text
        .
        .
        .
}

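Reading line by line also lets you use standard functions such as strtok() to split the input into separate words, and then strncasecmp() to find case-insensitive matches for the word you are looking for.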
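
A minimal sketch of that per-line processing, assuming a POSIX system: the helper name count_word_in_line and the delimiter set are illustrative choices. It uses strcasecmp() from <strings.h> instead of strncasecmp() because it compares whole tokens, which means it counts whole words only, unlike the substring matching in the question's code.

```c
#include <string.h>
#include <strings.h>  /* strcasecmp(): POSIX, not ISO C */

/* Count the tokens in `line` that equal `word`, ignoring case.
   Note that strtok() modifies `line` in place. */
int count_word_in_line(char *line, const char *word)
{
    /* Assumed delimiter set: whitespace plus common punctuation */
    static const char delims[] = " \t\r\n.,;:!?\"()[]{}";
    int count = 0;

    for (char *tok = strtok(line, delims); tok != NULL;
         tok = strtok(NULL, delims)) {
        if (strcasecmp(tok, word) == 0)
            count++;
    }
    return count;
}
```

Inside the fgets() loop, the running total would be updated with total += count_word_in_line(line, word).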

c

file-io