Segmentation fault in a program that counts occurrences of a word in a file using threads

Question

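I have the following problem: implement a program that takes a file name followed by words as arguments. For each word, create a separate thread that counts its occurrences in the given file. Print the sum of the occurrences of all the words.

My code is: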

#include <stdio.h>
#include <stdlib.h>
#include <strings.h>
#include <unistd.h>
#include <pthread.h>

pthread_mutex_t mtx; // used by each of the three threads to prevent  other threads from accessing global_sum during their additions

int global_sum = 0;
typedef struct{
                    char* word;
                    char* filename;
}MyStruct;



void *count(void*str)
{
    MyStruct *struc;
    struc = (MyStruct*)str; 
    const char *myfile = struc->filename;

    FILE *f;
    int count=0, j;
    char buf[50], read[100];
    // myfile[strlen(myfile)-1]='\0';
    if(!(f=fopen(myfile,"rt"))){
         printf("Wrong file name");
    }
    else
         printf("File opened successfully\n");
         for(j=0; fgets(read, 10, f)!=NULL; j++){
             if (strcmp(read[j],struc->word)==0)
                count++;
         }

    printf("the no of words is: %d \n",count);  
    pthread_mutex_lock(&mtx); // lock the mutex, to prevent other threads from accessing global_sum
    global_sum += count; // add thread's count result to global_sum
    pthread_mutex_unlock(&mtx); // unlock the mutex, to allow other threads to access the variable
}


int main(int argc, char* argv[]) {
    int i;
    MyStruct str; 

    pthread_mutex_init(&mtx, NULL); // initialize mutex
    pthread_t threads[argc-1]; // declare threads array 

    for (i=0;i<argc-2;i++){

       str.filename = argv[1];  
       str.word = argv[i+2];

       pthread_create(&threads[i], NULL, count, &str); 
    }

    for (i = 0; i < argc-1; ++i)
         pthread_join(threads[i], NULL);

    printf("The global sum is %d.\n", global_sum); // print global sum

    pthread_mutex_destroy(&mtx); // destroy the mutex

    return 0;

}

When I try to run it, I get a segmentation fault. Why is that? Thanks!

Answer 1

In main(), your two loops over i have different bounds:

for (i=0;i<argc-2;i++){
    ...
    pthread_create(&threads[i], NULL, count, &str); 
}

and then

for (i = 0; i < argc-1; ++i)
    pthread_join(threads[i], NULL);

In the second loop you reference threads[argc-2], which was never created in the first loop.

Answer 2

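Reading the file in chunks of at most 10 bytes (fgets(read, 10, f) stores at most 9 characters plus the terminating '\0') will probably miss some instances of the word being searched for, for example when a word straddles two chunks.

strcmp() always starts comparing at the beginning of the chunk, and it compares the whole string.

1) The target word has to be looked for at any position in the file.

2) The target word has to be looked for at any position within the buffer that was read.

Suggestion: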

0) clear input buffer
1) input one char at a time, 
accumulating characters in the input buffer,
2) when a word separator found, (for instance a space or EOF)
3) then check if the word matches the target word.  
4) if matches, increment count.   
5) if EOF, then exit, else goto 0
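
The steps above can be sketched roughly as follows. This is only one possible reading of the suggestion: the function name count_word_in_stream, the MAX_WORD limit, and the choice of whitespace as the only word separator are assumptions made for the illustration, not part of the original answer.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_WORD 256 /* assumed upper bound on the length of a word */

/* Count whitespace-separated words in f that are exactly equal to target. */
int count_word_in_stream(FILE *f, const char *target)
{
    char word[MAX_WORD];
    size_t len = 0;   /* step 0: the word buffer starts out empty */
    int too_long = 0; /* set when a word does not fit into the buffer */
    int count = 0;
    int c;

    assert(f != NULL && target != NULL);
    do {
        c = fgetc(f); /* step 1: read one character at a time */
        if (c != EOF && !isspace(c)) {
            if (len < sizeof(word) - 1)
                word[len++] = (char)c;
            else
                too_long = 1;
            continue;
        }
        /* step 2: a separator (or EOF) ends the current word */
        word[len] = '\0';
        /* steps 3 and 4: compare and count */
        if (len > 0 && !too_long && strcmp(word, target) == 0)
            count++;
        /* step 5: reset the buffer and go on unless EOF was reached */
        len = 0;
        too_long = 0;
    } while (c != EOF);
    return count;
}
```

Punctuation is treated as part of a word here, so "word," would not match "word"; whether that is acceptable depends on the exercise.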

Answer 3

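First of all, your code is very badly formatted. It is not even consistent. It also does not look like you are compiling with warnings enabled.

If this is for a university course and they did not tell you how to format code and how to compile with warnings, I strongly suggest you ask your tutor.

If you are using gcc, add -Wall -Wextra. For a coding style, I suggest stealing one from Linux or FreeBSD. Various editors can format the code for you, including real editors like vim (worth a try, even though it looks harsh).

A consistent coding style helps you stay out of trouble.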

void *count(void*str)
{
    MyStruct *struc;
    struc = (MyStruct*)str;
    const char *myfile = struc->filename;

    FILE *f;
    int count=0, j;
    char buf[50], read[100];

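buf is unused, which you would have learned if you had enabled warnings. read is a bad name: it shadows the POSIX read() function declared in <unistd.h>.

    // myfile[strlen(myfile)-1]='\0';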
    if(!(f=fopen(myfile,"rt"))){
         printf("Wrong file name");
    }
    else

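Because you do not return here (and you should), you go on to execute code that should not run. Your 'else' clause does not do what you think it does. The braces are missing, so only the first printf belongs to it, and the for loop below runs even when opening the file failed. In that case fgets is called with f == NULL.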

         printf("File opened successfully\n");
         for(j=0; fgets(read, 10, f)!=NULL; j++){

10? That looks like a typo, since you probably meant 100. It would not have happened if you had used sizeof.
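
A minimal sketch of the two points made so far, returning early when fopen fails and letting sizeof supply the buffer size. The function name count_chunks and its return convention are made up for this illustration; it only counts how many times fgets succeeds.

```c
#include <assert.h>
#include <stdio.h>

/* Return the number of chunks fgets reads from the file at path,
 * or -1 if the file cannot be opened. */
long count_chunks(const char *path)
{
    char line[100];
    long n = 0;
    FILE *f;

    assert(path != NULL);
    f = fopen(path, "r");
    if (f == NULL) {
        perror(path);
        return -1; /* bail out: nothing below may touch f */
    }
    /* sizeof(line) always matches the declaration of the buffer */
    while (fgets(line, sizeof(line), f) != NULL)
        n++;
    fclose(f);
    return n;
}
```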

             if (strcmp(read[j],struc->word)==0)
                count++;
         }

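It is not clear what you are doing here. It seems you want to run strcmp starting from read[0], read[1], and so on. But on every iteration you read new data that replaces the contents of the buffer, and then you advance by one. This makes no sense. Finally, you do it wrong anyway: read[j] is a single char, not an address, and once again the compiler will tell you so if you ask.

The strcmp approach is pretty bad anyway. Try an approach that matches the first character and works from there.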
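
One way to read "match the first character and work from there" is sketched below. The name count_in_buffer is hypothetical, and the rule that a match must be delimited by whitespace or by the buffer edges is an assumption about what counts as a word.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Count whole-word occurrences of word in buf[0..len). */
size_t count_in_buffer(const char *buf, size_t len, const char *word)
{
    size_t wlen;
    size_t count = 0;

    assert(buf != NULL && word != NULL);
    wlen = strlen(word);
    if (wlen == 0 || wlen > len)
        return 0;

    for (size_t i = 0; i + wlen <= len; i++) {
        if (buf[i] != word[0])
            continue; /* cheap test on the first character */
        if (memcmp(buf + i, word, wlen) != 0)
            continue; /* the rest of the word does not match */
        /* assumed rule: whitespace or a buffer edge on both sides */
        if (i > 0 && !isspace((unsigned char)buf[i - 1]))
            continue;
        if (i + wlen < len && !isspace((unsigned char)buf[i + wlen]))
            continue;
        count++;
    }
    return count;
}
```

This only works if the whole text is in one buffer; with chunked reads a word can still straddle two chunks.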

int main(int argc, char* argv[]) {

The usual misplaced '*'. Use char *argv[] instead.

    int i;
    MyStruct str;

    pthread_mutex_init(&mtx, NULL); // initialize mutex
    pthread_t threads[argc-1]; // declare threads array

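Strongly not recommended. Validate the arguments first, then keep the number of threads in a dedicated variable. Only at that point allocate the array.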

    for (i=0;i<argc-2;i++){

       str.filename = argv[1];
       str.word = argv[i+2];

       pthread_create(&threads[i], NULL, count, &str);
    }

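As with the thread count, save the path somewhere. Referring to it as argv[1] everywhere is bad style and will come back to bite you. Using argv for the words is fine.

This is, however, plain wrong. You set up one local struct, pass its address to a thread, and then immediately modify it. All your threads share that one struct, so they can end up counting the same word, and the word they see keeps changing while main overwrites it.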

    for (i = 0; i < argc-1; ++i)
         pthread_join(threads[i], NULL);

Go figure. You have no variable holding the number of threads, which led to this inconsistency (argc - 1 vs. argc - 2).
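
The two problems just described, one struct shared by all threads and loop bounds that disagree, can both be avoided along the following lines. This is a sketch under assumptions: struct task, worker, and run_workers are invented names, and the worker merely stores the length of its word as a stand-in for the real counting.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-thread argument: every thread owns exactly one. */
struct task {
    const char *word;   /* input: the word assigned to this thread */
    size_t      length; /* output: written only by the owning thread */
};

static void *worker(void *arg)
{
    struct task *t = arg;

    t->length = strlen(t->word); /* placeholder for the real work */
    return NULL;
}

/* Start one thread per word and store the result for words[i] in
 * lengths[i]. Returns 0 on success, -1 on failure. */
int run_workers(char **words, size_t nthreads, size_t *lengths)
{
    pthread_t *threads;
    struct task *tasks;
    size_t started = 0;

    assert(words != NULL && lengths != NULL);
    if (nthreads == 0)
        return 0;
    threads = malloc(nthreads * sizeof(*threads));
    tasks = malloc(nthreads * sizeof(*tasks));
    if (threads == NULL || tasks == NULL) {
        free(threads);
        free(tasks);
        return -1;
    }
    for (; started < nthreads; started++) {
        tasks[started].word = words[started]; /* own slot per thread */
        tasks[started].length = 0;
        if (pthread_create(&threads[started], NULL, worker,
                           &tasks[started]) != 0)
            break;
    }
    /* join exactly the threads that were created */
    for (size_t i = 0; i < started; i++) {
        pthread_join(threads[i], NULL);
        lengths[i] = tasks[i].length;
    }
    free(threads);
    free(tasks);
    return started == nthreads ? 0 : -1;
}
```

Because each thread writes only to its own slot and main reads the slots after pthread_join, no mutex is needed here.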

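In general, this problem could largely have been solved by properly reading the compiler warnings, and mostly avoided by adopting basic good practices.

Of course, bugs happen anyway, in which case you would at least be able to narrow them down.

Finally, a few words about the general approach. It is not clear what the purpose of this exercise is. You actually have to force yourself to use anything beyond pthread_create and pthread_join, assuming the only requirement is to use threads.

I do not know whether they force you to open the file multiple times or what. Opening and reading the contents multiple times is not only wasteful, it can also lead to a situation where the file gets replaced and some threads open a different file.

An OK solution is to open the file once, in main. Once it is open, you fstat it to get the size and mmap it. If for some reason you cannot use mmap, you malloc a big enough buffer and read the file into it.

Then every thread gets the address of that buffer, the word to look for, and the address where it should store its counter (each thread gets a different address).

When all threads have exited, you loop over the results and sum them up.

Neither variant involves any locking.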
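
A rough sketch of this design, using the malloc variant rather than mmap. All names (struct job, count_job, sum_counts, read_whole_file) are hypothetical, and the whole-word rule (whitespace or buffer edge on both sides) is an assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

struct job {
    const char *buf;   /* shared, read-only file contents */
    size_t      len;
    const char *word;  /* the word this thread looks for */
    size_t      count; /* private result slot, no lock needed */
};

static void *count_job(void *arg)
{
    struct job *j = arg;
    size_t wlen = strlen(j->word);

    j->count = 0;
    for (size_t i = 0; wlen > 0 && i + wlen <= j->len; i++) {
        if (memcmp(j->buf + i, j->word, wlen) != 0)
            continue;
        if (i > 0 && !isspace((unsigned char)j->buf[i - 1]))
            continue;
        if (i + wlen < j->len && !isspace((unsigned char)j->buf[i + wlen]))
            continue;
        j->count++;
    }
    return NULL;
}

/* Run one thread per word over the same buffer and return the sum of
 * all counts, or -1 on failure. */
long sum_counts(const char *buf, size_t len, char **words, size_t nwords)
{
    pthread_t *threads = calloc(nwords ? nwords : 1, sizeof(*threads));
    struct job *jobs = calloc(nwords ? nwords : 1, sizeof(*jobs));
    size_t started = 0;
    long total = 0;

    assert(buf != NULL && words != NULL);
    if (threads == NULL || jobs == NULL) {
        free(threads);
        free(jobs);
        return -1;
    }
    for (; started < nwords; started++) {
        jobs[started] = (struct job){ buf, len, words[started], 0 };
        if (pthread_create(&threads[started], NULL, count_job,
                           &jobs[started]) != 0)
            break;
    }
    for (size_t i = 0; i < started; i++) {
        pthread_join(threads[i], NULL);
        total += (long)jobs[i].count; /* summed after the thread exited */
    }
    if (started != nwords)
        total = -1;
    free(threads);
    free(jobs);
    return total;
}

/* Read a whole file into a malloc'd, NUL-terminated buffer.
 * Returns NULL on failure; the caller frees the buffer. */
char *read_whole_file(const char *path, size_t *len)
{
    FILE *f = fopen(path, "rb");
    struct stat st;
    char *buf = NULL;

    if (f == NULL)
        return NULL;
    if (fstat(fileno(f), &st) == 0 && st.st_size >= 0) {
        buf = malloc((size_t)st.st_size + 1);
        if (buf != NULL) {
            *len = fread(buf, 1, (size_t)st.st_size, f);
            buf[*len] = '\0';
        }
    }
    fclose(f);
    return buf;
}
```

A main built on this would validate argc, call read_whole_file(argv[1], &len) once, and then pass &argv[2] and argc - 2 to sum_counts.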

Tags: c, linux, multithreading, file, pthreads