Hit Count extraction from log files using mapreduce

I am trying to write the following code in Hadoop MapReduce. I have a log file that contains IP addresses and the URLs opened by each IP, as follows:

192.168.72.224 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.facebook.com
192.168.198.92 www.google.com
192.168.198.92 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.yahoo.com
192.168.198.92 www.m4maths.com
192.168.198.92 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.m4maths.com
192.168.72.224 www.indiabix.com
192.168.198.92 www.google.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.facebook.com
192.168.198.92 www.indiabix.com
192.168.72.177 www.indiabix.com
192.168.72.224 www.google.com
192.168.198.92 www.askubuntu.com
192.168.198.92 www.askubuntu.com
192.168.198.92 www.facebook.com
192.168.198.92 www.gmail.com
192.168.72.177 www.facebook.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.m4maths.com
192.168.72.224 www.yahoo.com
192.168.72.177 www.google.com
192.168.72.177 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.facebook.com
192.168.198.92 www.google.com
192.168.198.92 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.yahoo.com
192.168.198.92 www.m4maths.com
192.168.198.92 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.m4maths.com
192.168.72.224 www.indiabix.com

Now I need to organize the results from this file so that it lists the distinct IP addresses and URLs, followed by the number of times the particular IP opened each URL.

For example, if 192.168.72.224 opened www.yahoo.com 15 times according to the whole log file, then the output must contain:

192.168.72.224 www.yahoo.com 15

This should be done for all the IPs in the file, and the final output should look like this:

192.168.72.224 www.yahoo.com 15
               www.m4maths.com 11
192.168.72.177 www.yahoo.com 6
               www.gmail.com 19
....
...
..
.
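To pin down the aggregation the job has to perform, it can be expressed in plain Java (outside Hadoop) as a nested map of IP → URL → count. This is only an illustration of the expected result on a few sample lines; the class name `HitCountSketch` is made up for the example:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class HitCountSketch {
    public static void main(String[] args) {
        // A few lines from the log above; the real job reads the whole file.
        String[] log = {
            "192.168.72.224 www.yahoo.com",
            "192.168.72.224 www.yahoo.com",
            "192.168.72.177 www.google.com",
            "192.168.72.224 www.google.com",
        };

        // ip -> (url -> hit count)
        Map<String, Map<String, Integer>> hits = new LinkedHashMap<>();
        for (String line : log) {
            String[] parts = line.split("\\s+");
            hits.computeIfAbsent(parts[0], ip -> new TreeMap<>())
                .merge(parts[1], 1, Integer::sum);
        }

        for (Map.Entry<String, Map<String, Integer>> byIp : hits.entrySet())
            for (Map.Entry<String, Integer> byUrl : byIp.getValue().entrySet())
                System.out.println(byIp.getKey() + " " + byUrl.getKey() + " " + byUrl.getValue());
    }
}
```

On this sample it prints each (IP, URL) pair with its count, e.g. `192.168.72.224 www.yahoo.com 2`. The MapReduce job below computes the same thing in a distributed way.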

The code I have tried is:

public class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
    {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);

        while (tokenizer.hasMoreTokens())
        {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

I know this code is seriously flawed; please give me an idea of how to proceed.

Thanks.

I would recommend this design:

  1. The mapper takes a line from the file and emits the IP as the key and a pair of (website, 1) as the value.
  2. The combiner and reducer take an IP as the key and a series of (website, count) pairs, aggregate them by website (using a HashMap), and emit the IP, website, and count as output.

Implementing this will require you to write a custom Writable to handle the (website, count) pair.
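A minimal sketch of such a pair type, shown here as a plain class so it can run on its own; in a real job it would additionally declare `implements org.apache.hadoop.io.Writable`, and the `write`/`readFields` methods below already have the signatures that interface requires. The class name `WebsiteCountPair` is made up for the example:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical (website, count) pair for use as a MapReduce value type.
public class WebsiteCountPair {
    private String website = "";
    private int count;

    public WebsiteCountPair() { }                  // Writable types need a no-arg constructor

    public WebsiteCountPair(String website, int count) {
        this.website = website;
        this.count = count;
    }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(website);                     // serialize fields in a fixed order
        out.writeInt(count);
    }

    public void readFields(DataInput in) throws IOException {
        website = in.readUTF();                    // deserialize in the same order
        count = in.readInt();
    }

    public String getWebsite() { return website; }
    public int getCount()      { return count; }

    public static void main(String[] args) throws IOException {
        // Round-trip a pair through its own serialization to show it works.
        WebsiteCountPair original = new WebsiteCountPair("www.yahoo.com", 15);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        original.write(new DataOutputStream(buffer));

        WebsiteCountPair copy = new WebsiteCountPair();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
        System.out.println(copy.getWebsite() + " " + copy.getCount());
    }
}
```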

Personally, unless you care a lot about performance, I would use Spark for this. With PySpark it is as simple as:

rdd = sc.textFile('/sparkdemo/log.txt')
counts = rdd.map(lambda line: line.split()).map(lambda fields: ((fields[0], fields[1]), 1)).reduceByKey(lambda x, y: x + y)
result = counts.map(lambda pair: (pair[0][0], (pair[0][1], pair[1]))).groupByKey().collect()
for ip, sites in result:
    print('IP: %s' % ip)
    for website, count in sites:
        print('    website: %s count: %d' % (website, count))

The output for your example is:

IP: 192.168.72.224
    website: www.facebook.com count: 2
    website: www.m4maths.com count: 2
    website: www.google.com count: 5
    website: www.gmail.com count: 4
    website: www.indiabix.com count: 8
    website: www.yahoo.com count: 3
IP: 192.168.72.177
    website: www.yahoo.com count: 14
    website: www.google.com count: 3
    website: www.facebook.com count: 3
    website: www.m4maths.com count: 3
    website: www.indiabix.com count: 1
IP: 192.168.198.92
    website: www.facebook.com count: 4
    website: www.m4maths.com count: 3
    website: www.yahoo.com count: 3
    website: www.askubuntu.com count: 2
    website: www.indiabix.com count: 1
    website: www.google.com count: 5
    website: www.gmail.com count: 1

I have written the same logic in Java:

public class UrlHitMapper extends Mapper<Object, Text, Text, Text> {

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

        StringTokenizer st = new StringTokenizer(value.toString());

        // Emit (IP, url); guard against lines that do not have both tokens.
        if (st.countTokens() >= 2)
            context.write(new Text(st.nextToken()), new Text(st.nextToken()));
    }
}

public class UrlHitReducer extends Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {

        // Count how many times each URL appears for this IP.
        HashMap<String, Integer> urlCount = new HashMap<>();

        for (Text value : values) {
            String url = value.toString();
            urlCount.merge(url, 1, Integer::sum);
        }

        for (Entry<String, Integer> e : urlCount.entrySet())
            context.write(key, new Text(e.getKey() + "    " + e.getValue()));
    }
}

public class UrlHitCount extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new UrlHitCount(), args));
    }

    public int run(String[] args) throws Exception {

        Job job = Job.getInstance(getConf());

        job.setJobName("url-hit-count");
        job.setJarByClass(UrlHitCount.class);

        job.setMapperClass(UrlHitMapper.class);
        job.setReducerClass(UrlHitReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path("input/urls"));
        FileOutputFormat.setOutputPath(job, new Path("url_output" + System.currentTimeMillis()));

        // Wait for the job and propagate its success/failure as the exit code.
        return job.waitForCompletion(true) ? 0 : 1;
    }

}