Manipulating a user input string in MapReduce
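I'm just starting out with the Hadoop variant of MapReduce, so I don't know the ins and outs yet. I understand how it is supposed to work conceptually.
My problem is to find a specific search string within a bunch of files that I have been provided. I'm not concerned with the files themselves; that part is sorted. But how would you go about asking for the input? Would you ask in the JobConf section of the program? If so, how do I pass the string into the job?
If it belongs in the map() function, how would you implement it? Wouldn't it just ask for the search string every time map() is called?
Here is the main method and the JobConf() section, which should give you an idea: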
public static void main(String[] args) throws IOException {
    // This produces an output file in which each line contains a separate word followed by
    // the total number of occurrences of that word in all the input files.
    JobConf job = new JobConf();
    FileInputFormat.setInputPaths(job, new Path("input"));
    FileOutputFormat.setOutputPath(job, new Path("output"));

    // Output from reducer maps words to counts.
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);

    // The output of the mapper is a map from words (including duplicates) to the value 1.
    job.setMapperClass(InputMapper.class);

    // The output of the reducer is a map from unique words to their total counts.
    job.setReducerClass(CountWordsReducer.class);

    JobClient.runJob(job);
}
And the map() function:
public void map(LongWritable key, Text value, OutputCollector<Text, LongWritable> output, Reporter reporter) throws IOException {
    // The key is the character offset within the file of the start of the line, ignored.
    // The value is a line from the file.

    // This is me trying to hard-code it. I would prefer an explanation on how to get interactive input!
    String inputString = "data";
    String line = value.toString();

    // Check the line once, outside the loop. With the check inside the loop, a line
    // that does not contain the search string never advances the scanner and loops forever.
    if (line.contains(inputString)) {
        Scanner scanner = new Scanner(line);
        while (scanner.hasNext()) {
            String line1 = scanner.next();
            output.collect(new Text(line1), new LongWritable(1));
        }
        scanner.close();
    }
}
I have been led to believe that I don't need a reducer stage to solve this. Any advice or explanations would be much appreciated!
The JobConf class is an extension of the Configuration class, so you can set custom properties:
JobConf job = new JobConf();
job.set("inputString", "data");
...
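Then, as the Mapper documentation states: Mapper implementations can access the JobConf for the job via JobConfigurable.configure(JobConf) and initialize themselves. This means you have to override that method in your Mapper to get the desired parameter: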
private static String inputString;

// Called once per task before any map() call, so the lookup is not repeated per record.
public void configure(JobConf job) {
    inputString = job.get("inputString");
}
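To address the worry that map() would ask for the search string on every call, here is a minimal Hadoop-free sketch of the same pattern. Everything in it is a hypothetical stand-in: `java.util.Properties` plays the role of JobConf, `SearchMapper` plays the role of the Mapper, and `run` plays the role of the driver. It only illustrates that the value is set once in the driver, read once in configure(), and then reused by every map() call; it is not how Hadoop itself is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hadoop-free stand-in for the configure-once, map-many-times pattern.
// All class and method names here are hypothetical, not Hadoop APIs.
class SearchSketch {

    static class SearchMapper {
        private String inputString; // set once in configure(), reused by map()

        // Stand-in for JobConfigurable.configure(JobConf): runs once before any map() call.
        void configure(Properties job) {
            inputString = job.getProperty("inputString");
        }

        // Stand-in for map(): runs once per input line and returns the words to emit.
        List<String> map(String line) {
            List<String> words = new ArrayList<>();
            if (line.contains(inputString)) {
                for (String word : line.trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        words.add(word);
                    }
                }
            }
            return words;
        }
    }

    // Stand-in for the driver: set the property once, configure once, then map every line.
    static List<String> run(String search, List<String> lines) {
        Properties job = new Properties();
        job.setProperty("inputString", search);

        SearchMapper mapper = new SearchMapper();
        mapper.configure(job);

        List<String> out = new ArrayList<>();
        for (String line : lines) {
            out.addAll(mapper.map(line));
        }
        return out;
    }

    public static void main(String[] args) {
        // The search string comes from the command line, falling back to "data".
        String search = args.length > 0 ? args[0] : "data";
        System.out.println(run(search, Arrays.asList("some data here", "nothing relevant", "more data")));
    }
}
```

In a real job the driver's main method could take the search string from args in the same way and hand it to job.set("inputString", ...), so no prompting happens inside the mapper at all.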
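Note that all of this uses the old API. With the new API it is easier to access the configuration, because the context (and with it the configuration, via context.getConfiguration()) is passed as an argument to the map method.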