Input split and block in Hadoop
My file size is 100 MB and the default block size is 64 MB. If I do not set an input split size, the default split size is the block size, so the split size here is also 64 MB.
When I load this 100 MB file into HDFS, it is divided into 2 blocks: 64 MB and 36 MB. For example, take the song lyrics below and suppose they are 100 MB in size. If I load this data into HDFS, everything from line 1 to the middle of line 16 comes to exactly 64 MB and forms one split/block (up to "It made the"), and the rest of line 16 ("children laugh and play") through the end of the file forms the second block (36 MB). There will be two mapper tasks.
My question is: how does the first mapper handle line 16 (i.e., line 16 of block 1), given that the block contains only half of that line? And how does the second mapper handle the first line of block 2, given that it, too, contains only half a line?
Mary had a little lamb
Little lamb, little lamb
Mary had a little lamb
Its fleece was white as snow
And everywhere that Mary went
Mary went, Mary went
Everywhere that Mary went
The lamb was sure to go
He followed her to school one day
School one day, school one day
He followed her to school one day
Which was against the rule
It made the children laugh and play
Laugh and play, laugh and play
It made the children laugh and play
To see a lamb at school
And so the teacher turned him out
Turned him out, turned him out
And so the teacher turned him out
But still he lingered near
And waited patiently
Patiently, patiently
And wai-aited patiently
Til Mary did appear
Or, when splitting at 64 MB, does Hadoop keep the whole of line 16 together rather than splitting a single line?
The first mapper will read the whole of line 16 (it keeps reading until it finds an end-of-line character).
If you recall, for MapReduce to be applied, your input has to be organized as key-value pairs. For TextInputFormat, which happens to be the default in Hadoop, those pairs are (offset_from_file_beginning, line_of_text). The text is broken into key-value pairs on the '\n' character. So if a line of text runs past the end of the input split, the mapper simply keeps reading until it finds a '\n'.
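As an illustration, a minimal mapper over TextInputFormat might look like the sketch below (the class name and output types are hypothetical choices for this example; the point is that map() is called once per line, with the line's byte offset as the key):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With TextInputFormat, map() is invoked once per line of text:
// the key is the line's byte offset from the beginning of the file,
// the value is the line itself (without the trailing '\n').
public class LineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  private static final LongWritable ONE = new LongWritable(1);

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Most mappers ignore the offset key and work only with the line.
    context.write(line, ONE);
  }
}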
In Hadoop, data is read according to the input split size and the block size. The file is divided into multiple FileSplits based on its size, and each input split is initialized with a start parameter corresponding to its offset within the input.
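For reference, the split size itself comes from FileInputFormat; condensed, the calculation amounts to the following (a paraphrase of FileInputFormat.computeSplitSize(), not the verbatim source):

// With the default min/max split settings this collapses to the HDFS
// block size, which is why a 100 MB file with 64 MB blocks produces
// two splits of 64 MB and 36 MB.
protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
  return Math.max(minSize, Math.min(maxSize, blockSize));
}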
When the LineRecordReader is initialized, it instantiates a LineReader that starts reading lines.
If a CompressionCodec is defined, the codec takes care of the boundaries. Otherwise, if the start of the InputSplit is not 0, the reader backs up one character and then skips its first line (everything up to the next \n or \r\n). Backing up one character guarantees that no valid line is skipped: if the split happens to begin exactly at a line boundary, the skipped "line" is just the preceding newline character.
The code is as follows:
if (codec != null) {
  // Compressed input: the codec handles record boundaries, so read
  // the stream to its end.
  in = new LineReader(codec.createInputStream(fileIn), job);
  end = Long.MAX_VALUE;
} else {
  if (start != 0) {
    // Not the first split: back up one character and remember to
    // skip the partial first line.
    skipFirstLine = true;
    --start;
    fileIn.seek(start);
  }
  in = new LineReader(fileIn, job);
}
if (skipFirstLine) { // skip first line and re-establish "start".
  start += in.readLine(new Text(), 0,
                       (int) Math.min((long) Integer.MAX_VALUE, end - start));
}
this.pos = start;
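The complementary half of the logic lives in the reader's next() method: records keep being emitted while the current position is before the end of the split, so the line that straddles the boundary is read to completion by the first split, and, thanks to the skip above, by that split only. A condensed sketch (paraphrased from the same LineRecordReader; details vary between Hadoop versions):

// Keep reading whole lines while the current position is inside the
// split. A line that *starts* before 'end' is read to its '\n' even
// if that takes the reader past the split boundary into the next block.
public synchronized boolean next(LongWritable key, Text value) throws IOException {
  while (pos < end) {
    key.set(pos);                      // key = byte offset of the line
    int newSize = in.readLine(value, maxLineLength);
    if (newSize == 0) {
      return false;                    // end of stream
    }
    pos += newSize;
    if (newSize < maxLineLength) {
      return true;                     // emit (offset, line)
    }
    // Line longer than maxLineLength: discard it and try the next one.
  }
  return false;                        // position has passed the split end
}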
Since the splits are computed on the client side, the mappers do not need to run in sequence; each mapper already knows whether it must discard its first line.
So in your case, the first block B1 reads the data from offset 0 through the end of the line "It made the children laugh and play", and block B2 reads the data from the line "To see a lamb at school" to the offset of the last line.
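To see both rules without a cluster, here is a small self-contained toy simulation (plain Java, not Hadoop code), with the split boundary placed mid-line, right after "It made the", to mirror your example:

/** Toy simulation of LineRecordReader's boundary handling; not Hadoop code. */
public class SplitBoundaryDemo {

  // Emit (offset, line) records for one split of the data.
  static void readSplit(String data, int start, int end) {
    if (start != 0) {
      start--;                                // back up one char, so a split that
      start = data.indexOf('\n', start) + 1;  // begins exactly on a line boundary
    }                                         // only skips the '\n', not a full line
    int pos = start;
    while (pos < end) {                       // a line that starts before 'end' is
      int nl = data.indexOf('\n', pos);       // read to completion, even when it
      if (nl < 0) nl = data.length();         // runs past the split boundary
      System.out.println("  (" + pos + ", \"" + data.substring(pos, nl) + "\")");
      pos = nl + 1;
    }
  }

  public static void main(String[] args) {
    String data = "Which was against the rule\n"
                + "It made the children laugh and play\n"
                + "To see a lamb at school\n";
    int boundary = data.indexOf("It made the") + "It made the".length();

    System.out.println("split 1 (0.." + boundary + "):");
    readSplit(data, 0, boundary);             // reads the straddling line in full
    System.out.println("split 2 (" + boundary + ".." + data.length() + "):");
    readSplit(data, boundary, data.length()); // skips its partial first line
  }
}

Running it prints the first two lines as records of split 1 and only "To see a lamb at school" as a record of split 2, exactly the behavior described above.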
For further reading, you can refer to these:
https://hadoopabcd.wordpress.com/2015/03/10/hdfs-file-block-and-input-split/
How does Hadoop process records split across block boundaries?