Hadoop ERROR streaming
I have this mapper:
join1_mapper.py
#!/usr/bin/env python
import sys
for line in sys.stdin:
    line = line.strip()
    key_value = line.split(",")
    key_in = key_value[0].split(" ")
    value_in = key_value[1]
    if len(key_in) >= 2:
        date = key_in[0]
        word = key_in[1]
        value_out = date + " " + value_in
        print('%s\t%s' % (word, value_out))
    else:
        print('%s\t%s' % (key_in[0], value_in))
And I have this reducer:
#!/usr/bin/env python
import sys
prev_word = " "
months = ['Jan','Feb','Mar','Apr','Jun','Jul','Aug','Sep','Nov','Dec']
dates_to_output = []
day_cnts_to_output = []
line_cnt = 0
for line in sys.stdin:
    line = line.strip()
    key_value = line.split('\t')
    line_cnt = line_cnt + 1
    curr_word = key_value[0]
    value_in = key_value[1]
    if curr_word != prev_word:
        if line_cnt > 1:
            for i in range(len(dates_to_output)):
                print('{0} {1} {2} {3}'.format(dates_to_output[i], prev_word, day_cnts_to_output[i], curr_word_total_cnt))
            dates_to_output = []
            day_cnts_to_output = []
        prev_word = curr_word
    if value_in[0:3] in months:
        date_day = value_in.split()
        dates_to_output.append(date_day[0])
        day_cnts_to_output.append(date_day[1])
    else:
        curr_word_total_cnt = value_in
for i in range(len(dates_to_output)):
    print('{0} {1} {2} {3}'.format(dates_to_output[i], prev_word, day_cnts_to_output[i], curr_word_total_cnt))
When I run the job:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -input /user/cloudera/input -output /user/cloudera/output_join -mapper /home/cloudera/join1_mapper.py -reducer /home/cloudera/join1_reducer.py
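As an aside, a streaming pipeline like this can be smoke-tested locally before submitting to Hadoop, because Hadoop streaming simply feeds lines to the scripts on stdin and sorts by key between the two stages. The sample input lines below are hypothetical and only illustrate the `date word,count` and `word,count` formats the mapper expects:

```shell
# Simulate Hadoop streaming locally: mapper -> sort (the shuffle) -> reducer.
# The sample lines are made up; replace them with a slice of the real input.
printf 'Jan-01 able,5\nable,30\n' \
  | python join1_mapper.py \
  | sort -k1,1 \
  | python join1_reducer.py
```

If the scripts crash here, they will also crash inside Hadoop with the same `subprocess failed with code 1` error, so this is usually the fastest way to find a bug in the Python code itself.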
I get this error:
ERROR streaming.StreamJob: Job not successful!
Streaming Command Failed!
The first part of the log says:
packageJobJar: [] [/usr/jars/hadoop-streaming-2.6.0-cdh5.4.2.jar] /tmp/streamjob7178107162745054499.jar tmpDir=null
15/11/13 02:03:42 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/11/13 02:03:42 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/11/13 02:03:43 INFO mapred.FileInputFormat: Total input paths to process : 4
15/11/13 02:03:43 INFO mapreduce.JobSubmitter: number of splits:5
15/11/13 02:03:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1445251653083_0013
15/11/13 02:03:44 INFO impl.YarnClientImpl: Submitted application application_1445251653083_0013
15/11/13 02:03:44 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1445251653083_0013/
15/11/13 02:03:44 INFO mapreduce.Job: Running job: job_1445251653083_0013
15/11/13 02:03:53 INFO mapreduce.Job: Job job_1445251653083_0013 running in uber mode : false
15/11/13 02:03:53 INFO mapreduce.Job: map 0% reduce 0%
15/11/13 02:04:19 INFO mapreduce.Job: map 40% reduce 0%
15/11/13 02:04:19 INFO mapreduce.Job: Task Id : attempt_1445251653083_0013_m_000002_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
15/11/13 02:04:19 INFO mapreduce.Job: Task Id : attempt_1445251653083_0013_m_000003_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1 at
I looked at the tracking URL http://quickstart.cloudera:8088/proxy/application_1445251653083_0013/ for any help, but it is not clear to me what I should do. I don't understand where the error is. Can someone help me?
I have solved it. The HDFS input directory must contain only the TXT files used by the computation. In my case there were other files in it as well. I created another directory, copied the TXT files into it, and ran the job again against the new HDFS input directory. Now it works.
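For reference, a clean input directory can be set up with commands along these lines (the directory name `input_clean` is illustrative, not required; the paths match the ones used in the job above):

```shell
# Create a fresh HDFS directory and copy only the .txt input files into it,
# leaving any other files behind so the streaming mapper never sees them.
hdfs dfs -mkdir -p /user/cloudera/input_clean
hdfs dfs -cp '/user/cloudera/input/*.txt' /user/cloudera/input_clean/
# Then re-run the streaming job with -input /user/cloudera/input_clean
```

Alternatively, pointing `-input` at a glob such as `/user/cloudera/input/*.txt` should also restrict the job to the TXT files without copying anything.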