Most efficient way to edit the first line of a large input stream one way, and all other lines a different way?
The problem... (N=2*10^7)
Going from this:
colName1 colName2 colName3 ... colNameN
1 x x ... x
2 x x ... x
1 y x ... x
2 y x ... x
... ... ... ... ...
1 xx xx ... xx
2 xx xx ... xx
To this:
Sample colName1 colName2 colName3 ... colNameN
A 1 x x ... x
A 2 x x ... x
B 1 y x ... x
B 2 y x ... x
... ... ... ... ... ...
N 1 xx xx ... xx
N 2 xx xx ... xx
Problem:
I need to add "Sample" to the first ("header") line, and add the corresponding sample name to every line after it. The sample name will be stored in an object.
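Confounding issues:
- The data comes from an input stream; it is currently handled via subprocess.PIPE
- It is common for a file to have 20 million lines, so would checking a firstLine flag on every line be expensive?
I am wondering whether there is a way to do something to only the first line of an input stream.
Or...
Would it be easier to treat all lines the same, meaning we add the sample name to the header line as well, and then edit the first word in the file from the sample name to "Sample\t"?
How costly would that approach be?
Currently, I have a firstLine flag, as shown below.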
import subprocess

fileSTREAM = subprocess.Popen(callString, stdout=subprocess.PIPE, shell=True)
# Flag for the first line of the stream, which holds the column headers.
firstLine = True
# Add a word to the front of each line of input.
for line in fileSTREAM.stdout:
    # Decode the input from bytes to str.
    currLine = line.decode("utf-8")
    # The first line is different: add SAMPLE instead of the actual sample name.
    if firstLine:
        outputTARGET.write("SAMPLE \t%s" % currLine)
        firstLine = False
    # For all other lines, add the sample name instead of the word SAMPLE.
    else:
        outputTARGET.write(str(wildcards.samples) + "\t%s" % currLine)
This may not be a Python-specific question, but I am looking for a Python-specific solution.
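Shout out to @Prune, thank you :)
The best way is to read the first line of the input stream on its own, before the loop. Python has good built-in functions for this.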
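As a minimal sketch of that idea, assuming the stream is any iterable of text lines: the built-in `next()` pulls the header off the iterator once, and the loop then sees only the data lines, so no per-line flag is needed. The helper name `prefix_lines`, its parameters, and the sample data are made up for illustration and are not from the original post.

```python
import io


def prefix_lines(lines, out, sample_name, header_label="Sample"):
    """Write each line to `out` with a prefix; the header gets `header_label`.

    Hypothetical helper for illustration. `lines` is any iterable of text
    lines that still carry their trailing newline.
    """
    it = iter(lines)
    # Consume the header once; the default avoids StopIteration on empty input.
    header = next(it, None)
    if header is None:
        return
    out.write(f"{header_label}\t{header}")
    # The iterator resumes at the second line, so no flag check is needed here.
    for line in it:
        out.write(f"{sample_name}\t{line}")


# Small in-memory stand-ins for the real stream and output target.
source = io.StringIO("colName1\tcolName2\n1\tx\n2\tx\n")
target = io.StringIO()
prefix_lines(source, target, sample_name="A")
result = target.getvalue()
```

Calling `readline()` on a file object before looping over it has the same effect as `next()` here.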
In the end I used this:
import subprocess

# Run the command and capture its output so each line can be modified.
fileSTREAM = subprocess.Popen(callString, stdout=subprocess.PIPE, shell=True)
# Read and edit just the first line, adding 'SAMPLE' to the header line.
outputTARGET.write("SAMPLE \t%s" % fileSTREAM.stdout.readline().decode("utf-8"))
# Add the sample name to each line after the header line.
for line in fileSTREAM.stdout:
    # Decode the input from bytes to str.
    outputTARGET.write(str(wildcards.samples) + "\t%s" % line.decode("utf-8"))
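A possible variation, sketched under assumptions: opening the pipe in text mode (`text=True`, available since Python 3.7) lets `subprocess` do the UTF-8 decoding, so the loop body no longer calls `.decode()` on every line. Whether this is measurably faster on 20 million lines is not something I have tested. The child command, `sample_name`, and the `io.StringIO` target below are stand-ins for `callString`, `wildcards.samples`, and `outputTARGET`.

```python
import io
import subprocess
import sys

# Stand-in child process (an assumption for this sketch): prints a header and two rows.
child_code = "print('colName1\\tcolName2'); print('1\\tx'); print('2\\tx')"
sample_name = "A"    # hypothetical sample name
out = io.StringIO()  # stands in for outputTARGET

with subprocess.Popen(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE,
    text=True,           # the pipe yields str instead of bytes
    encoding="utf-8",
) as proc:
    # readline() returns "" at end of stream, so an empty stream writes nothing.
    header = proc.stdout.readline()
    if header:
        out.write("Sample\t" + header)
    # Iteration continues from the second line of the stream.
    for line in proc.stdout:
        out.write(sample_name + "\t" + line)

output = out.getvalue()
```

Passing the command as a list also avoids `shell=True`, which is only needed when the command relies on shell features such as pipes or globbing.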