使用 pandas 或枚举在两个数据集之间查找 overlaps/range
find overlaps/range between two datasets using pandas or enumerate
我正在尝试对具有以下条件的两个文件进行一些间隔范围操作
检查 chrom 是否相等,然后检查我的 co0rdinatefile 的开始和结束是否在或等于 gene_annotation 文件的开始和结束(条件如果链是“+”,开始和结束将是例如 10- 20,如果它的“-”将是 20-10) ,如果从坐标和 gene_id、gene_name 从基因注释文件匹配打印开始结束链。 (出于表示目的,我有头部注释文件)
注释文件中的行数~50000
协调文件中的行数 ~200,000
gene_annotation文件
chrom start end gene_id gene_name strand
17 71223692 71274336 ENSMUSG00000085299 Gm16627 -
17 18186448 18211184 ENSMUSG00000067978 Vmn2r-ps113 +
11 84645863 84684319 ENSMUSG00000020530 Ggnbp2 -
7 51097639 51106551 ENSMUSG00000074155 Klk5 +
13 31711037 31712238 ENSMUSG00000087276 Gm11378 +
coordinates_file
chrom start end strand
1 4247322 4247912 -
1 4427449 4432604 +
1 4763414 4764404 -
1 4764597 4767606 -
1 4764597 4766491 -
1 4766882 4767606 -
1 4767729 4772649 -
1 4767729 4768829 -
1 4767729 4775654 -
1 4772382 4772649 -
1 4772814 4774032 -
1 4772814 4774159 -
1 4772814 4775654 -
1 4772814 4774032 +
1 4774186 4775654 -
1 4774186 4775654
1 4774186 4775699 -
期望的输出
chrom, start, end,strand, gene_id, gene_name
1 4427432 4432686 + ENSMUSG0001 abcd
另一个问题是在某些情况下,如果存在匹配项,它可能会映射到 gene_id 在这种情况下,我想写
chrom, start, end,strand, gene_id, gene_name
1 4427432 4432686 + ENSMUSG0001,ENSMUSG0002 abcd,efgh
到目前为止我的代码:
import csv
with open('coordinates.txt', 'r') as source:
coordinates = list(csv.reader(source, delimiter="\t"))
with open('/gene_annotations.txt', 'rU') as source:
#if i do not use 'rU' i get this error Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
annotations = list(csv.reader(source, delimiter="\t"))
for index,line in enumerate(coordinates):
for index2, line2 in enumerate(annotations):
if coordinates[line][0] == annotations[line2][0] and coordinates[line][1] <= annotations[line2][1] and annotations[line2][2] >= coordinates[line][2] :
print "%s\t%s\t%s\t%s\t%s" % (coordinates[line][0],coordinates[line][1],coordinates[line][2], annotations[line2][3], annotations[line2][4])
break
我得到的错误
---> 15 if coordinates[line][0] == annotations[line2][0] and coordinates[line][1] <= annotations[line2][1] and annotations[line2][2] >= coordinates[line][2] :
16 print "%s\t%s\t%s\t%s\t%s" % (coordinates[line][0],coordinates[line][1],coordinates[line][2], annotations[line2][3], annotations[line2][4])
17 break
TypeError: list indices must be integers, not list
pandas 会是一个好的方法吗?
我假设坐标是列表的列表,例如 [[1,2],[3,4]] 。
该行
for index,line in enumerate(coordinates):
遍历坐标,将每一行坐标返回为线,将索引返回为索引。
if coordinates[line][0] == annotations[line2][0] and coordinates[line][1] <= annotations[line2][1] and annotations[line2][2] >= coordinates[line][2] :
错误信息的意思是你在这里使用的是一个列表(行)作为索引。您可能想使用索引而不是行:
if coordinates[index][0] == annotations[index2][0] and coordinates[index][1] <= annotations[index2][1] and annotations[index2][2] >= coordinates[index][2] :
更好的方法是只使用行:
if line[0] == line2[0] and line[1] <= line2[1] and line2[2] >= line[2] :
见https://docs.python.org/2.7/reference/compound_stmts.html?highlight=for_stmt#grammar-token-for_stmt
有专门用于此的工具,例如 bedtools
相交;
https://bedtools.readthedocs.io/en/latest/content/tools/intersect.html
与 Bedops 相交;
https://bedops.readthedocs.io/en/latest/content/reference/set-operations/bedops.html
我正在尝试对具有以下条件的两个文件进行一些间隔范围操作 检查 chrom 是否相等,然后检查我的 co0rdinatefile 的开始和结束是否在或等于 gene_annotation 文件的开始和结束(条件如果链是“+”,开始和结束将是例如 10- 20,如果它的“-”将是 20-10) ,如果从坐标和 gene_id、gene_name 从基因注释文件匹配打印开始结束链。 (出于表示目的,我有头部注释文件)
注释文件中的行数~50000 协调文件中的行数 ~200,000
gene_annotation文件
chrom start end gene_id gene_name strand
17 71223692 71274336 ENSMUSG00000085299 Gm16627 -
17 18186448 18211184 ENSMUSG00000067978 Vmn2r-ps113 +
11 84645863 84684319 ENSMUSG00000020530 Ggnbp2 -
7 51097639 51106551 ENSMUSG00000074155 Klk5 +
13 31711037 31712238 ENSMUSG00000087276 Gm11378 +
coordinates_file
chrom start end strand
1 4247322 4247912 -
1 4427449 4432604 +
1 4763414 4764404 -
1 4764597 4767606 -
1 4764597 4766491 -
1 4766882 4767606 -
1 4767729 4772649 -
1 4767729 4768829 -
1 4767729 4775654 -
1 4772382 4772649 -
1 4772814 4774032 -
1 4772814 4774159 -
1 4772814 4775654 -
1 4772814 4774032 +
1 4774186 4775654 -
1 4774186 4775654
1 4774186 4775699 -
期望的输出
chrom, start, end,strand, gene_id, gene_name
1 4427432 4432686 + ENSMUSG0001 abcd
另一个问题是在某些情况下,如果存在匹配项,它可能会映射到 gene_id 在这种情况下,我想写
chrom, start, end,strand, gene_id, gene_name
1 4427432 4432686 + ENSMUSG0001,ENSMUSG0002 abcd,efgh
到目前为止我的代码:
import csv
with open('coordinates.txt', 'r') as source:
coordinates = list(csv.reader(source, delimiter="\t"))
with open('/gene_annotations.txt', 'rU') as source:
#if i do not use 'rU' i get this error Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
annotations = list(csv.reader(source, delimiter="\t"))
for index,line in enumerate(coordinates):
for index2, line2 in enumerate(annotations):
if coordinates[line][0] == annotations[line2][0] and coordinates[line][1] <= annotations[line2][1] and annotations[line2][2] >= coordinates[line][2] :
print "%s\t%s\t%s\t%s\t%s" % (coordinates[line][0],coordinates[line][1],coordinates[line][2], annotations[line2][3], annotations[line2][4])
break
我得到的错误
---> 15 if coordinates[line][0] == annotations[line2][0] and coordinates[line][1] <= annotations[line2][1] and annotations[line2][2] >= coordinates[line][2] :
16 print "%s\t%s\t%s\t%s\t%s" % (coordinates[line][0],coordinates[line][1],coordinates[line][2], annotations[line2][3], annotations[line2][4])
17 break
TypeError: list indices must be integers, not list
pandas 会是一个好的方法吗?
我假设坐标是列表的列表,例如 [[1,2],[3,4]] 。 该行
for index,line in enumerate(coordinates):
遍历坐标,将每一行坐标返回为线,将索引返回为索引。
if coordinates[line][0] == annotations[line2][0] and coordinates[line][1] <= annotations[line2][1] and annotations[line2][2] >= coordinates[line][2] :
错误信息的意思是你在这里使用的是一个列表(行)作为索引。您可能想使用索引而不是行:
if coordinates[index][0] == annotations[index2][0] and coordinates[index][1] <= annotations[index2][1] and annotations[index2][2] >= coordinates[index][2] :
更好的方法是只使用行:
if line[0] == line2[0] and line[1] <= line2[1] and line2[2] >= line[2] :
见https://docs.python.org/2.7/reference/compound_stmts.html?highlight=for_stmt#grammar-token-for_stmt
有专门用于此的工具,例如 bedtools
相交;
https://bedtools.readthedocs.io/en/latest/content/tools/intersect.html
与 Bedops 相交;
https://bedops.readthedocs.io/en/latest/content/reference/set-operations/bedops.html