pandas count unique if condition true
I am trying to write a Python script to generate counts from the following dataframe. In Excel I would use COUNTIFS, but the duplicates in 'Sample' and 'Region' cause problems with that approach.
Example input df:
Sample Chr Start End Region Size Strand Chr2 Start2 End2 Coverage Overlap
101 chr1 198661465 198661475 NM_002838_PTPRC_intron_2_R 10 + chr1 198608563 198661471 0 6
101 chr1 198661465 198661475 NM_001267798_PTPRC_intron_2_R 10 + chr1 198608563 198661471 0 6
101 chr1 198661465 198661475 NM_080921_PTPRC_intron_2_R 10 + chr1 198608563 198661471 0 6
101 chr1 236966727 236966942 NM_000254_MTR_cds_2 215 + chr1 236966742 236966743 11 1
101 chr1 236966727 236966942 NM_001291939_MTR_cds_2 215 + chr1 236966742 236966743 11 1
101 chr1 236966742 236966942 NM_001291940_MTR_5utr_2 200 + chr1 236966742 236966743 11 1
101 chr1 236979843 236979853 NM_000254_MTR_intron_8_L 10 + chr1 236979846 236979847 9 1
101 chr1 236979843 236979853 NM_000254_MTR_intron_8_L 10 + chr1 236979847 236979848 8 1
101 chr1 236979843 236979853 NM_000254_MTR_intron_8_L 10 + chr1 236979848 236979852 7 4
101 chr1 236979843 236979853 NM_000254_MTR_intron_8_L 10 + chr1 236979852 236979854 6 1
101 chr1 236979843 236979853 NM_001291940_MTR_intron_8_L 10 + chr1 236979846 236979847 9 1
101 chr1 236979843 236979853 NM_001291940_MTR_intron_8_L 10 + chr1 236979847 236979848 8 1
101 chr1 236979843 236979853 NM_001291940_MTR_intron_8_L 10 + chr1 236979848 236979852 7 4
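So a single sample can list the same 'Region' multiple times (with different coordinates, but that does not matter for the counts).
Desired output 1 - count per 'Sample' if 'Region' contains "utr" or "intron" or "cds", counting each duplicated 'Region' only once per 'Sample':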
Sample Total Intron UTR CDS
101 68 40 13 15
102 64 38 13 13
Desired output 2 - sum of 'Overlap' per 'Sample' if 'Region' contains "utr" or "intron" or "cds":
Sample Total Intron UTR CDS
101 2838 321 1433 1084
102 2524 291 1449 784
Desired output 3 - list of each 'Region' with the number of samples in which that 'Region' is listed:
Region Num Samples
ENST00000390559_IGHM_cds_4 2
ENST00000390559_IGMH_cds_1 2
ENST00000390559_IGMH_cds_2 2
ENST00000390559_IGMH_cds_3 12
ENST00000390559_IGMH_intron_1_L 2
ENST00000390559_IGMH_intron_1_R 2
ENST00000390559_IGMH_intron_2_L 10
Edit:
I have figured out how to get output #3:
df.groupby('Region').Sample.nunique()
And I can get the Total column for output #1:
df.groupby('Sample').Region.nunique()
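As a quick check of why `nunique()` copes with the duplicates, here is a minimal sketch on made-up toy data (the region names and values below are invented for illustration and are not taken from the real input). It shows a repeated 'Region' within one 'Sample' being counted only once:

```python
import pandas as pd

# Toy data (hypothetical): sample 101 lists region "A_intron_1" twice.
toy = pd.DataFrame({
    "Sample": [101, 101, 101, 102],
    "Region": ["A_intron_1", "A_intron_1", "B_cds_2", "A_intron_1"],
})

# Unique regions per sample: the repeated row for sample 101 is counted once.
regions_per_sample = toy.groupby("Sample")["Region"].nunique()

# Unique samples per region: "A_intron_1" appears in both samples.
samples_per_region = toy.groupby("Region")["Sample"].nunique()

print(regions_per_sample)
print(samples_per_region)
```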
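Now I just need to figure out how to filter my groups so that they only include 'utr/cds/intron', and how to sum 'Overlap' for the filtered groups.
In case anyone runs into a similar problem, this is what I came up with to generate the three outputs described. It may not be the most elegant solution, but it works!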
import pandas as pd
import argparse
import os
import sys
#arguments
parser = argparse.ArgumentParser(description="Generate counts by sample and total bases by sample of low coverage regions")
parser.add_argument("-i", "--input", help="input filename", required=True)
parser.add_argument("-o", "--output", help="output basename", required=True)
args = parser.parse_args()
#output filenames
region_count_file = args.output + "_region_count.txt"
bases_count_file = args.output + "_bases_count.txt"
sample_count_file = args.output + "_sample_count.txt"
#read in
df = pd.read_table(args.input)
#check output doesn't exist
if os.path.exists(region_count_file) or os.path.exists(bases_count_file) or os.path.exists(sample_count_file):
    sys.exit("ERROR: output basename %s files already exist" % args.output)
#for filtering on different regions
intron = df['Region'].str.contains('intron')
utr = df['Region'].str.contains('utr')
cds = df['Region'].str.contains('cds')
#count regions per sample
unique_regions = df.groupby('Sample').Region.nunique()
unique_intron = df[intron].groupby('Sample').Region.nunique()
unique_utr = df[utr].groupby('Sample').Region.nunique()
unique_cds = df[cds].groupby('Sample').Region.nunique()
#sum bases per sample
bases_total = df.groupby(['Sample'])['Overlap'].sum()
bases_intron = df[intron].groupby(['Sample'])['Overlap'].sum()
bases_utr = df[utr].groupby(['Sample'])['Overlap'].sum()
bases_cds = df[cds].groupby(['Sample'])['Overlap'].sum()
#count samples per region
samples_per_region = df.groupby('Region').Sample.nunique()
#format regions per sample for output
combine_region_count = pd.concat([unique_regions,unique_intron,unique_utr,unique_cds], axis=1)
combine_region_count.columns = 'Total','Intron','UTR','CDS'
#format bases per sample for output
combine_bases = pd.concat([bases_total,bases_intron,bases_utr,bases_cds], axis=1)
combine_bases.columns = 'Total','Intron','UTR','CDS'
#format samples per region for output
#reset_index returns a new DataFrame rather than modifying in place, so the result must be assigned
samples_per_region = samples_per_region.reset_index(name='Num Samples')
#output each
combine_region_count.to_csv(region_count_file,sep='\t')
combine_bases.to_csv(bases_count_file,sep='\t')
samples_per_region.to_csv(sample_count_file,sep='\t',index=False)
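For anyone who wants something more compact, the per-category filtering above could probably be folded into a single grouped pass. This is only a sketch, not a drop-in replacement: the helper name `summarize`, the extra `Kind` column and the toy data are all made up for illustration. It also assumes each 'Region' contains at most one of the three keywords, because `str.extract` keeps only the first match, whereas the separate `str.contains` masks above would count a region under every keyword it contains.

```python
import pandas as pd

KINDS = ["intron", "utr", "cds"]  # keywords searched for in 'Region'

def summarize(df):
    """Return (unique-region counts, Overlap sums) per Sample (hypothetical helper)."""
    # Label each row with the first keyword found in 'Region' (NaN if none).
    kind = df["Region"].str.extract(r"(intron|utr|cds)", expand=False)
    grouped = df.assign(Kind=kind).groupby(["Sample", "Kind"])

    def table(total, by_kind):
        # Pivot the Kind level into columns; missing combinations become 0.
        wide = by_kind.unstack(fill_value=0)
        wide = wide.reindex(index=total.index, columns=KINDS, fill_value=0)
        wide.columns = ["Intron", "UTR", "CDS"]
        return pd.concat([total.rename("Total"), wide], axis=1)

    # Totals use every row, matching the ungrouped totals in the script above.
    counts = table(df.groupby("Sample")["Region"].nunique(),
                   grouped["Region"].nunique())
    bases = table(df.groupby("Sample")["Overlap"].sum(),
                  grouped["Overlap"].sum())
    return counts, bases

# Toy data (hypothetical) with one duplicated region and one unlabelled region.
toy = pd.DataFrame({
    "Sample": [101, 101, 101, 101, 102, 102],
    "Region": ["A_intron_1", "A_intron_1", "B_cds_2", "C_5utr_2",
               "A_intron_1", "D_other"],
    "Overlap": [6, 4, 1, 1, 3, 2],
})
counts, bases = summarize(toy)
print(counts)
print(bases)
```

Both tables can then be written with `to_csv(..., sep='\t')` exactly as before. Note that 'Total' covers every row, including regions that match none of the keywords, so it is not necessarily the sum of the other three columns.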