Parsing text files in VCF format
I want to extract information from a txt file into a data frame. The data contains the following fields:
1) GENEINFO
2) ID
3) POS
4) ALT
5) CLNSIG
6) CLNDN
I wrote the following code to try to get the information from the file, but I don't know how to proceed. Could you give me some ideas?
import io
import pandas as pd

def read_vcf(path):
    # Drop the '##' meta lines but keep the '#CHROM ...' header line
    with open(path, 'r') as f:
        lines = [l for l in f if not l.startswith('##')]
    return pd.read_csv(
        io.StringIO(''.join(lines)),
        dtype={'#CHROM': str, 'POS': int, 'ID': str, 'REF': str, 'ALT': str,
               'QUAL': str, 'FILTER': str, 'INFO': str},
        sep='\t'
    ).rename(columns={'#CHROM': 'CHROM'})

df = read_vcf('clinvar_final.txt')
You can read the file with
df = pd.read_csv('clinvar_final.txt', comment='#', sep='\t')
and then you will have the columns 2) ID, 3) POS and 4) ALT:
print(df[['ID', 'POS', 'ALT']].head())
which gives

       ID      POS ALT
0  475283  1014042   A
1  542074  1014122   T
2  183381  1014143   T
3  542075  1014179   T
4  475278  1014217   T
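One caveat worth checking: pandas drops every line that starts with the `comment` character, so if the file still has the standard VCF header line `#CHROM POS ID ...`, `comment='#'` would skip that line too and the first data row would end up as the column names. It also cuts off the rest of any data line that contains a `#`. A minimal sketch that skips only the `##` meta lines, assuming a standard VCF layout; the sample text and the helper name `read_vcf_text` are made up for illustration and are not taken from the real file:

```python
import io
import pandas as pd

# Hypothetical two-record sample in standard VCF layout (REF values are invented).
SAMPLE_VCF = (
    "##fileformat=VCFv4.1\n"
    "##source=ClinVar\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t1014042\t475283\tG\tA\t.\t.\tGENEINFO=ISG15:9636;CLNSIG=Benign\n"
    "1\t1014122\t542074\tC\tT\t.\t.\tGENEINFO=ISG15:9636;CLNSIG=Uncertain_significance\n"
)

def read_vcf_text(handle):
    """Read VCF-style text, dropping only the '##' meta lines."""
    lines = [line for line in handle if not line.startswith('##')]
    frame = pd.read_csv(
        io.StringIO(''.join(lines)),
        sep='\t',
        dtype={'#CHROM': str, 'ID': str},  # keep chromosome names and IDs as text
    )
    # The header line survives, so only the leading '#' has to be removed.
    return frame.rename(columns={'#CHROM': 'CHROM'})

vcf_df = read_vcf_text(io.StringIO(SAMPLE_VCF))
print(vcf_df[['ID', 'POS', 'ALT']])
```

If `print(df.columns)` already shows `POS`, `ID`, `ALT` and `INFO` after the simple `read_csv` call, the header in your file has no leading `#` and this extra step is not needed.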
The other information (1) GENEINFO, 5) CLNSIG, 6) CLNDN) is stored as a single string in the INFO column. You can use a regex to pull each value out into its own column:
df['GENEINFO'] = df['INFO'].str.extract('GENEINFO=([^;]*)')
df['CLNSIG'] = df['INFO'].str.extract('CLNSIG=([^;]*)')
df['CLNDN'] = df['INFO'].str.extract('CLNDN=([^;]*)')
print(df['GENEINFO'].head())
print(df['CLNSIG'].head())
print(df['CLNDN'].head())
Result:
0 ISG15:9636
1 ISG15:9636
2 ISG15:9636
3 ISG15:9636
4 ISG15:9636
Name: GENEINFO, dtype: object
0 Benign
1 Uncertain_significance
2 Pathogenic
3 Uncertain_significance
4 Benign
Name: CLNSIG, dtype: object
0 Immunodeficiency_38_with_basal_ganglia_calcifi...
1 Immunodeficiency_38_with_basal_ganglia_calcifi...
2 Immunodeficiency_38_with_basal_ganglia_calcifi...
3 Immunodeficiency_38_with_basal_ganglia_calcifi...
4 Immunodeficiency_38_with_basal_ganglia_calcifi...
Name: CLNDN, dtype: object
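If you need more INFO keys later, one regex per key gets repetitive. As a rough sketch, each INFO string can be split on `;` and `=` into a dictionary and the dictionaries expanded into columns. The helper `parse_info` and the sample rows are hypothetical; the sketch assumes entries are separated by `;` and that an entry without `=` is a flag with no value:

```python
import pandas as pd

def parse_info(info):
    """Turn 'K1=V1;K2=V2;FLAG' into {'K1': 'V1', 'K2': 'V2', 'FLAG': True}."""
    fields = {}
    for item in info.split(';'):
        if not item:
            continue  # tolerate a trailing ';'
        key, sep, value = item.partition('=')
        fields[key] = value if sep else True
    return fields

# Invented rows standing in for the real data frame.
sample = pd.DataFrame({
    'ID': ['475283', '183381'],
    'INFO': [
        'CLNDN=Example_condition;CLNSIG=Benign;GENEINFO=ISG15:9636',
        'CLNDN=Example_condition;CLNSIG=Pathogenic;GENEINFO=ISG15:9636',
    ],
})

# One dict per row, expanded into one column per INFO key.
info_cols = pd.DataFrame(sample['INFO'].map(parse_info).tolist(), index=sample.index)
wanted = ['GENEINFO', 'CLNSIG', 'CLNDN']
result = sample[['ID']].join(info_cols[wanted])
print(result)
```

Keys that are missing from a row come out as NaN, which matches what `str.extract` does when the pattern is not found.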
Full code:

import pandas as pd
df = pd.read_csv('clinvar_final.txt', comment='#', sep='\t')
print(df.columns)
print(df[['ID', 'POS', 'ALT']].head())
df['GENEINFO'] = df['INFO'].str.extract('GENEINFO=([^;]*)')
df['CLNSIG'] = df['INFO'].str.extract('CLNSIG=([^;]*)')
df['CLNDN'] = df['INFO'].str.extract('CLNDN=([^;]*)')
print(df['GENEINFO'].head())
print(df['CLNSIG'].head())
print(df['CLNDN'].head())
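To end up with a single data frame holding exactly the six fields from the question, the extracted columns can be selected together with ID, POS and ALT. This is only a sketch: the small `df` below is an invented stand-in for the frame read from the file, and the `(?:^|;)` prefix in the pattern is my own addition so that a key is only matched at the start of an INFO entry.

```python
import pandas as pd

# Hypothetical stand-in for the frame returned by pd.read_csv.
df = pd.DataFrame({
    'POS': [1014042, 1014143],
    'ID': [475283, 183381],
    'ALT': ['A', 'T'],
    'INFO': [
        'CLNDN=Example_condition;CLNSIG=Benign;GENEINFO=ISG15:9636',
        'CLNDN=Example_condition;CLNSIG=Pathogenic;GENEINFO=ISG15:9636',
    ],
})

for key in ['GENEINFO', 'CLNSIG', 'CLNDN']:
    # expand=False returns a Series when the pattern has one capture group
    df[key] = df['INFO'].str.extract(rf'(?:^|;){key}=([^;]*)', expand=False)

# Keep only the six requested fields, in the order listed in the question.
fields = df[['GENEINFO', 'ID', 'POS', 'ALT', 'CLNSIG', 'CLNDN']]
print(fields)
```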