在 Perl 中查找和合并向下间隔
Finding and merging down intervals in Perl
我从文件中输入的内容如下所示:
该文件以制表符作为分隔符,样本按字母顺序排序,第 2 列和第 3 列中的特征按数字排序。我想要做的是找到重叠和包含的特征并将它们合并为一个特征。
SampleA 100 500
SampleA 200 600
SampleA 300 400
SampleA 700 800
SampleA 900 1100
SampleA 1200 1500
SampleA 1400 1700
SampleA 1600 1900
SampleB 400 600
SampleB 700 900
SampleB 1000 1800
SampleB 1500 1600
SampleB 1900 2500
SampleB 2500 2600
SampleB 3000 3600
SampleB 3100 3400
例如:
前三个 SampleA 案例将变为:
Sample A 100 600
我目前的问题是,当我遍历我的数据结构时,我可以找到事件,但是当我试图合并我的样本时,我有点卡住了。
我的想法只是重做我的循环,直到找到并合并所有内容,但我不确定如何实现这一点。
目前,数据存储在这样的二维数组中:
@storage = [SampleA, start, stop]
my $j = 1;
for (my $i = 0; $i < scalar(@storage); $i++) {
if ($storage[$i][0] eq $storage[$j][0]) {
if ($storage[$i][2] > $storage[$j][1] && $storage[$i][2] < $storage[$j][2]) {
print "Found Overlapp!\n";
}elsif ( $storage[$i][2] > $storage[$j][1] && $storage[$i][2] > $storage[$j][2]) {
print "Found Feature in Feature!\n";
}
}
unless ($j == scalar(@storage)){$j++};
}
我的目标是重新运行此循环,直到找不到进一步的匹配项,因此所有间隔都不重叠,但我被困在这里。
我想我会这样做:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my %ranges;
#iterate line by line.
while (<>) {
chomp;
#split by line
my ( $name, $start_range, $end_range ) = split;
#set a variable to see if it's within an existing range.
my $in_range = 0;
#iterate all the existing ones.
foreach my $range ( @{ $ranges{$name} } ) {
#merge if start or end is 'within' this range.
if (
( $start_range >= $range->{start} and $start_range <= $range->{end} )
or
( $end_range >= $range->{start} and $end_range <= $range->{end} )
)
{
## then the start or end is within the existing range, so add to it:
if ( $end_range > $range->{end} ) {
$range->{end} = $end_range;
}
if ( $start_range < $range->{start} ) {
$range->{start} = $start_range;
}
$in_range++;
}
}
#didn't find any matches, so create a new range identity.
if ( not $in_range ) {
push @{ $ranges{$name} }, { start => $start_range, end => $end_range };
}
}
print Dumper \%ranges;
#iterate by sample
foreach my $sample ( sort keys %ranges ) {
#iterate by range (sort by lowest start)
foreach
my $range ( sort { $a->{start} <=> $b->{start} } @{ $ranges{$sample} } )
{
print join "\t", $sample, $range->{start}, $range->{end}, "\n";
}
}
您的数据输出:
SampleA 100 600
SampleA 700 800
SampleA 900 1100
SampleA 1200 1900
SampleB 700 900
SampleB 1000 1800
SampleB 1900 2600
SampleB 3000 3600
虽然这可能不是最有效的算法,因为它会检查 所有 范围 - 但您可能不需要,因为输入数据是有序的 - 您可以只检查 'most recent' 。
由于您的输入经过很好的排序,因此您可以仅使用固定内存对其进行高效过滤。
$_ = <> or exit;
my @sample = split;
while (<>) {
my @newsample = split;
if ($sample[0] ne $newsample[0]
|| $newsample[2] < $sample[1]
|| $sample[2] < $newsample[1]) {
# Unmergeable sample
print "$sample[0]\t$sample[1]\t$sample[2]\n";
@sample = @newsample;
}
elsif ($sample[1] <= $newsample[1] && $newsample[2] <= $sample[2]) {
# @newsample is included in @sample. Nothing to do
}
elsif ($sample[1] <= $newsample[1]) {
# This @newsample raises the upper limit
$sample[2] = $newsample[2];
}
elsif ($newsample[2] <= $sample[2]) {
# This @newsample lowers the lower limit.
$sample[1] = $newsample[1];
}
else {
# This @newsample moves both limits
@sample = @newsample;
}
}
# Output the last sample
print "$sample[0]\t$sample[1]\t$sample[2]\n";
我从文件中输入的内容如下所示: 该文件以制表符作为分隔符,样本按字母顺序排序,第 2 列和第 3 列中的特征按数字排序。我想要做的是找到重叠和包含的特征并将它们合并为一个特征。
SampleA 100 500
SampleA 200 600
SampleA 300 400
SampleA 700 800
SampleA 900 1100
SampleA 1200 1500
SampleA 1400 1700
SampleA 1600 1900
SampleB 400 600
SampleB 700 900
SampleB 1000 1800
SampleB 1500 1600
SampleB 1900 2500
SampleB 2500 2600
SampleB 3000 3600
SampleB 3100 3400
例如: 前三个 SampleA 案例将变为:
Sample A 100 600
我目前的问题是,当我遍历我的数据结构时,我可以找到事件,但是当我试图合并我的样本时,我有点卡住了。
我的想法只是重做我的循环,直到找到并合并所有内容,但我不确定如何实现这一点。
目前,数据存储在这样的二维数组中:
@storage = [SampleA, start, stop]
my $j = 1;
for (my $i = 0; $i < scalar(@storage); $i++) {
if ($storage[$i][0] eq $storage[$j][0]) {
if ($storage[$i][2] > $storage[$j][1] && $storage[$i][2] < $storage[$j][2]) {
print "Found Overlapp!\n";
}elsif ( $storage[$i][2] > $storage[$j][1] && $storage[$i][2] > $storage[$j][2]) {
print "Found Feature in Feature!\n";
}
}
unless ($j == scalar(@storage)){$j++};
}
我的目标是重新运行此循环,直到找不到进一步的匹配项,因此所有间隔都不重叠,但我被困在这里。
我想我会这样做:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my %ranges;
#iterate line by line.
while (<>) {
chomp;
#split by line
my ( $name, $start_range, $end_range ) = split;
#set a variable to see if it's within an existing range.
my $in_range = 0;
#iterate all the existing ones.
foreach my $range ( @{ $ranges{$name} } ) {
#merge if start or end is 'within' this range.
if (
( $start_range >= $range->{start} and $start_range <= $range->{end} )
or
( $end_range >= $range->{start} and $end_range <= $range->{end} )
)
{
## then the start or end is within the existing range, so add to it:
if ( $end_range > $range->{end} ) {
$range->{end} = $end_range;
}
if ( $start_range < $range->{start} ) {
$range->{start} = $start_range;
}
$in_range++;
}
}
#didn't find any matches, so create a new range identity.
if ( not $in_range ) {
push @{ $ranges{$name} }, { start => $start_range, end => $end_range };
}
}
print Dumper \%ranges;
#iterate by sample
foreach my $sample ( sort keys %ranges ) {
#iterate by range (sort by lowest start)
foreach
my $range ( sort { $a->{start} <=> $b->{start} } @{ $ranges{$sample} } )
{
print join "\t", $sample, $range->{start}, $range->{end}, "\n";
}
}
您的数据输出:
SampleA 100 600
SampleA 700 800
SampleA 900 1100
SampleA 1200 1900
SampleB 700 900
SampleB 1000 1800
SampleB 1900 2600
SampleB 3000 3600
虽然这可能不是最有效的算法,因为它会检查 所有 范围 - 但您可能不需要,因为输入数据是有序的 - 您可以只检查 'most recent' 。
由于您的输入经过很好的排序,因此您可以仅使用固定内存对其进行高效过滤。
$_ = <> or exit;
my @sample = split;
while (<>) {
my @newsample = split;
if ($sample[0] ne $newsample[0]
|| $newsample[2] < $sample[1]
|| $sample[2] < $newsample[1]) {
# Unmergeable sample
print "$sample[0]\t$sample[1]\t$sample[2]\n";
@sample = @newsample;
}
elsif ($sample[1] <= $newsample[1] && $newsample[2] <= $sample[2]) {
# @newsample is included in @sample. Nothing to do
}
elsif ($sample[1] <= $newsample[1]) {
# This @newsample raises the upper limit
$sample[2] = $newsample[2];
}
elsif ($newsample[2] <= $sample[2]) {
# This @newsample lowers the lower limit.
$sample[1] = $newsample[1];
}
else {
# This @newsample moves both limits
@sample = @newsample;
}
}
# Output the last sample
print "$sample[0]\t$sample[1]\t$sample[2]\n";