polsum
1
Hi, I have a file like this. I need to eliminate the lines whose first-column value is repeated 10 times.
13 18 1 + chromosome 1, 122638287 AGAGTATGGTCGCGGTTG
13 18 1 + chromosome 1, 128904080 AGAGTATGGTCGCGGTTG
13 18 1 - chromosome 14, 13627938 CAACCGCGACCATACTCT
13 18 1 + chromosome 1, 187172197 AGAGTATGGTCGCGGTTG
13 18 1 - chromosome X, 38407155 CAACCGCGACCATACTCT
13 18 1 + chromosome 9, 13503259 AGAGTATGGTCGCGGTTG
13 18 1 - chromosome 2, 105480832 CAACCGCGACCATACTCT
13 18 1 + chromosome 9, 49045535 AGAGTATGGTCGCGGTTG
13 18 1 + chromosome 1, 178729626 AGAGTATGGTCGCGGTTG
13 18 1 - chromosome X, 55081462 CAACCGCGACCATACTCT
9 17 2 + chromosome 10, 101398385 GCCAGTTCTACAGTCCG
9 17 2 - chromosome 3, 103818009 CGGACTGTAGAACTGGC
9 17 2 - chromosome 16, 94552245 CGGACTGTAGAACTGGC
4 18 1 - chromosome 18, 70056996 TACCCAACAACACATAGT
The value 13 in the first column is repeated 10 times in consecutive lines. I need to eliminate all of those lines from the output.
So the desired output will be:
9 17 2 + chromosome 10, 101398385 GCCAGTTCTACAGTCCG
9 17 2 - chromosome 3, 103818009 CGGACTGTAGAACTGGC
9 17 2 - chromosome 16, 94552245 CGGACTGTAGAACTGGC
4 18 1 - chromosome 18, 70056996 TACCCAACAACACATAGT
Thank you very much in advance. If possible, code in Perl would be much appreciated.
Hi,
Try this code,
#! /usr/local/bin/perl
open(FILE,"<File1") or die("unable to open file");
my @mContent = <FILE>;
my %mFinal = ();
foreach ( @mContent )
{
my $mLine = $_;
chomp ( $mLine );
my $mField = (split(/ /,$mLine,999))[0];
$mFinal{$mField}{"count"}=$mFinal{$mField}{"count"}+1;
$mFinal{$mField}{"content"}=$mLine;
}
foreach my $mField ( keys %mFinal )
{
my $mCount = $mFinal{$mField}{"count"};
if ( $mCount != 10 )
{
print "$mFinal{$mField}{'content'}\n";
}
}
Cheers,
Ranga:)
polsum
3
Thanks for the reply. Although it does NOT print the repeated values, it also does not print all of the remaining values.
The output was:
4 18 1 - chromosome 18, 70056996 TACCCAACAACACATAGT
9 17 2 - chromosome 16, 94552245 CGGACTGTAGAACTGGC
birei
4
Hi polsum,
Here is another Perl solution:
$ cat File1
13 18 1 + chromosome 1, 122638287 AGAGTATGGTCGCGGTTG
13 18 1 + chromosome 1, 128904080 AGAGTATGGTCGCGGTTG
13 18 1 - chromosome 14, 13627938 CAACCGCGACCATACTCT
13 18 1 + chromosome 1, 187172197 AGAGTATGGTCGCGGTTG
13 18 1 - chromosome X, 38407155 CAACCGCGACCATACTCT
13 18 1 + chromosome 9, 13503259 AGAGTATGGTCGCGGTTG
13 18 1 - chromosome 2, 105480832 CAACCGCGACCATACTCT
13 18 1 + chromosome 9, 49045535 AGAGTATGGTCGCGGTTG
13 18 1 + chromosome 1, 178729626 AGAGTATGGTCGCGGTTG
13 18 1 - chromosome X, 55081462 CAACCGCGACCATACTCT
9 17 2 + chromosome 10, 101398385 GCCAGTTCTACAGTCCG
9 17 2 - chromosome 3, 103818009 CGGACTGTAGAACTGGC
9 17 2 - chromosome 16, 94552245 CGGACTGTAGAACTGGC
4 18 1 - chromosome 18, 70056996 TACCCAACAACACATAGT
$ cat polsum.pl
use warnings;
use strict;
@ARGV == 1 or die qq[Usage: perl $0 input-file\n];
my ($number, @block_lines, $prev, @f);
while ( <> ) {
next if /\A\s*\z/;
chomp;
@f = split;
if ( $. == 1 ) {
++$number;
push @block_lines, $_;
next;
}
if ( $prev == $f[0] ) {
++$number;
}
else {
if ( $number != 10 ) {
printf "%s\n", join qq[\n], @block_lines;
}
$number = 1;
@block_lines = ();
}
push @block_lines, $_;
}
continue {
$prev = $f[0];
if ( eof() && $number != 10 ) {
printf "%s\n", join qq[\n], @block_lines;
}
}
$ perl polsum.pl File1
9 17 2 + chromosome 10, 101398385 GCCAGTTCTACAGTCCG
9 17 2 - chromosome 3, 103818009 CGGACTGTAGAACTGGC
9 17 2 - chromosome 16, 94552245 CGGACTGTAGAACTGGC
4 18 1 - chromosome 18, 70056996 TACCCAACAACACATAGT
Regards,
Birei
---------- Post updated at 02:07 ---------- Previous update was at 01:59 ----------
rangarasan's code also works for me with the following changes:
#! /usr/local/bin/perl
open(FILE,"<File1") or die("unable to open file");
my @mContent = <FILE>;
my %mFinal = ();
foreach ( @mContent )
{
my $mLine = $_;
# chomp ( $mLine );
my $mField = (split(/ /,$mLine,999))[0];
$mFinal{$mField}{"count"}=$mFinal{$mField}{"count"}+1;
$mFinal{$mField}{"content"}.=$mLine; # '.' for concatenate strings.
}
foreach my $mField ( keys %mFinal )
{
my $mCount = $mFinal{$mField}{"count"};
if ( $mCount != 10 )
{
# print "$mFinal{$mField}{'content'}\n";
print "$mFinal{$mField}{'content'}";
}
}
Regards,
Birei
yazu
5
Assuming you don't want lines when the first field repeats N times:
awk -v N=10 '
$1 != prev {
    if (c != N) for (i=1; i<=c; i++) print a[i]
    c = 0
}
{
    a[++c] = $0;
    prev = $1;
}
END {
    if (c != N) for (i=1; i<=c; i++) print a[i]
}' INPUTFILE
polsum
6
Thank you very much, everyone. After a few hours of head banging, I came up with my own code, which seems to be working fine. Yay!
#! /usr/local/bin/perl
use warnings;
use strict;
my %hash;
my $line;
my %dup;
while (<>) {
chomp;
my($x, ) = split;
$line = $_;
$hash{$line} = "\t$x";
}
foreach $line(keys %hash) {
$dup{$hash{$line}}++;
}
foreach $line(keys %hash) {
if ($dup{$hash{$line}} != 10) {
print "$line\n";
}
}
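A note on the script above: it keys %hash by the whole line, so identical lines collapse into one entry, and keys %hash returns the lines in no particular order. It also counts every occurrence of a first-column value, not only consecutive ones. If the input order matters, a two-pass awk run is one possible sketch that uses the same counting rule. The function name filter_by_count is made up for illustration, and the input is assumed to be a regular file, because it is read twice.

```shell
# filter_by_count N FILE  (hypothetical helper name, not from the thread)
# Pass 1 (NR == FNR) tallies how often each first-column value occurs in FILE.
# Pass 2 prints, in input order, every line whose value does not occur exactly N times.
filter_by_count() {
    awk -v N="$1" '
        NR == FNR { count[$1]++; next }   # first pass: count column 1
        count[$1] != N                    # second pass: keep all other lines
    ' "$2" "$2"
}
```

It would be called as filter_by_count 10 File1. Unlike the run-based solutions earlier in the thread, this drops a value that occurs 10 times anywhere in the file, even if the lines are not adjacent.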