Illumina reads remove duplicate...

After I using the search tool, I still can't find a solution that was related with my trouble.
My input file:

@HWI-ABC123_30DFGGDA:1:100:3:1234
ACGTAGTACCCGGGTTTTTTTTTAAAAAAA
+HWI-ABC123_30DFGGDA:1:100:3:1234
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
@HWI-ABC555_30DFGGDA:1:100:3:1234
GGGTTTTTTTTTAAAAAAAGGGGGGGGGGG
+HWI-ABC123_30DFGGDA:1:100:3:1234
]]]]]]]]]]hhhhhhhhhhhhhhhhhhhhhhhhhhhh
@HWI-ABC467_30DFGGDA:1:100:3:1234
ACGTAGTACCCGGGTTTTTTTTTAAAAAAA
+HWI-ABC467_30DFGGDA:1:100:3:1234
hhhHHHHhhhhhhhhhhhhhhhhhhhhhhhhhhh
@HWI-ABC889_30DFGGDA:1:100:3:1234
GGGTTTTTTTTTAAAAAAAGGGGGGGGGGG
+HWI-ABC889_30DFGGDA:1:100:3:1234
hhhhhhhhhhhhhhhhhhhhhhhhhhhh]]]]]]]]]]
.
.
.
.
.
.

I got a long list of Illumina reads. My desired output is like this:

@HWI-ABC123_30DFGGDA:1:100:3:1234
ACGTAGTACCCGGGTTTTTTTTTAAAAAAA
+HWI-ABC123_30DFGGDA:1:100:3:1234
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
@HWI-ABC555_30DFGGDA:1:100:3:1234
 GGGTTTTTTTTTAAAAAAAGGGGGGGGGGG
+HWI-ABC123_30DFGGDA:1:100:3:1234
 ]]]]]]]]]]hhhhhhhhhhhhhhhhhhhhhhhhhhhh

Once the line 2 which is nucleotide sequence is exactly match. The rest of the duplicate are removed.
Hopefully can get anybody expert to help me solve with this problem.
Thanks a lot.

#! /usr/bin/perl

use strict;
use warnings;

open ( IN, "data" ) || die "Perl blew up\n";

my $lookup;
my @temp;
my @keep;
my $nuc;

while (<IN>) {

    push @temp, $_;

    if ( /^([A-Z]{30})$/ ) {
       $nuc = $1;
       ++$lookup->{$nuc};
    }

    if ( @temp == 4 ) {
        if ( $lookup->{$nuc} == 1 ) { push @keep, @temp; }
        @temp = (); $nuc = ();

    }

}

print map "$_", @keep;

Hi, thanks for your suggestion.
Sad to said that it can't function well.
When I run the perl script, it keep on mention a long list of :
"Use of uninitialized value in hash element at unique.pl line 23, <IN>line 38928"

Do you have any idea about this problem facing?

Patrick, I would like to see the more of the errors you are getting. What I think may be happening is that maybe on some lines, your nucleotide values may be missing....

Try this code, it is perhaps easier to cut and paste. It should do the exact same thing as the first code. ( I was interested in trying to get this code into one line =). See if you get the same errors...

#! /usr/bin/perl

use strict;
use warnings;

open ( IN, "data" ) || die "Perl blew up\n";
undef $/;

my $str = <IN>;
my $lookup;

while ( $str =~ /(.+\n([A-Z]{30})\n.+\n.+\n)/g ) {

    ++$lookup->{$2};
    print $1 unless $lookup->{$2} > 1;

}

thanks a lot, deindorfer.
I trying your perl script now.
It seem like take a long time to proceed?
My input file got around 7000000++ Illumina reads.
It still running now.
Hopefully this script is worked :slight_smile:
Thanks again, deindorfer.

7 million lines should not be too bad. code #2 is storing all those lines in one variable, but if you are on a production system, you should be ok.

Hi, deindorfer.
Your perl script already finished run.
Unfortunately, the output data is empty:confused:
Why is the reason causing it?!

Base on your sample but not tested on large scale

awk '{a[$2]++}a[$2]==1{$1=RS $1;print}' FS="\n" RS="@" OFS="\n" ORS=""  in_file > out_file

Use gawk, nawk or /usr/xpg4/bin/awk on Solaris.

For files with greater than 7 mill rows it is preferable not to store all the values in the memory as in production if enough memory is not allocated the script will get killed.

Assuming every line looks like this

line1
@HWI-ABC123_30DFGGDA:1:100:3:1234
ACGTAGTACCCGGGTTTTTTTTTAAAAAAA
+HWI-ABC123_30DFGGDA:1:100:3:1234
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
line 2
@HWI-ABC555_30DFGGDA:1:100:3:1234
GGGTTTTTTTTTAAAAAAAGGGGGGGGGGG
+HWI-ABC123_30DFGGDA:1:100:3:1234
]]]]]]]]]]hhhhhhhhhhhhhhhhhhhhhhhhhhhh

Can you clarify if you dont want any line which has the duplicate of just @HWI-ABC123_30DFGGDA or the whole line ?

If its still a requirement leave a note

Hi daptal,

At this stage, I will prefer to just select the info of the first shown unique nucleotide sequence as my "unique" read. Keep all the contents of the first shown nucleotide sequence contents

For example:

@HWI-ABC123_30DFGGDA:1:100:3:1234
ACGTAGTACCCGGGTTTTTTTTTAAAAA  #First shown unique reads
+HWI-ABC123_30DFGGDA:1:100:3:1234
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
@HWI-ABC555_30DFGGDA:1:100:3:1234
GGGTTTTTTTTTAAAAAAAGGGGGGGGG #First shown unique reads
+HWI-ABC555_30DFGGDA:1:100:3:1234
]]]]]]]]]]hhhhhhhhhhhhhhhhhhhhhhhhhh
@HWI-ABC467_30DFGGDA:1:100:3:1234
ACGTAGTACCCGGGTTTTTTTTTAAAAA #Seconds shown reads,discards
+HWI-ABC467_30DFGGDA:1:100:3:1234
hhhHHHHhhhhhhhhhhhhhhhhhhhhhhhhh
@HWI-ABC889_30DFGGDA:1:100:3:1234
GGGTTTTTTTTTAAAAAAAGGGGGGGGG #Seconds shown reads,discards
+HWI-ABC889_30DFGGDA:1:100:3:1234
hhhhhhhhhhhhhhhhhhhhhhhhhhhh]]]]]]]]
.
.
.
.
.
.

I got a long list of Illumina reads. My desired output is like this:

@HWI-ABC123_30DFGGDA:1:100:3:1234
ACGTAGTACCCGGGTTTTTTTTTAAAAA
+HWI-ABC123_30DFGGDA:1:100:3:1234
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
@HWI-ABC555_30DFGGDA:1:100:3:1234
 GGGTTTTTTTTTAAAAAAAGGGGGGGGG
+HWI-ABC555_30DFGGDA:1:100:3:1234
 ]]]]]]]]]]hhhhhhhhhhhhhhhhhhhhhhhhhh

Thanks again for your help :slight_smile:

@patrick87,

Can you give more data on your input and output.

Because I couldn't understand the output. ( I could notice only first 8 lines of input as an output) may be am not intelligent enough to crasp it :frowning:

It would be better to give solution.

#!/usr/bin/perl
use strict;
use warnings;

open my $fh , '<' , 'abc.txt' || die "$!";

my @tmp_arr;
my %hash;
while (my $line = <$fh>){
        chomp $line;
        if (@tmp_arr){
                if ($line =~ m/^@/){
                        unless (exists $hash{$tmp_arr[1]}){
                                $hash{$tmp_arr[1]} = 1;
                                map { print "$_\n";} @tmp_arr;
                        }
                        @tmp_arr = ();
                        push @tmp_arr , $line;
                }
                else {
                        push @tmp_arr , $line;
                }
        }
        else {
                push @tmp_arr , $line;
        }
}
if (@tmp_arr && (exists $hash{$tmp_arr[1]}) ){
        $hash{$tmp_arr[1]} = 1;
        map { print "$_\n";} @tmp_arr;
}
close $fh;

From what i could understand from your posts this might be the solution. Please test it on your test data set before running it on 7 mill rows.

Lemme know if it doesnot work properly or as intended.

HTH,
PL

Dear skmdu,

For example:

@HWI-ABC123_30DFGGDA:1:100:3:1234
ACGTAGTACCCGGGTTTTTTTTTAAAAA  #First shown unique reads
+HWI-ABC123_30DFGGDA:1:100:3:1234
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
@HWI-ABC555_30DFGGDA:1:100:3:1234
GGGTTTTTTTTTAAAAAAAGGGGGGGGG #First shown unique reads
+HWI-ABC123_30DFGGDA:1:100:3:1234
]]]]]]]]]]hhhhhhhhhhhhhhhhhhhhhhhhhh
@HWI-ABC467_30DFGGDA:1:100:3:1234
ACGTAGTACCCGGGTTTTTTTTTAAAAA #Seconds shown reads,discards
+HWI-ABC467_30DFGGDA:1:100:3:1234
hhhHHHHhhhhhhhhhhhhhhhhhhhhhhhhh
@HWI-ABC889_30DFGGDA:1:100:3:1234
GGGTTTTTTTTTAAAAAAAGGGGGGGGG #Seconds shown reads,discards
+HWI-ABC889_30DFGGDA:1:100:3:1234
hhhhhhhhhhhhhhhhhhhhhhhhhhhh]]]]]]]]
@HWI-ABC796_30DFGGDA:1:100:3:1234
 ACGTAGTACCCGGGTTTTTTTTTAAAAA  #Third shown reads
+HWI-ABC796_30DFGGDA:1:100:3:1234
 hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
 @HWI-ABC140_30DFGGDA:1:100:3:1234
 GGGTTTTTTTTTAAAAAAAGGGGGGGGG #Third shown reads
 +HWI-ABC140_30DFGGDA:1:100:3:1234
 ]]]]]]]]]]hhhhhhhhhhhhhhhhhhhhhhhhhh
 .
.
.
.
.

I got a long list of Illumina reads. My desired output is like this:

@HWI-ABC123_30DFGGDA:1:100:3:1234
ACGTAGTACCCGGGTTTTTTTTTAAAAA
+HWI-ABC123_30DFGGDA:1:100:3:1234
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
@HWI-ABC555_30DFGGDA:1:100:3:1234
 GGGTTTTTTTTTAAAAAAAGGGGGGGGG
+HWI-ABC123_30DFGGDA:1:100:3:1234
 ]]]]]]]]]]hhhhhhhhhhhhhhhhhhhhhhhhhh

Sorry if my question make you feel confusing.
Actually I just consider sequence duplicate based on its nucleotide sequence (line2 contents) no related with its header or its quality score.
But at this stage, I will select those first shown unique nucleotide sequence (line2 contents)
and its header and quality score consider as my unique.
I will consider the rest those nucleotide sequence (line2 contents) which same as the first shown nucleotide sequence as duplicated and wanted to discard it.

Thanks a lot for solving my troubles.
If you have any problem or question, kindly ask me anytime.

---------- Post updated at 01:57 AM ---------- Previous update was at 01:31 AM ----------

Hi daptal,
Sad to said that your perl script can't give me my desired output :frowning:
It gives me something like:

@HWI-ABC123_30DFGGDA:1:100:3:1234
ACGTAGTACCCGGGTTTTTTTTTAAAAAAA
+HWI-ABC123_30DFGGDA:1:100:3:1234
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
@HWI-ABC555_30DFGGDA:1:100:3:1234
GGGTTTTTTTTTAAAAAAAGGGGGGGGGGG
+HWI-ABC555_30DFGGDA:1:100:3:1234
]]]]]]]]]]hhhhhhhhhhhhhhhhhhhhhhhhhhhh
@HWI-ABC889_30DFGGDA:1:100:3:1234
GGGTTTTTTTTTAAAAAAAGGGGGGGGGGG
+HWI-ABC889_30DFGGDA:1:100:3:1234
hhhhhhhhhhhhhhhhhhhhhhhhhhhh]]]]]]]]]]

It is not what I desired output :frowning:

#! /usr/bin/perl

while ( <> ) {
        if ( /^@/ ){
                $key = <>;
                $QL= <>;
                $QL.=<>;
        }
        if ( !exists $hash{$key}){
                push(@array,[$key,$_,$QL]);
                $hash{$key}=1;
        }
}
foreach (@array) {
        print $_->[1].$_->[0].$_->[2];
}
perl illum.pl < filename
@HWI-ABC123_30DFGGDA:1:100:3:1234
ACGTAGTACCCGGGTTTTTTTTTAAAAA
+HWI-ABC123_30DFGGDA:1:100:3:1234
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
@HWI-ABC555_30DFGGDA:1:100:3:1234
GGGTTTTTTTTTAAAAAAAGGGGGGGGG
+HWI-ABC123_30DFGGDA:1:100:3:1234
]]]]]]]]]]hhhhhhhhhhhhhhhhhhhhhhhhhh

Try this with your sample file first, if the output seems to be okay, then try the same with 7million rows.

Note: Assumed that Quality score will be two lines always.

Thanks a lot, skmdu.
Your perl script work perfectly :slight_smile:

---------- Post updated at 02:22 AM ---------- Previous update was at 02:18 AM ----------

thanks a lot, deindorfer.
Your second perl script get the exactly result like skmdu's perl script.
Hopefully this is the perl script that I preferred for solving my troubles.
Really thanks again for your advice :slight_smile: