Once line 2 of a record (the nucleotide sequence) matches exactly, the rest of the duplicates are removed.
Hopefully an expert can help me solve this problem.
Thanks a lot.
Hi, thanks for your suggestion.
Sad to say, it doesn't work properly.
When I run the Perl script, it keeps printing a long list of:
"Use of uninitialized value in hash element at unique.pl line 23, <IN>line 38928"
Patrick, I would like to see more of the errors you are getting. What I think may be happening is that on some lines your nucleotide values may be missing...
Try this code; it is perhaps easier to cut and paste. It should do exactly the same thing as the first code (I was interested in trying to get this code into one line =). See if you get the same errors...
#! /usr/bin/perl
use strict;
use warnings;
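# Open the input file; report the OS error if it cannot be opened.
open( my $in, '<', 'data' ) or die "Cannot open data: $!\n";
# Slurp the whole file into one string (this needs enough memory to hold the file).
local $/;
my $str = <$in>;
close $in;
my %seen;
# Match one four-line record at a time; $2 is the 30-letter sequence on line 2.
while ( $str =~ /(.+\n([A-Z]{30})\n.+\n.+\n)/g ) {
    # Print the record only the first time its sequence is seen.
    print $1 unless $seen{$2}++;
}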
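To test the guess that some records lack a nucleotide line, a small check like the one below might help before rerunning the script. It is only a sketch: it assumes every record is exactly four lines with the sequence on line 2 in uppercase letters, and the name find_bad_records and the file name check.pl are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the 1-based numbers of records whose sequence line (line 2) is
# missing, empty, or contains anything other than uppercase letters.
# Assumption: each record is exactly four lines long.
sub find_bad_records {
    my ($fh) = @_;
    my @bad;
    my $record = 0;
    while ( defined( my $header = <$fh> ) ) {
        $record++;
        # Read the remaining three lines of this record.
        my @rest = map { scalar <$fh> } 1 .. 3;
        if ( grep { !defined } @rest ) {
            # The file ended in the middle of a record.
            push @bad, $record;
            last;
        }
        my $seq = $rest[0];
        chomp $seq;
        push @bad, $record unless $seq =~ /^[A-Z]+$/;
    }
    return @bad;
}

# Hypothetical usage: perl check.pl data
if (@ARGV) {
    open( my $in, '<', $ARGV[0] ) or die "Cannot open $ARGV[0]: $!\n";
    print "Suspicious record: $_\n" for find_bad_records($in);
    close $in;
}
```

If this prints nothing, the records are at least well formed, and the warnings probably come from somewhere else.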
Thanks a lot, deindorfer.
I am trying your Perl script now.
It seems to take a long time to run?
My input file has more than 7,000,000 Illumina reads.
It is still running now.
Hopefully this script works.
Thanks again, deindorfer.
For files with more than 7 million rows, it is preferable not to hold all the values in memory: in production, if not enough memory is allocated, the script will get killed.
At this stage, I would prefer to take only the first occurrence of each unique nucleotide sequence as my "unique" read, and keep all the contents of that first record.
Sorry if my question confused you.
Actually, I consider a read a duplicate based only on its nucleotide sequence (line 2 contents), regardless of its header or quality score.
But at this stage, I will take the first occurrence of each unique nucleotide sequence (line 2 contents),
together with its header and quality score, as my unique read.
I will treat any later read whose nucleotide sequence (line 2 contents) is the same as that first occurrence as a duplicate and discard it.
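The rule described above (keep the first record for each line 2 sequence, drop later ones) could be sketched as follows, reading one four-line record at a time instead of slurping the whole file. This is an illustration under assumptions, not any poster's script: the name dedup_records and the file name dedup.pl are hypothetical, and every record is assumed to be exactly four lines. The hash of seen sequences still grows with the number of unique reads, so memory use is lower than holding the whole file but not constant.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Copy four-line records from $in to $out, keeping only the first record
# seen for each distinct sequence (line 2). Returns the number of records kept.
# Assumption: each record is exactly four lines long.
sub dedup_records {
    my ( $in, $out ) = @_;
    my %seen;
    my $kept = 0;
    while ( defined( my $header = <$in> ) ) {
        my $seq  = <$in>;
        my $plus = <$in>;
        my $qual = <$in>;
        # Stop at an incomplete trailing record rather than printing it.
        last unless defined($seq) && defined($plus) && defined($qual);
        # Skip the record if this sequence has been seen before.
        next if $seen{$seq}++;
        print {$out} $header, $seq, $plus, $qual;
        $kept++;
    }
    return $kept;
}

# Hypothetical usage: perl dedup.pl data > data.unique
if (@ARGV) {
    open( my $in, '<', $ARGV[0] ) or die "Cannot open $ARGV[0]: $!\n";
    dedup_records( $in, \*STDOUT );
    close $in;
}
```

Because the header and quality lines are printed exactly as read, the first occurrence keeps all of its original contents.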
Thanks a lot for solving my troubles.
If you have any problems or questions, kindly ask me anytime.
---------- Post updated at 01:57 AM ---------- Previous update was at 01:31 AM ----------
Hi daptal,
Sad to say, your Perl script doesn't give me my desired output.
It gives me something like:
Thanks a lot, skmdu.
Your Perl script works perfectly.
---------- Post updated at 02:22 AM ---------- Previous update was at 02:18 AM ----------
Thanks a lot, deindorfer.
Your second Perl script gives exactly the same result as skmdu's Perl script.
Hopefully this is the Perl script I will use to solve my problem.
Thanks again for your advice.