I have a challenge: filtering ~70 million sequence rows, and I want to use awk with these conditions:
1) keep the longest string of each pattern in column 2, ignoring any substrings, as the index;
2) keep all the unique patterns after 1);
3) print the whole row;
1 ABCDEFGHI longest_sequence1
4 ACBDEFGH longest_sequence2 # note the order ACB
7 ADBCE longest_sequence3 # note the order ADB
I first pick up only the unique patterns of column 2 (the program has to be quoted so the shell leaves it alone):
awk '!x[$2]++' infile > temp.file
and the file shrank to fewer than ~5 million rows. I'm not sure the second step is doable with awk, and I need some expertise to pick out the longest string of each pattern.
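For reference, here is that first step on a toy three-row input (rows of my own invention, not the real data); note the awk program must be single-quoted so the shell does not mangle `!` or `$2`:

```shell
# Column 2 repeats "ABC" on the second row, so !x[$2]++ keeps
# only the first row seen for each column-2 value.
printf '%s\n' '1 ABC a' '2 ABC b' '3 ABD c' | awk '!x[$2]++'
# prints:
# 1 ABC a
# 3 ABD c
```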
Thanks a lot in advance!
I tried a perl script but did not get what I want, actually an empty output. Can anybody help me with my code?
#!/usr/bin/perl
# This script is meant to print the longest string of each type, ignoring any substrings
use strict;
use warnings;

my $infile  = $ARGV[0];
my $outfile = $ARGV[1];
my @DB;

open(INFILE, "<$infile") or die "Cannot open the input file $!\n";
while (<INFILE>) {
    chomp $_;
    foreach my $member (@DB) {
        if (index($member, $_) >= 0) {
            next;
        } else {
            push(@DB, $_);
        }
    }
}
close(INFILE);

open(OUTFILE, ">$outfile") or die "Cannot open the output file $!\n";
foreach my $ID (@DB) {
    print OUTFILE "$ID\n";
}
close(OUTFILE);
I mean based on the second column, not the third, which I only used here for comment descriptions and which is not actually there in my real data. Thanks for your input, though.
yt
I don't quite follow how you select the 'longest' sequences: you select some of them, but not the others....
Here's my take (based on your most recent 1-column sample file): awk -f yi.awk myFile
yi.awk:
BEGIN { ord_init() }

function ord_init(   i, t) {
    for (i = 0; i <= 255; i++) {
        t = sprintf("%c", i)
        _ord[t] = i
    }
}

# sum of the character codes of str, used as a signature for the string
function norm(str,   i, n)
{
    for (i = 1; i <= length(str); i++)
        n += _ord[sprintf("%c", substr(str, i, 1))]
    return(n)
}

{
    _n = norm($1)
    la[_n] = $1
    lc[_n]++
}

END {
    for (i in la)
        if (lc[i] > 1)
            print la[i]
}
Thanks vgersh99!
The key point is testing the current line as a substring against any of the lines that have been read so far, a kind of recursive comparison.
What's in my mind is:
read in line;
compare current line to the old ones;
If it is new, remember it;
If it is longer than anything in memory (i.e. some member of the memory is a substring of the current line), replace the old one with the current line;
if it is a substring of any member of the memory, ignore the current one.
Since awk processes one line at a time, I thought it would be a good fit for this problem.
If it is new, remember it;
may not be accurate: each line is for sure a unique string, but it can be a substring/"parent" string of another one.
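The steps above could be sketched in awk roughly as follows (my own sketch, assuming one sequence per line; `keep` is an array name I made up, and the pairwise scan is quadratic, so it would need the unique-pattern pre-filter or further grouping to be practical on 5 million lines):

```shell
# Keep only strings that are not substrings of another kept string.
printf '%s\n' ABCDE ABCDEFGHI AXY |
awk '
{
    drop = 0
    for (s in keep) {
        if (index(s, $0) > 0) { drop = 1; break }  # current line is a substring of a kept one: ignore it
        if (index($0, s) > 0) delete keep[s]       # a kept string is a substring of the current line: replace it
    }
    if (!drop) keep[$0] = 1
}
END { for (s in keep) print s }
' | sort
# prints:
# ABCDEFGHI
# AXY
```

Deleting the element currently visited inside `for (s in keep)` is permitted in gawk; for a strictly POSIX awk you would collect the doomed indices first and delete them after the loop.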
Thanks a lot!
yi
huh? I don't see any mention of '255' in my most recent posting.
The mention of '255' was in the post where I didn't quite understand what you're after - try the most recent post/solution.
It's probably a good idea to avoid regular expressions. If the real data can contain regular expression metacharacters, they could lead to an erroneous result. Even if the data is strictly alphabetical (as in the sample data), it might be a little bit faster to just use index().
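To illustrate the difference (with toy strings of my own): `~` treats its right-hand side as a regex, while index() does a literal comparison, so a metacharacter like `.` behaves differently:

```shell
# "." in "A.C" matches any character under ~, but index() looks
# for a literal "A.C", which "ABC" does not contain.
echo 'ABC' | awk '{ print ($0 ~ "A.C"), index($0, "A.C") }'
# prints: 1 0
```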