awk uniq and longest string of a column as index

yifangt · September 6, 2012, 2:24pm

I need to filter ~70 million sequence rows and would like to use awk, with these conditions:
1) keep the longest string of each pattern in column 2 as the index, ignoring any row whose column 2 is a substring of it;
2) keep only the unique patterns that remain after 1);
3) print the whole row.

input:

1 ABCDEFGHI longest_sequence1
2  ABCDEFGH substring_a
3    CDEFG  substring_b
4   ACBDEFGH longest_sequence2_# Note_the order ACB
5   ACBDEFG substring_c
6   ABCDE substring_d
7   ADBCE longest_sequence3_# Note the order ADB
8   ADBC substring_e
9   ABC substring_f
10   DBC substring_g

output:

1 ABCDEFGHI longest_sequence1
4   ACBDEFGH longest_sequence2_# Note_the order ACB
7          ADBCE  longest_sequence3_# Note the order ADB

I first picked out only the unique patterns in column 2:

awk '!x[$2]++' infile > temp.file

and the file shrank to fewer than ~5 million rows. I'm not sure whether this is doable with awk, and I need some expertise for the second step, picking the longest of each pattern.
Thanks a lot in advance!

Yifang
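
For readers following along, the de-duplication step can be sketched as below. The program is single-quoted so the shell does not touch `!` or `$2`. The wrapper name dedup_col2 and the array name seen are made up for illustration; the logic is the usual awk idiom of printing a row only the first time its column 2 value appears.

```shell
#!/bin/sh
# Hypothetical wrapper: print the first row seen for each distinct
# value of column 2, reading from the named files or from stdin.
dedup_col2() {
  # seen[$2]++ is 0 (false) the first time a key appears, so the
  # negation is true exactly once per key and the row is printed.
  awk '!seen[$2]++' "$@"
}
```

Usage would be something like `dedup_col2 infile > temp.file`. Note that this removes only exact duplicates of column 2, not substrings.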

Corona688 · September 6, 2012, 2:45pm

awk isn't the same everywhere, some implementations have much larger line-length limits than others. What's your system?

yifangt · September 6, 2012, 3:31pm

Linux 3.2.0-3-amd64 #1 SMP Thu Jun 28 09:07:26 UTC 2012 x86_64 GNU/Linux

yifangt · September 12, 2012, 4:04pm

I tried a Perl script but did not get what I want; the output is actually empty. Can anybody help me with my code?

#!/usr/bin/perl
#This script is to print the longest string of each type, ignore any substrings

use strict;
use warnings;

my $infile  = $ARGV[0];
my $outfile = $ARGV[1];
my @DB;

open(INFILE, "<$infile") or die "Cannot open the input file $!\n";

while (<INFILE>) {
    chomp $_;
    foreach my $member (@DB) {
        if (index($member, $_) >= 0) {
            next;
        } else {
            push(@DB, $_);
        }
    }
}
close(INFILE);

open (OUTFILE, ">$outfile") or die "Cannot open the output file $!\n";

foreach my $ID (@DB) {
    print OUTFILE "$ID\n";
}

close(OUTFILE);

infile.txt:

ABCDEFGHI
ABCDEFGH
CDEFG
ACBDEFGH
ACBDEFG
ABCDE
ADBCE
ADBC
ABC
DBC

And I am expecting output as:

ABCDEFGHI
ACBDEFGH
ADBCE

Thanks again!

delugeag · September 13, 2012, 3:32am

yifangt:

awk '!x[$2]++' infile > temp.file
and the file shrank to fewer than ~5 million rows. I'm not sure whether this is doable with awk, and I need some expertise for the second step, picking the longest of each pattern.
Thanks a lot in advance!

Yifang

Something like this?

awk '
$3 ~ /^longest_sequence/ {
  if (x[$2] == 0) { print }
  x[$2]++
}
' file > output

yifangt · September 13, 2012, 8:50am

I mean based on the second column, not the third. The third column only holds comments I added here for description; it is not present in my real data. Thanks for your input though.
yt

vgersh99 · September 13, 2012, 9:09am

I don't quite follow how you select the 'longest' sequence where you select some, but not the others....
Here's my take (based on your most recent 1-column sample file): awk -f yi.awk myFile
yi.awk:

BEGIN { ord_init()}
function ord_init(  i,t) {
  for (i=0;i<=255;i++) {
    t=sprintf("%c",i)
    _ord[t]=i
  }
}
function norm(str,   i,n)
{
   for(i=1;i<=length(str);i++)
     n+=_ord[sprintf("%c", substr(str,i,1))]
   return(n)

}
{
  _n=norm($1)
   la[_n]=$1
   lc[_n]++
}
END {
  for(i in la)
      if (lc[i]>1)
        print la[i]
}

yifangt · September 13, 2012, 9:26am

Thanks vgersh99!
The key point is whether the current line is a substring of any of the lines that have already been read. It is a kind of recursive comparison.
What I have in mind is:

read in a line;
compare the current line to the old ones;
If it is new, remember it;
If it is longer than a remembered line (i.e. a member of the memory is a substring of the current line), replace the old one with the current line;
If it is a substring of any remembered line, ignore the current one;

Since awk processes one line at a time, I thought it would be a good fit for this problem.

"If it is new, remember it" may not be accurate. Each line is certainly a unique string, but it can be a substring or "parent" string of another.
Thanks a lot!
yi
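
The single-pass idea described above can be sketched in awk roughly as follows. This is a minimal sketch under stated assumptions, not something tried on 70 million rows: the function name keep_longest and its column argument are made up for illustration, and each row is compared against every row kept so far, so run time grows with the number of survivors.

```shell
#!/bin/sh
# Hypothetical helper: keep only rows whose key column is not a
# substring of another row's key. usage: keep_longest [column] [file...]
keep_longest() {
  col="${1:-1}"
  [ "$#" -gt 0 ] && shift
  awk -v col="$col" '
  {
    key = $col
    # Ignore the row if its key is contained in a key already kept
    # (this also drops exact duplicates).
    for (i = 1; i <= n; i++)
      if ((i in keys) && index(keys[i], key)) next
    # Otherwise forget every kept key that the new key contains.
    for (i = 1; i <= n; i++)
      if ((i in keys) && index(key, keys[i])) { delete keys[i]; delete rows[i] }
    n++
    keys[n] = key
    rows[n] = $0
  }
  END {
    # Print the surviving rows in input order.
    for (i = 1; i <= n; i++)
      if (i in rows) print rows[i]
  }' "$@"
}
```

For the 3-column sample, `keep_longest 2 infile` would compare on column 2 and print whole rows; for the 1-column file, `keep_longest 1 infile.txt`. index() is used instead of a regex match so the data is always treated as literal text.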

vgersh99 · September 13, 2012, 10:06am

Still a bit vague, but this gets your desired output. It is probably not the most efficient approach given the amount of data....
awk -f yi.awk myFile
yi.awk:

{a[$0]}
END {
  for (i in a)
    for (j in a)
      if (length(i) > length(j) && i ~ j)
        delete a[j]

  for (i in a)
    print i
}

yifangt · September 13, 2012, 11:28am

Can I ask why i<=255 is needed?

vgersh99 · September 13, 2012, 11:32am

huh? I don't see any mention of '255' in my most recent posting.
The mention of '255' was in the post where I didn't quite understand what you're after - try the most recent post/solution.

alister · September 13, 2012, 11:52am

It's probably a good idea to avoid regular expressions. If the real data can contain regular expression metacharacters, they could lead to an erroneous result. Even if the data is strictly alphabetical (as in the sample data), it might be a little bit faster to just use index(i,j) .

Regards,
Alister
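
The suggestion above might look like the sketch below: the same pairwise comparison, with the regex match i ~ j swapped for index(i, j). The function name longest_only and the arrays seen and drop are assumptions for illustration. Marking entries in drop rather than deleting them inside the loop is just a cautious choice that avoids modifying the array while iterating over it.

```shell
#!/bin/sh
# Hypothetical helper: print every distinct input line that is not a
# substring of some longer input line. Output order is unspecified.
longest_only() {
  awk '
  { seen[$0] }                    # remember each distinct line
  END {
    for (i in seen)
      for (j in seen)
        # index() does a literal substring search, so characters such
        # as . * [ in the data are not treated as regex operators.
        if (length(i) > length(j) && index(i, j))
          drop[j]                 # j is contained in the longer i
    for (i in seen)
      if (!(i in drop)) print i
  }' "$@"
}
```

With a regex match, a line such as A.C would count as "contained" in ABCD and be dropped; with index() both lines survive. Pipe the result through sort if a stable order is needed.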

vgersh99 · September 13, 2012, 12:09pm

nice 'nit-picking' - good idea!
Thanks