awk uniq and longest string of a column as index

I need to filter ~70 million sequence rows and would like to use awk with these conditions:
1) use the longest string of each pattern in column 2 as the index, ignoring any substring of it;
2) keep only the unique patterns that remain after 1);
3) print the whole row.

input:

1 ABCDEFGHI longest_sequence1
2  ABCDEFGH substring_a
3    CDEFG  substring_b
4   ACBDEFGH longest_sequence2_# Note_the order ACB
5   ACBDEFG substring_c
6   ABCDE substring_d
7   ADBCE longest_sequence3_# Note the order ADB
8   ADBC substring_e
9   ABC substring_f
10   DBC substring_g

output:

1 ABCDEFGHI longest_sequence1
4   ACBDEFGH longest_sequence2_# Note_the order ACB
7   ADBCE longest_sequence3_# Note the order ADB

I first picked out only the unique patterns in column 2:

awk '!x[$2]++' infile > temp.file

and the file shrank to fewer than ~5 million rows. I am not sure this is doable with awk, and I need some expertise for the second step: picking the longest of each pattern.
Thanks a lot in advance!

Yifang

awk isn't the same everywhere; some implementations have much larger line-length limits than others. What's your system?

Linux 3.2.0-3-amd64 #1 SMP Thu Jun 28 09:07:26 UTC 2012 x86_64 GNU/Linux

I tried a Perl script, but did not get what I wanted; the output is actually empty. Can anybody help me with my code?

#!/usr/bin/perl
# Print the longest string of each type, ignoring any substrings

use strict;
use warnings;

my $infile  = $ARGV[0];
my $outfile = $ARGV[1];
my @DB;

open(my $in, '<', $infile) or die "Cannot open the input file: $!\n";

LINE: while (my $line = <$in>) {
    chomp $line;
    foreach my $member (@DB) {
        # Skip this line if it is a substring of a string already kept
        next LINE if index($member, $line) >= 0;
    }
    # Drop kept strings that are substrings of this line, then keep it
    @DB = grep { index($line, $_) < 0 } @DB;
    push @DB, $line;
}
close($in);

open(my $out, '>', $outfile) or die "Cannot open the output file: $!\n";

foreach my $ID (@DB) {
    print $out "$ID\n";
}

close($out);

infile.txt:

ABCDEFGHI
ABCDEFGH
CDEFG
ACBDEFGH
ACBDEFG
ABCDE
ADBCE
ADBC
ABC
DBC

And I am expecting this output:

ABCDEFGHI
ACBDEFGH
ADBCE

Thanks again!

Something like this?

awk '
$3 ~ /^longest_sequence/ {
  if (x[$2] == 0) { print }
  x[$2]++
}
' file > output

I mean based on the second column, not the third. I used the third column here only for descriptive comments; it is not actually there in my real data. Thanks for your input, though.
yt

I don't quite follow how you select the 'longest' sequence: you select some, but not the others....
Here's my take (based on your most recent 1-column sample file): awk -f yi.awk myFile
yi.awk:

BEGIN { ord_init()}
function ord_init(  i,t) {
  for (i=0;i<=255;i++) {
    t=sprintf("%c",i)
    _ord[t]=i
  }
}
function norm(str,   i,n)
{
   for(i=1;i<=length(str);i++)
     n+=_ord[sprintf("%c", substr(str,i,1))]
   return(n)

}
{
  _n=norm($1)
   la[_n]=$1
   lc[_n]++
}
END {
  for(i in la)
      if (lc[i]>1)
        print la[i]
}

Thanks vgersh99!
The key point is whether the current line is a substring of any of the lines that have already been read. It is a kind of recursive comparison.
What I have in mind is:

read in a line;
compare the current line to the old ones;
if it is new, remember it;
if it is longer than any string in memory (i.e. any member of the memory is a substring of the current line), replace the old one with the current line;
if it is a substring of any string in memory, ignore the current one;

As awk processes one line at a time, I thought it would be a good fit for this problem.
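
The steps listed above could be sketched in awk roughly as follows. This is only a minimal sketch under some assumptions: the shell function name keep_longest and the array name kept are made up for illustration, the whole line is compared (as in the 1-column sample), and blank lines are skipped.

```shell
# keep_longest is a hypothetical helper name, not something from this thread.
keep_longest() {
  awk '
  $0 == "" { next }                 # assumption: blank lines carry no sequence
  {
    # Ignore the line if it is a substring of (or equal to) a kept string.
    for (k = 1; k <= n; k++)
      if (index(kept[k], $0)) next
    # Otherwise drop every kept string that is a substring of this line,
    # compacting the list in place ...
    m = 0
    for (k = 1; k <= n; k++)
      if (!index($0, kept[k])) kept[++m] = kept[k]
    # ... and remember the current line.
    kept[++m] = $0
    n = m
  }
  END { for (k = 1; k <= n; k++) print kept[k] }
  ' "$@"
}
```

Usage would be something like: keep_longest infile.txt > outfile.txt. Every line is compared against every string kept so far, so the run time grows with the number of surviving strings; with millions of rows this may be slow.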

If it is new, remember it;

may not be accurate. Each line is certainly a unique string, but it can be a substring or a "parent" string of another.
Thanks a lot!
yi

Still a bit vague, but this gets your desired output; it is probably not the most efficient approach given the amount of data....
awk -f yi.awk myFile
yi.awk:

{a[$0]}
END {
  for (i in a)
    for (j in a)
      if (length(i) > length(j) && i ~ j)
        delete a[j]

  for (i in a)
    print i
}

Can I ask why i<=255 is needed?

huh? I don't see any mention of '255' in my most recent posting.
The mention of '255' was in the post where I didn't quite understand what you're after - try the most recent post/solution.

It's probably a good idea to avoid regular expressions. If the real data can contain regular expression metacharacters, they could lead to an erroneous result. Even if the data is strictly alphabetical (as in the sample data), it might be a little bit faster to just use index(i,j).
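
As a rough illustration, the nested-loop script earlier in the thread could be written with index() along these lines. The function name longest_by_index and the array names seen and drop are made up for this sketch, and marking substrings in a separate array instead of deleting during the loop is a cautious choice of this sketch, not part of the original script.

```shell
# longest_by_index is a hypothetical helper name, not something from this thread.
longest_by_index() {
  awk '
  { seen[$0] }                      # remember each distinct line as an array key
  END {
    for (i in seen)
      for (j in seen)
        # j is a proper substring of i: literal comparison, no regex involved
        if (length(i) > length(j) && index(i, j))
          drop[j]
    for (i in seen)
      if (!(i in drop))
        print i                     # output order is unspecified
  }
  ' "$@"
}
```

With two made-up lines such as ABCD and A.C, the test i ~ j would treat the dot as "any character" and discard A.C, while index() compares literally and keeps both.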

Regards,
Alister

nice 'nit-picking' - good idea! ;)
Thanks