Unix Linux Community

Finding duplicates from positioned substring across lines

gapprasath December 23, 2008, 6:20pm 1

I have millions of records, each containing exactly 50 characters. I need to check the uniqueness of a 4-character substring at a fixed position in each record (the position is known in advance) and report any duplicates that are found.

E.g., data:

AAAA00000000000000XXXX0000 0000000000... up to 50 chars
AAAA00000000000000XXXY0000 0000000000... up to 50 chars
AAAA00000000000000XXXY0000 0000000000... up to 50 chars

output:
Duplicates are found for XXXY.

I'm new to Unix scripting. Can anyone point me in the right direction?

~GAP

jim_mcnamara December 23, 2008, 6:36pm 2

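# Column 19 is where the 4-character key starts in the sample data; adjust it to the real position.
awk '{ arr[substr($0,19,4)]++ }   # count how often each key occurs
      END { for (i in arr) { if (arr[i]>1) {print arr[i], i}}}' inputfile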

summer_cherry December 24, 2008, 4:43am 3

nawk '{
str=substr($0,19,4)
_[str]++
}
END{
  for(i in _)
    if(_[i]>1)
       print "Duplicates are found for " i "."
}' a.txt
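
The awk answers above keep one counter per distinct key in memory. With millions of records, an alternative is to let sort do the work, since sort can spill to temporary files. The following is only a sketch: the function name find_dup_keys and its argument order are made up for illustration, and the key is assumed to start at column 19 as in the sample data.

```shell
# Hypothetical helper (not from the thread): print a message for every key
# that occurs more than once.
#   $1 = input file, $2 = 1-based start column of the key, $3 = key length
find_dup_keys() {
    # cut extracts the key columns, sort groups identical keys together,
    # and uniq -d keeps one copy of each key that appears at least twice.
    cut -c "$2-$(( $2 + $3 - 1 ))" "$1" | sort | uniq -d |
    while IFS= read -r key; do
        printf 'Duplicates are found for %s.\n' "$key"
    done
}

# Usage on three records shaped like the sample (the key starts at column 19).
sample=$(mktemp)
printf 'AAAA%014d%s0000\n' 0 XXXX 0 XXXY 0 XXXY > "$sample"
find_dup_keys "$sample" 19 4
rm -f "$sample"
```

The trade-off is that sort runs in O(n log n) time while the awk array makes a single pass, so the awk versions are likely faster as long as the set of distinct keys fits in memory.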