Comparison of 2 files using Unix or awk

Diya123 · July 12, 2011, 1:22pm

Hello,

I have 2 files that I want to compare in a specific way.

file1:

A_1200_1250
A_1251_1300
B_1301_1350
B_1351_1400
B_1401_1450
C_1451_1500

and so on...

file2:

 1210  1305  1260  1295
1400 1500 1450  1495

The script should take the number "1200" from A_1200_1250 in file1 and check whether it falls between column 1 and column 3, or between column 2 and column 4, of a line in file2.

If it falls between column 1 and column 3 of file2, assign it positive.
If it falls between column 2 and column 4 of file2, assign it negative.
If neither of the above conditions is satisfied, assign neutral.

output

A_1200_1250 neutral
A_1251_1300 positive
B_1301_1350 negative
B_1351_1400 neutral
B_1401_1450 positive
C_1451_1500 neutral

Any help or suggestions would be greatly appreciated.

Thanks,

Shell_Life · July 12, 2011, 1:37pm

Your sample ranges overlap:

1210  1305  1260  1295
1400 1500 1450  1495

If you search for 1261, it is in the first range (1210-1305) and also in the second (1260-1295).

Diya123 · July 12, 2011, 1:43pm

Hi,

You only have to look between columns 1 and 3, and between columns 2 and 4.

Since 1251 lies between columns 1 and 3 (1210-1260), we assign positive.
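
To make the column pairing concrete, here is a minimal sketch of the rule for one number against one line of file2. The function name classify_one is made up for illustration. Putting the smaller value of each pair first is an assumption, based on the sample where column 2 is larger than column 4.

```shell
#!/bin/sh
# Sketch only: classify_one is a hypothetical helper, not part of any tool.
# Usage: classify_one NUMBER COL1 COL2 COL3 COL4
classify_one() {
    awk -v n="$1" -v c1="$2" -v c2="$3" -v c3="$4" -v c4="$5" 'BEGIN {
        # Force numeric (not string) comparison
        n += 0; c1 += 0; c2 += 0; c3 += 0; c4 += 0
        # Assumption: order each pair so the lower bound comes first
        plo = (c1 < c3) ? c1 : c3;  phi = (c1 < c3) ? c3 : c1
        nlo = (c2 < c4) ? c2 : c4;  nhi = (c2 < c4) ? c4 : c2
        if (n >= plo && n <= phi)      print "positive"
        else if (n >= nlo && n <= nhi) print "negative"
        else                           print "neutral"
    }'
}

classify_one 1251 1210 1305 1260 1295   # positive: inside 1210-1260
classify_one 1301 1210 1305 1260 1295   # negative: inside 1295-1305
classify_one 1261 1210 1305 1260 1295   # neutral: inside neither pair
```

Under this reading 1261 is neutral, because the ranges are built from columns 1/3 and 2/4 rather than from columns 1/2 and 3/4.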

neutronscott · July 12, 2011, 2:16pm

Is file2 always just 2 lines?

Diya123 · July 12, 2011, 2:19pm

No, it has 27000 lines, and file1 has 111000 lines.

Shell_Life · July 12, 2011, 2:45pm

#!/usr/bin/ksh
while read mLine; do
  mNbr=$(echo ${mLine} | sed 's/.*_\(.*\)_.*/\1/')
  mFound='N'
  # Columns 1 and 3 bound the positive range; columns 4 and 2 bound the negative one
  while read mFrom1 mTo2 mTo1 mFrom2; do
    if [[ ${mNbr} -ge ${mFrom1} && ${mNbr} -le ${mTo1} ]]; then
      mFound='Y'
      echo ${mLine} "positive"
      break
    else
      if [[ ${mNbr} -ge ${mFrom2} && ${mNbr} -le ${mTo2} ]]; then
        mFound='Y'
        echo ${mLine} "negative"
        break
      fi
    fi
  done < Range_File
  if [[ ${mFound} = 'N' ]]; then
    echo ${mLine} "neutral"
  fi
done < Search_File

neutronscott · July 12, 2011, 2:48pm

That's a lot of reading. Try nawk with arrays:

#!/usr/bin/nawk -f
NR == FNR {
        range[$1, $3] = "positive";
        range[$4, $2] = "negative";
}
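NR != FNR {
        # Split explicitly: assigning FS here would only take effect on the next record
        split($0, part, "_")
        n = part[2] + 0
        found=0
        for (comb in range) {
                split(comb, key, SUBSEP)
                sign = range[comb]
                if ((n >= key[1] + 0) && (n <= key[2] + 0))
                        { found=1; break }
        }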
        if (found)
                printf("%s %s\n", $0, sign);
        else
                printf("%s neutral\n", $0);
}

Note that 'file2' is passed as the first input to establish the ranges.

[mute@sunny ~]$ ./range.sh file2 file1
A_1200_1250 neutral
A_1251_1300 positive
B_1301_1350 negative
B_1351_1400 neutral
B_1401_1450 positive
C_1451_1500 neutral

Diya123 · July 12, 2011, 7:24pm

Thank you so much. Both programs run well on a small set. When I try to run them on the 111000 rows, it takes more than 2 hours, and it is still running.

Any suggestions on how to speed it up?

Thanks,

Diya

neutronscott · July 12, 2011, 8:04pm

The nawk version is, too? Hmm. You could probably do it in C.
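
Short of C, one possible speed-up is to stop scanning every range for every line of file1. The sketch below sorts the ranges once, keeps a running maximum of the upper bounds, and answers each lookup with a binary search, so the work grows roughly with (lines of file1) x log(lines of file2) instead of their product. It rests on assumptions not stated in the thread: the name classify_fast and the P/N tags are made up, each pair of columns is ordered so the smaller value comes first, and a number counts as positive if any positive range contains it, as negative if only a negative range does.

```shell
#!/bin/sh
# Sketch only: classify_fast and its variable names are illustrative.
# Usage: classify_fast RANGE_FILE SEARCH_FILE
#
# Stage 1 (awk):  emit one tagged interval per column pair, lower bound first;
#                 P = columns 1 and 3 (positive), N = columns 2 and 4 (negative).
# Stage 2 (sort): order by tag, then numerically by lower bound.
# Stage 3 (awk):  load the sorted intervals from stdin ("-"), then classify the
#                 search file; the phase=2 operand marks the switch of input.
classify_fast() {
    awk 'NF >= 4 {
        lo = $1; hi = $3; if (lo + 0 > hi + 0) { t = lo; lo = hi; hi = t }
        print "P", lo, hi
        lo = $2; hi = $4; if (lo + 0 > hi + 0) { t = lo; lo = hi; hi = t }
        print "N", lo, hi
    }' "$1" |
    sort -k1,1 -k2,2n |
    awk '
    # Return 1 if x lies inside any interval stored under tag.
    function covered(tag, x,    lo, hi, mid) {
        lo = 1; hi = cnt[tag] + 0
        while (lo <= hi) {          # find the last interval starting at or before x
            mid = int((lo + hi) / 2)
            if (start[tag, mid] <= x) lo = mid + 1
            else hi = mid - 1
        }
        # maxend[tag, hi] is the largest upper bound among intervals 1..hi
        return (hi >= 1 && maxend[tag, hi] >= x)
    }
    phase != 2 {                    # interval lines: tag, lower bound, upper bound
        i = ++cnt[$1]
        start[$1, i] = $2 + 0
        maxend[$1, i] = $3 + 0
        if (i > 1 && maxend[$1, i - 1] > maxend[$1, i])
            maxend[$1, i] = maxend[$1, i - 1]
        next
    }
    NF == 0 { next }                # skip blank lines in the search file
    {
        split($0, part, "_")
        x = part[2] + 0
        if (covered("P", x))      label = "positive"
        else if (covered("N", x)) label = "negative"
        else                      label = "neutral"
        print $0, label
    }' - phase=2 "$2"
}
```

Called as classify_fast file2 file1, it prints each line of file1 followed by its label. If a number can fall in both a positive and a negative range and the order of lines in file2 should decide which one wins, this sketch would need adjusting.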