Compare between column in one file

lurak · January 9, 2012, 5:01am

Hi all, can anyone help me. I want to compare values within column in one file using awk.

Below are the sample input file(1st file):

8388014643|20735704|20100713140734|20735505|20100713152501|700111|20110226085427|400501|20100329165139|300046|20100329165139|400400|20100329165139|400000|20100329165139|||||||||||||||||||||||||||
8388040001|20735505|20110916121248|20735604|20110916121248|700333|20110916121248|400400|20110916121248|500510|20110916121248|||||||||||||||||||||||||||||||
8388084999|700333|20101018171805|700666|20100329175101|20735604|20100329175101|400501|20100329175101|300001|20100329175101|500510|20100329175101|||||||||||||||||||||||||||||
83875488100|20735505|20111101130130|20735604|20111011140704|700333|20111011140704|400400|20111011140704|500510|20111011140704|700666|20111011140704|||||||||||||||||||||||||||||
83875426992|400400|20111011141108|400000|20111011141108|500510|20111011141108|20735704|20111011141108|400501|20111011141108|||||||||||||||||||||||||||||||

below are the comparison files (2nd file):

20735704|20735505|20735604|700333|700666|

What I want to get is, if the 1st file column 2,4,6,8,10,..42 (even column under 43) have data from 2nd file so it will output all the data from 1st row + multiple on 2nd file data.

Below are the sample output

8388014643|20735704|20100713140734|20735505|20100713152501|700111|20110226085427|400501|20100329165139|300046|20100329165139|400400|20100329165139|400000|20100329165139|||||||||||||||||||||||||||Multiple on 20735704 20735505|
8388040001|20735505|20110916121248|20735604|20110916121248|700333|20110916121248|400400|20110916121248|500510|20110916121248|||||||||||||||||||||||||||||||Multiple on 20735505 20735604 700333|
8388084999|700333|20101018171805|700666|20100329175101|20735604|20100329175101|400501|20100329175101|300001|20100329175101|500510|20100329175101|||||||||||||||||||||||||||||Multiple on 700333 700666 20735604|
83875488100|20735505|20111101130130|20735604|20111011140704|700333|20111011140704|400400|20111011140704|500510|20111011140704|700666|20111011140704|||||||||||||||||||||||||||||Multiple on 20735505 20735604 700333 700666| 
83875426992|400400|20111011141108|400000|20111011141108|500510|20111011141108|20735704|20111011141108|400501|20111011141108|||||||||||||||||||||||||||||||No Multiple data|

Can you guys help me? really appreciate the help, and FYI the data that I'm handling is around 20,000,000 rows.

birei · January 9, 2012, 6:58am

Hi lurak,

This is my try with awk, but last line gives me one coincidence:

$ cat file1
8388014643|20735704|20100713140734|20735505|20100713152501|700111|20110226085427|400501|20100329165139|300046|20100329165139|400400|20100329165139|400000|20100329165139|||||||||||||||||||||||||||
8388040001|20735505|20110916121248|20735604|20110916121248|700333|20110916121248|400400|20110916121248|500510|20110916121248|||||||||||||||||||||||||||||||
8388084999|700333|20101018171805|700666|20100329175101|20735604|20100329175101|400501|20100329175101|300001|20100329175101|500510|20100329175101|||||||||||||||||||||||||||||
83875488100|20735505|20111101130130|20735604|20111011140704|700333|20111011140704|400400|20111011140704|500510|20111011140704|700666|20111011140704|||||||||||||||||||||||||||||
83875426992|400400|20111011141108|400000|20111011141108|500510|20111011141108|20735704|20111011141108|400501|20111011141108|||||||||||||||||||||||||||||||
$ cat file2
20735704|20735505|20735604|700333|700666|
$ cat script.awk
BEGIN {
        FS = "|"
        if ( ARGC != 3 ) {
                print "Usage: awk -f script.awk <file1> <file2>"
                exit 0
        }

        dummy = ARGV[1]
        ARGV[1] = ARGV[2]
        ARGV[2] = dummy
}

## Process 'file2' of input parameters.
FNR == NR {
        for ( i = 1; i <= NF; i++ ) {
                if ( length( $i ) > 0 ) {
                        comparison[ $i ] = 1
                }

        }
}

## Process 'file1' of input parameters.
FNR < NR {
        for ( i = 2; i <= 42; i += 2 ) {
                if ( comparison[ $i ] ) {
                        multiples = multiples " " $i
                }
        }

        if ( length( multiples ) > 0 ) {
                multiples = "Multiple on" multiples "|"
        } else {
                multiples = "No multiple data|"
        }

        print $0 multiples
        multiples = ""
}
$ awk -f script.awk file1 file2 | cat -n
     1  8388014643|20735704|20100713140734|20735505|20100713152501|700111|20110226085427|400501|20100329165139|300046|20100329165139|400400|20100329165139|400000|20100329165139|||||||||||||||||||||||||||Multiple on 20735704 20735505|
     2  8388040001|20735505|20110916121248|20735604|20110916121248|700333|20110916121248|400400|20110916121248|500510|20110916121248|||||||||||||||||||||||||||||||Multiple on 20735505 20735604 700333|
     3  8388084999|700333|20101018171805|700666|20100329175101|20735604|20100329175101|400501|20100329175101|300001|20100329175101|500510|20100329175101|||||||||||||||||||||||||||||Multiple on 700333 700666 20735604|
     4  83875488100|20735505|20111101130130|20735604|20111011140704|700333|20111011140704|400400|20111011140704|500510|20111011140704|700666|20111011140704|||||||||||||||||||||||||||||Multiple on 20735505 20735604 700333 700666|
     5  83875426992|400400|20111011141108|400000|20111011141108|500510|20111011141108|20735704|20111011141108|400501|20111011141108|||||||||||||||||||||||||||||||Multiple on 20735704|

Regards,
Birei

vivek_d_r · January 9, 2012, 7:13am

i am not good at awk... anyway here is the normal code... since you mentioned 20,00,000 rows... execution time might be more..

file1 has 1st file content file2 has 'multiples'.. file3 is the output file

rm -r file3
while read line
do
        counter=0
        finalNo=$( echo $line | awk '{ split($0,a,"|"); print length(a) - 1 }' )
        line="$line Multiple on: "
        for (( i = 2; i <= $finalNo; i=`expr $i + 2` )) ### Outer for loop ###
        do
                column1=$( echo $line | cut -d'|' -f$i )
                if [[ "$column1" != "" ]]; then
                        noPresent=$( cat file2 | grep -w "$column1" )
                        if [[ "$noPresent" != "" ]] ; then
                                line="$line $column1"
                                counter=`expr $counter + 1`
                        fi
                fi
        done
        if [ $counter -ne 0 ]; then
                 line="$line |"
                echo $line >> file3
        else
                line="$line |No Multiple found|"
                echo $line >> file3
        fi
done < file1

cat file3
8388014643|20735704|20100713140734|20735505|20100713152501|700111|20110226085427|400501|20100329165139|300046|20100329165139|400400|20100329165139|400000|20100329165139||||||||||||||||||||||||||| Multiple on: 20735704 20735505 |
8388040001|20735505|20110916121248|20735604|20110916121248|700333|20110916121248|400400|20110916121248|500510|20110916121248||||||||||||||||||||||||||||||| Multiple on: 20735505 20735604 700333 |
8388084999|700333|20101018171805|700666|20100329175101|20735604|20100329175101|400501|20100329175101|300001|20100329175101|500510|20100329175101||||||||||||||||||||||||||||| Multiple on: 700333 700666 20735604 |
83875488100|20735505|20111101130130|20735604|20111011140704|700333|20111011140704|400400|20111011140704|500510|20111011140704|700666|20111011140704||||||||||||||||||||||||||||| Multiple on: 20735505 20735604 700333 700666 |
83875426992|400400|20111011141108|400000|20111011141108|500510|20111011141108|207357|20111011141108|400501|20111011141108||||||||||||||||||||||||||||||| Multiple on: |No Multiple found|

i have modifed last line in file1 since last line has 20735704 which is a multiple

mirni · January 9, 2012, 7:19am

Try this out:

awk -F\| '
NR==FNR{
   for(i=1;i<=NF;i++)
      a[$i]=$i;
   next}
{m=""; 
  for(i=2;i<=NF; i+=2) 
     if($i && $i in a) 
        m=m " "$i 
  app = m ? "Multiple on " m "|" : "No multiple data |" ; 
  print $0 app 
}' file2.txt file1.txt

But I agree with birei, that in your sample output, the last line should find 1 match.

lurak · January 10, 2012, 12:06am

Thanks for the big help and already tried the code myself and work like charm. The only problem is, actually the last line is correct. There is no multiple because there is only one data of 20735704 that is why it show no multiple.

As you can see below are the 2nd file.

20735704|20735505|20735604|700333|700666|

if there is only one data from the 2nd file there are no multiple. But if there are, let's say 20735704 and 700333 in one line, there are multiple.

Pardon my bad explanation and bad english. Can you guys help me again?

birei · January 10, 2012, 6:20am

So there must exists at least two coincidences to consider a line has multiples, in case of zero or one coincidence it is assumed there are not coincidences. Is like this?

Regards,
Birei

mirni · January 10, 2012, 6:36am

Just put a counter in there:

awk -F\| ' 
NR==FNR{
    for(i=1;i<=NF;i++)
       a[$i]=$i;
    next
} 
{m=""; cnt=0;
   for(i=2;i<=NF; i+=2)
       if($i && $i in a) {
         m=m " "$i 
         cnt++
      }
  app = cnt > 1 ? "Multiple on " m "|" : "No multiple data |" ;
  print $0 app
}' file2.txt file1.txt

Or, you can try to count the spaces in variable 'm' and test for that:

awk -F\| ' 
NR==FNR{
    for(i=1;i<=NF;i++)
       a[$i]=$i;
    next
} 
{m=""; 
   for(i=2;i<=NF; i+=2)
       if($i && $i in a)
         m=m " "$i 
  app = gsub(" "," ",m) > 1 ? "Multiple on " m "|" : "No multiple data |" ;
  print $0 app
}' file2.txt file1.txt

Makes the code shorter, but a little bit more obscure. gsub returns the number of substitutions made.