Shell script to compare two files

ajiwww · April 27, 2011, 8:06am

I have two files, file A and file B. I need every entry of file A to be compared with file B, line by line. If the entry exists in file B, save it to file C; if not, save it to file D.

Note: all the columns of each line of file A need to be compared, except the last two columns (date and time).

file A

dbclstr-b IXT_Web Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/23/11 11:24:41
ebs-sql1-b EBSCClaimStore Memphis_Prod_SQL_Diff Memphis-Prod-SQL-Inc-Application-Backup 04/24/11 19:58:22
pmemcfdb001-b ERTL Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/23/11 11:03:18

file B

dbclstr-b IXTProd02 Memphis_Prod_SQL_Diff Memphis-Prod-SQL-Inc-Application-Backup 04/24/11 21:49:14
pmemcfdb001-b ERTL Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/23/11 17:51:12
pmemcfdb001-b ERTL Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/23/11 17:47:53
qbnawldb021-b AetnaLTC Memphis_Corp_SQL_Full Memphis-Corp-SQL-Full-Application-Backup 04/23/11 17:45:20
ebs-sql1-b EBSCClaimStore Memphis_Prod_SQL_Diff Memphis-Prod-SQL-Inc-Application-Backup 04/23/11 19:58:22

desired output

file C (entries of file A that exist in file B)

ebs-sql1-b EBSCClaimStore Memphis_Prod_SQL_Diff Memphis-Prod-SQL-Inc-Application-Backup 04/24/11 19:58:22
pmemcfdb001-b ERTL Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/23/11 11:03:18

file D (entries of file A that do not exist in file B)

dbclstr-b IXT_Web Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/23/11 11:24:41

---------- Post updated at 07:06 AM ---------- Previous update was at 07:03 AM ----------

I wrote a script that compares the entries and saves them to fileC if they exist in both files. But I am not able to add a condition that saves the ones that do not exist to fileD.

cat fileB | while read STATUS CLIENT DB POLICY SCHEDULE DATE TIME
do
grep -w "$DB" fileA | grep -w "$CLIENT" | grep -w "$POLICY" | grep -w "$SCHEDULE" | grep -w "$DATE" >> fileC
done

ctsgnb · April 27, 2011, 8:39am

Lazy way but still ...

awk 'NF>2{NF=NF-2;$1=$1}1' fileA | sort >fileA.s
awk 'NF>2{NF=NF-2;$1=$1}1' fileB | sort >fileB.s
comm -12 fileA.s fileB.s >fileC
comm -23 fileA.s fileB.s >fileD

Lakris · April 27, 2011, 3:22pm

Hi,
I'm not sure about this one because I get three lines in fileC instead of 2 as You state, but anyway, it may work as a hint:

>fileC
>fileD
while read a b c d date time; do  
grep "$a $b $c $d" fileB >> fileC || grep "$a $b $c $d" fileA >> fileD
done < fileA

meaning, if there's no hit in fileB do it again on A and put it in fileD. I'm sure it could be a lot cleaner, without calling grep twice for example.
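
On the point about not calling grep twice: a single awk pass can do the same split. This is only a sketch, assuming whitespace-separated fields with the date and time as the last two columns. The function name split_by_key is made up for illustration, and it overwrites fileC and fileD in the current directory.

```shell
#!/bin/sh
# split_by_key A B  (hypothetical helper name)
# Lines of A whose leading fields (all but the last two) also occur in B
# go to fileC; the remaining lines of A go to fileD.
split_by_key() {
    : > fileC
    : > fileD
    awk '
        # Comparison key: every field except the last two (date and time).
        function key(   i, k) {
            k = $1
            for (i = 2; i <= NF - 2; i++) k = k " " $i
            return k
        }
        # First file on the command line (B): remember its keys.
        # Note: this idiom assumes B is not empty.
        NR == FNR { seen[key()] = 1; next }
        # Second file (A): route the whole line, date and time included.
        {
            k = key()
            if (k in seen) print > "fileC"
            else           print > "fileD"
        }
    ' "$2" "$1"
}
```

Called as split_by_key fileA fileB, it starts awk once instead of one grep per record, and the lines written to fileC and fileD are the unmodified lines from fileA.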

Best regards,
Lakris

rbatte1 · April 27, 2011, 8:02pm

You could also try this logic, which builds a temp file from fileB without the last two fields and then just uses grep -f. However, if fileB is large, the script may be a little lacking in performance. I've used the shell's internal code (parameter expansion) rather than some convoluted echo $line through some sort of field counter, subtract two, then echo $line | cut -f -$wanted. That approach spawns several processes for each record trim and is a lot slower, but I've seen it quite a lot elsewhere and probably used it myself too before I found a better way -

#!/bin/ksh


# Strip the last two space-separated fields (date and time) from each line of fileB
while read -r line
do
  printf '%s\n' "${line% * *}"
done < fileB > temp-fileB

grep -f temp-fileB fileA > fileC
grep -vf temp-fileB fileA > fileD

Does this help?

Let us know how you get on

Robin
Liverpool/Blackburn
UK

ajiwww · April 28, 2011, 7:37am

I was able to do it in another way

while read CLIENT DB POLICY SCHEDULE DATE TIME
do
  # grep -q on the last stage keeps the matched lines off the terminal
  if grep -w "$DB" fileB | grep -w "$CLIENT" | grep -w "$POLICY" | grep -qw "$SCHEDULE"
  then echo "$CLIENT $DB $POLICY $SCHEDULE $DATE $TIME" >> fileC
  else echo "$CLIENT $DB $POLICY $SCHEDULE $DATE $TIME" >> fileD
  fi
done < fileA

---------- Post updated at 06:37 AM ---------- Previous update was at 06:30 AM ----------

Now the situation has become more complicated; we need to apply more conditions.

INITIAL SETUP
base condition
values from column 1 to 5 of fileA should match with fileB
;
if matching, put it on fileC and if not fileD

NOW
values from column 1 to 5 of fileA should match with fileB
and
values of column 6 & 7 of fileA are greater than fileB
;
if matching, put it on fileC and if not fileD

I wrote a script:

cat fileA | while read STATUS CLIENT DB POLICY SCHEDULE DATE TIME ; do
  if  ( grep -w "$DB" fileB | grep -w "$CLIENT" | grep -w "$POLICY" | grep -w "$SCHEDULE" ) ; if ( $6 < "'$DATE'"  )  ;  if ( $7 < "'$TIME'" )
  then echo $STATUS $DB $CLIENT $POLICY $SCHEDULE $DATE $TIME >> fileC
  else echo $STATUS $DB $CLIENT $POLICY $SCHEDULE $DATE $TIME >> fileD
  fi
done

but it is erroring out as below:

line 14: syntax error near unexpected token `done'

rbatte1 · April 28, 2011, 8:06am

For a large fileB, you will be spawning lots of grep processes, 4 for each record, and that will take time.

You are also assuming that the dates can be compared so easily. You will need to reformat them so they come out as yyyy/mm/dd, otherwise your comparison would find a date of 15/01/2011 to be "newer" than 10/02/2011.

You could call a conversion for each record, but that could get rather complex. I will have a think. I would still recommend against the grep | grep | grep stuff though. It could cripple your system for seriously sized files.
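
As a rough illustration of the reformatting idea, the date and time can be turned into one sortable number using only shell parameter expansion, so no extra process is spawned per record. This is a sketch under assumptions: the input looks like the thread's data (mm/dd/yy and hh:mm:ss, zero-padded), all years fall in 2000-2099, and the shell's test builtin handles 64-bit integers. The name to_sort_key is hypothetical.

```shell
#!/bin/sh
# to_sort_key DATE TIME  (hypothetical helper)
# Converts "mm/dd/yy" and "hh:mm:ss" into yyyymmddhhmmss so that a plain
# numeric comparison orders the records chronologically.
# Assumption: two-digit years all belong to 2000-2099.
to_sort_key() {
    date_part=$1
    time_part=$2
    mm=${date_part%%/*}      # text before the first slash
    rest=${date_part#*/}     # "dd/yy"
    dd=${rest%%/*}
    yy=${rest#*/}
    hh=${time_part%%:*}      # text before the first colon
    rest=${time_part#*:}     # "mm:ss"
    mi=${rest%%:*}
    ss=${rest#*:}
    printf '20%s%s%s%s%s%s\n' "$yy" "$mm" "$dd" "$hh" "$mi" "$ss"
}

# Example: compare two timestamps taken from the sample files.
a=$(to_sort_key 04/24/11 19:58:22)
b=$(to_sort_key 04/23/11 21:49:14)
if [ "$a" -gt "$b" ]; then
    echo "first timestamp is newer"
fi
```

Putting the year first is what makes the comparison safe: with the fields in their original order, the month or day would dominate the result.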

Robin

ajiwww · April 28, 2011, 8:34am

The files are not so large. Date-wise, yes, you are correct; I need to split and then compare. But once I have a base script, I can modify the date part later. Any idea why the syntax error is occurring?
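
On the syntax error itself: the loop posted earlier opens three if statements on one line but supplies only one then and one fi, so the shell is still waiting for a then when it reaches done. In addition, ( $6 < "'$DATE'" ) is not a comparison; inside those parentheses < is an input redirection. A minimal sketch of the intended structure, with the conditions chained by && inside a single if, could look like this. The function name route and its arguments are placeholders for illustration, and the two numbers stand in for date keys already converted to a comparable form.

```shell
#!/bin/sh
# route KEY FILE NEW OLD  (hypothetical helper, for illustration only)
# Prints "fileC" when KEY occurs in FILE and NEW is numerically greater
# than OLD; otherwise prints "fileD".
# All conditions live in one if, joined with &&, with one then and one fi.
route() {
    if grep -qF -- "$1" "$2" && [ "$3" -gt "$4" ]
    then
        echo fileC
    else
        echo fileD
    fi
}
```

Inside the read loop, the echo lines would be replaced by the existing appends to fileC and fileD.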

ygemici · April 28, 2011, 9:41am

# cat fileA
dbclstr-b IXT_Web Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/23/11 11:24:41
ebs-sql1-b EBSCClaimStore Memphis_Prod_SQL_Diff Memphis-Prod-SQL-Inc-Application-Backup 04/24/11 19:58:22
pmemcfdb001-b ERTL Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/23/11 11:03:18

# cat fileB
dbclstr-b IXTProd02 Memphis_Prod_SQL_Diff Memphis-Prod-SQL-Inc-Application-Backup 04/24/11 21:49:14
pmemcfdb001-b ERTL Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/23/11 17:51:12
pmemcfdb001-b ERTL Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/23/11 17:47:53
qbnawldb021-b AetnaLTC Memphis_Corp_SQL_Full Memphis-Corp-SQL-Full-Application-Backup 04/23/11 17:45:20
ebs-sql1-b EBSCClaimStore Memphis_Prod_SQL_Diff Memphis-Prod-SQL-Inc-Application-Backup 04/23/11 19:58:22

# ./test1.sh
fileC
 
fileD
ebs-sql1-b EBSCClaimStore Memphis_Prod_SQL_Diff Memphis-Prod-SQL-Inc-Application-Backup 04/24/11 19:58:22
pmemcfdb001-b ERTL Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/23/11 11:03:18
pmemcfdb001-b ERTL Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/23/11 11:03:18

# cat test1.sh
>fileC ; >fileD
while read CLIENT DB POLICY SCHEDULE DATE TIME ; do
 while read CLIENTB DBB POLICYB SCHEDULEB DATEB TIMEB ; do
if [ "$CLIENT $DB $POLICY $SCHEDULE" == "$CLIENTB $DBB $POLICYB $SCHEDULEB" ] ; then
if [ $(date -d "$DATEB" '+%s') -lt $(date -d "$DATE" '+%s') ] && [ $(date -d "$TIMEB" '+%s') -lt $(date -d "$TIME" '+%s') ] ; then
    echo "$CLIENT $DB $POLICY $SCHEDULE $DATE $TIME" >> fileC
  else
    echo "$CLIENT $DB $POLICY $SCHEDULE $DATE $TIME" >> fileD
  fi
fi
  done<fileB
done<fileA
echo "fileC" ; more fileC ; echo
echo "fileD" ; more fileD

# ./test2.sh
fileC
 
fileD
ebs-sql1-b EBSCClaimStore Memphis_Prod_SQL_Diff Memphis-Prod-SQL-Inc-Application-Backup 04/24/11 19:58:22
pmemcfdb001-b ERTL Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/23/11 11:03:18
pmemcfdb001-b ERTL Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/23/11 11:03:18

# cat test2.sh
>fileC ; >fileD
while read lA ; do
 while read lB ; do
if [ "$(echo "$lA"|sed 's/\(.*\) [^ ]* [^ ]*/\1/')" == "$(echo "$lB"|sed 's/\(.*\) [^ ]* [^ ]*/\1/')" ] ; then
if [ $(date -d "$(echo "$lB"|sed 's/.* \([^ ]*\) [^ ]*/\1/')" '+%s') -lt $(date -d "$(echo "$lA"|sed 's/.* \([^ ]*\) [^ ]*/\1/')" '+%s') ] &&
[ $(date -d "$(echo "$lB"|sed 's/.* [^ ]* \([^ ]*\)/\1/')" '+%s') -lt $(date -d "$(echo "$lA"|sed 's/.* [^ ]* \([^ ]*\)/\1/')" '+%s') ] ; then
    echo "$lA" >> fileC
  else
    echo "$lA" >> fileD
  fi
fi
  done<fileB
done<fileA
echo "fileC" ; more fileC ; echo
echo "fileD" ; more fileD

# ./test3.sh
fileC
 
fileD
ebs-sql1-b EBSCClaimStore Memphis_Prod_SQL_Diff Memphis-Prod-SQL-Inc-Application-Backup 04/24/11 19:58:22
pmemcfdb001-b ERTL Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/23/11 11:03:18
pmemcfdb001-b ERTL Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/23/11 11:03:18

# cat test3.sh
>fileC ; >fileD
while read lA ; do
 while read lB ; do
if [ "$(echo "$lA"|awk '{print $1 " " $2 " " $3 " " $4}' )" == "$(echo "$lB"|awk '{print $1 " " $2 " " $3 " " $4}' )" ] ; then
if [ $(date -d "$(echo "$lB"|awk '{print $5}')" '+%s') -lt $(date -d "$(echo "$lA"|awk '{print $5}')" '+%s') ] &&
[ $(date -d "$(echo "$lB"|awk '{print $6}')" '+%s') -lt $(date -d "$(echo "$lA"|awk '{print $6}')" '+%s') ] ; then
    echo "$lA" >> fileC
  else
    echo "$lA" >> fileD
  fi
fi
  done<fileB
done<fileA
echo "fileC" ; more fileC ; echo
echo "fileD" ; more fileD

regards
ygemici

ajiwww · April 29, 2011, 2:08am

I tried test1.sh and test3.sh and neither is working; they keep looping, produce no fileC, and dump lots of data into fileD.

Please try with the files below.

CONDITIONS

values from column 1 to 5 of fileA should match with fileB
and
values of column 6 & 7 of fileA are greater than fileB
;
if matching, put the matching entries of fileB in fileC; if not, put the unmatched entries of fileA in fileD

fileA

150 dbclstr-b IXT_Web Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/23/11 11:24:41
129 ebs-sql1-b EBSCClaimStore Memphis_Prod_SQL_Diff Memphis-Prod-SQL-Inc-Application-Backup 04/24/11 19:58:22
6 pmemcfdb001-b ERTL Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/23/11 11:03:18

fileB

0 dbclstr-b IXT_Web Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/22/11 11:24:41
0 dbclstr-b IXT_Web Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/23/11 09:24:41
0 dbclstr-b IXT_Web Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/23/11 11:24:41
0 dbclstr-b IXT_Web Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/23/11 12:24:41
0 dbclstr-b IXT_Web Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/23/11 11:29:41
0 dbclstr-b IXT_Web Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/23/11 11:24:55
0 dbclstr-b IXT_Web Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/24/11 11:24:41
0 ebs-sql1-b EBSCClaimStore Memphis_Prod_SQL_Diff Memphis-Prod-SQL-Inc-Application-Backup 04/23/11 19:58:22
0 ebs-sql1-b EBSCClaimStore Memphis_Prod_SQL_Diff Memphis-Prod-SQL-Inc-Application-Backup 04/24/11 09:58:22
0 ebs-sql1-b EBSCClaimStore Memphis_Prod_SQL_Diff Memphis-Prod-SQL-Inc-Application-Backup 04/24/11 19:58:22
0 ebs-sql1-b EBSCClaimStore Memphis_Prod_SQL_Diff Memphis-Prod-SQL-Inc-Application-Backup 04/24/11 21:58:22
0 ebs-sql1-b EBSCClaimStore Memphis_Prod_SQL_Diff Memphis-Prod-SQL-Inc-Application-Backup 04/24/11 19:59:22
0 ebs-sql1-b EBSCClaimStore Memphis_Prod_SQL_Diff Memphis-Prod-SQL-Inc-Application-Backup 04/24/11 19:58:32
0 ebs-sql1-b EBSCClaimStore Memphis_Prod_SQL_Diff Memphis-Prod-SQL-Inc-Application-Backup 04/25/11 19:58:22
0 pmemcfdb001-b ERTL Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/22/11 11:03:18
0 pmemcfdb001-b ERTL Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/23/11 01:03:18
0 pmemcfdb001-b ERTL Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/23/11 11:03:18

DESIRED OUTPUT
fileC (if all the conditions are met, put the matching entries of fileB in fileC)

0 dbclstr-b IXT_Web Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/23/11 12:24:41
0 dbclstr-b IXT_Web Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/23/11 11:29:41
0 dbclstr-b IXT_Web Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/23/11 11:24:55
0 dbclstr-b IXT_Web Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/24/11 11:24:41
0 ebs-sql1-b EBSCClaimStore Memphis_Prod_SQL_Diff Memphis-Prod-SQL-Inc-Application-Backup 04/24/11 21:58:22
0 ebs-sql1-b EBSCClaimStore Memphis_Prod_SQL_Diff Memphis-Prod-SQL-Inc-Application-Backup 04/24/11 19:59:22
0 ebs-sql1-b EBSCClaimStore Memphis_Prod_SQL_Diff Memphis-Prod-SQL-Inc-Application-Backup 04/24/11 19:58:32
0 ebs-sql1-b EBSCClaimStore Memphis_Prod_SQL_Diff Memphis-Prod-SQL-Inc-Application-Backup 04/25/11 19:58:22
0 pmemcfdb001-b ERTL Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/23/11 12:03:18
0 pmemcfdb001-b ERTL Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/23/11 11:13:18
0 pmemcfdb001-b ERTL Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/23/11 11:03:28
0 pmemcfdb001-b ERTL Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/24/11 11:03:18

fileD (if any condition is not met, put the unmatched entries of fileA in fileD)

6 pmemcfdb001-b ERTL Memphis_Prod_SQL_Full Memphis-Prod-SQL-Full-Application-Backup 04/23/11 11:03:18
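
One possible reading of the conditions above, sketched in a single awk pass: a fileB row goes to fileC when its columns 2 to 5 match a fileA row and its date and time are later than that fileA row's; a fileA row with no such later fileB row goes to fileD. Column 1 is left out of the key because it differs between the two sample files. These are assumptions rather than a confirmed specification, as are the function name split_newer, the 2000-2099 year range, and each key appearing only once in fileA.

```shell
#!/bin/sh
# split_newer A B  (hypothetical helper name)
# Writes fileC and fileD in the current directory, overwriting both.
split_newer() {
    : > fileC
    : > fileD
    awk '
        # Sortable stamp from mm/dd/yy and hh:mm:ss, e.g. "2011042311:24:41".
        # Assumes zero-padded fields and years in 2000-2099.
        function stamp(d, t,   p) {
            split(d, p, "/")
            return "20" p[3] p[1] p[2] t
        }
        # Match key: columns 2 to 5 (column 1 differs between the files).
        function key() { return $2 " " $3 " " $4 " " $5 }

        # First file (A): remember stamp, full line and input order per key.
        # Note: this idiom assumes A is not empty.
        NR == FNR {
            k = key()
            ts[k] = stamp($6, $7)
            line[k] = $0
            order[++n] = k
            next
        }
        # Second file (B): keep rows that are newer than the matching A row.
        {
            k = key()
            if ((k in ts) && stamp($6, $7) > ts[k]) {
                print > "fileC"
                hit[k] = 1
            }
        }
        # A rows that never had a newer B row end up in fileD.
        END {
            for (i = 1; i <= n; i++)
                if (!(order[i] in hit)) print line[order[i]] > "fileD"
        }
    ' "$1" "$2"
}
```

Called as split_newer fileA fileB. The stamps are fixed-width strings, so awk's string comparison orders them chronologically without calling date for every pair of lines.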