Grep, count and match two files

I am writing the script below to grep and count the number of occurrences between two tab-delimited files.

I am trying to achieve the following:

1) Extract column 2 and column 3 from the S.txt file and put them in a temp pattern file.
2) Grep and count column 2 in the D.txt file.
3) Compare the counts between the D.txt and S.txt files.
4) Abort if the counts don't match.

Example: APPLES occurs 4 times in D.txt, which matches its count in S.txt.

#!/usr/bin/ksh

SUM_COUNT=`nawk '{if ($0 ~ /^S/) print $2,$3 >"S1.txt" }'` S.txt

for i in S1.txt
do
DETAIL_COUNT=`grep $i D.txt | wc -l`
if [ ${DETAIL_COUNT} -eq ${SUM_COUNT} ]
then
     echo "Count between Detail and Summary matches"
else
     echo "Count didn't match"
      exit
fi
done

The script goes into a loop and never exits. I am not sure if this is the right way to code it.

S.txt

S       APPLES                          4
S       ORANGES                         1
S       PEARS                           1
S       PINEAPPLES                      1
S       TOMATOES                        0
S       PEPPERS                         1

D.txt

D       PINEAPPLES
D       ORANGES
D       PEARS
D       APPLES
D       APPLES
D       APPLES
D       APPLES
D       PEPPERS

I am still in the learning phase and would appreciate any input on this..

Gosh, I hope this isn't homework-related, but this mishmash of commands was too much for me. It's way too easy to do this kind of stuff right in awk by itself. Take a look at this quick example (it just so happens I helped a coworker with a nearly identical business-related problem on very large flat files):

nawk '
    # Compile summary array
    FILENAME=="S.txt" {
        SKeys[$2]=$3
    }

    # Compile details array
    FILENAME=="D.txt" {
        DKeys[$2]++
    }

    END {
        # Look for summary records that do not match detail counts
        for (i in SKeys) {
            if (SKeys[i] != DKeys[i]) {
                # Add a record to the merged set
                MKeys[i]++
            }
        }
        # Look for detail counts where a summary record is missing
        for (i in DKeys) {
            if (SKeys[i] != DKeys[i]) {
                # Add a record to the merged set
                MKeys[i]++
            }
        }

        # Print a merged set of records
        printf("%-20s %10s %10s\n", "Fruit", "Summary", "Detail")
        printf("%-20s %10s %10s\n", "====================", "==========", "==========")

        for (i in MKeys) printf("%-20s %10d %10d\n", i, SKeys[i], DKeys[i])
    }
' S.txt D.txt

results:

Fruit                   Summary     Detail
==================== ========== ==========
BACON                         0          8
APPLES                        4          6

S.txt

S       APPLES                          4 
S       ORANGES                         1 
S       PEARS                           1 
S       PINEAPPLES                      1 
S       TOMATOES                        0 
S       PEPPERS                         1 

D.txt

D       PINEAPPLES 
D       ORANGES 
D       PEARS 
D       APPLES 
D       APPLES 
D       APPLES 
D       BACON
D       BACON
D       BACON
D       APPLES 
D       APPLES 
D       APPLES 
D       PEPPERS 
D       BACON
D       BACON
D       BACON
D       BACON
D       BACON

Thank you, Thomas. It is absolutely not a homework problem; it is an ACR validation check for a bank. We have an authorization file that has partner names in the Detail text file, and a Summary file with partner names and the count of records for each partner name.

We get this file every week and we need to validate the count of records for each partner name (the keys in your code).

The Detail file is almost 15 million records and the Summary file has about 4 or 5 records.

I will take your example and test it out on these huge files tomorrow...

Try this one, a much simpler one:

#! /usr/bin/ksh

index=1
tmp=1

for val in `sed 's/^.*  //' d.txt`
do
  if [ $tmp -eq 1 ]
  then
     fruit[$index]=$val
     cnt[$index]=0
     tmp=$(($tmp + 1))
  fi
  temp=1
  while [ $temp -le $index ]
  do
     if [ ${fruit[$temp]} = $val ]
     then
        cnt[$temp]=$((${cnt[$temp]} + 1))
        break
     fi
     temp=$(($temp + 1))
  done
if [ $temp -gt $index ]
  then
     index=$(($index + 1))
     fruit[$index]=$val
     cnt[$index]=$((${cnt[$index]} + 1))
  fi
done
awk '{print $2, $3}' s.txt | while read first second
do
temp=1
while [ $temp -le $index ]
do
  if [ ${fruit[$temp]} = $first ]
  then
     if [ cnt[$temp] -eq $second ]
     then
        print ${fruit[$temp]} ${cnt[$temp]}
        break
     else
        exit 1
     fi
  fi
  temp=$(($temp + 1))
done
if [ $temp -gt $index ]
then
exit 1
fi
done

exit 0

Try

while read kk FRUIT NUMBER
do
if [ $(grep -c $FRUIT D.txt) -ne $NUMBER ]
then
   echo " $FRUIT Does not match."
   exit
fi
done<S.txt

The above point is not met.

The entire file is parsed unnecessarily each time; this could have been avoided.

Thank you very much for the help..

I tested Thomas's script and it produced no output somehow. It also didn't abort.

I tried Madan's script with the actual file and it produced output like this:

parse.ksh[22]: no space

I need to compare the count of records for each partner name from the Detail file (which is the authoritative source) with the count number from the Summary file.

If the Detail file has records like..

D       MIDWEST FIRSTNAME    LASTNAME              209 N SANBORN AVE               JEFFERSON
D       MIDWEST FIRSTNAME    LASTNAME              209 N SANBORN AVE               JEFFERSON
D       CAESAR	FIRSTNAME    LASTNAME              209 N SANBORN AVE               JEFFERSON
D       AIRTRAN FIRSTNAME    LASTNAME              209 N SANBORN AVE               JEFFERSON
D       AIRTRAN FIRSTNAME    LASTNAME              209 N SANBORN AVE               JEFFERSON
D       AIRTRAN FIRSTNAME    LASTNAME              209 N SANBORN AVE               JEFFERSON
D       CAESAR  FIRSTNAME    LASTNAME              209 N SANBORN AVE               JEFFERSON
D       MIDWEST FIRSTNAME    LASTNAME              209 N SANBORN AVE               JEFFERSON
D       MIDWEST FIRSTNAME    LASTNAME              209 N SANBORN AVE               JEFFERSON

Summary File

S       MIDWEST                         4
S       ORBITZ                          0
S       AIRTRAN                         3
S       FRONTIER                        0
S       CAESAR                          2

The Detail file should be compared with the Summary file. For example, if MIDWEST has 4 records in the Detail file and the Summary file lists MIDWEST with a count of 4, then it is a PASS; otherwise it is a FAIL.

I am not sure at this point which one I should be following. If you can please help me on this, it would be really appreciated.

Thank You,
Madhu

Agreed, my awk solution missed requirement 4, but consider this: it does have additional benefits, such as:

  1. Complete analysis reporting based on both available summary and available detail information.

    Presumably the test is being performed as a control mechanism for some other automated task, and certainly a break must occur once a problem is identified; however, based on my experience, at some point more specific analysis will need to be performed regardless of how the failure affects the driving process. Since a complete analysis can simply be obtained at run-time (as is true in this case), failing the process and producing a complete analysis report simultaneously may be worth the extra "effort" over simply failing the process.
  2. Speed over the shell script solution.

    Replicating the same D.txt lines so that 150,000 records exist in the file produced a 22-second execution time before the shell script determined that it needed to exit (more than 150,000 records caused the shell script to fail on my AIX system). The awk solution completed a total analysis from both summary and detail perspectives in less than 1 second. Now, given this, which solution processes unnecessarily? Also, if an exit code is all that is necessary, the awk solution can exit as soon as a mismatch occurs (albeit after collecting data from the entire detail file).

In this case, missing the letter of the requirement is probably forgivable.

Whoops, look at your data sample. My awk solution didn't print anything because all record counts matched. It was designed to isolate mismatches only. You can change it if you want a complete report as follows:

nawk '
    # Compile summary array
    FILENAME=="S.txt" {
        SKeys[$2]=$3
    }

    # Compile details array
    FILENAME=="D.txt" {
        DKeys[$2]++
    }

    END {
        # Look for summary records that do not match detail counts
        for (i in SKeys) {
            if (SKeys[i] != DKeys[i]) {
                # Add a record to the merged set
                MismatchedKeys[i]++
            } else {
                MatchedKeys[i]++
            }
        }
        # Look for detail counts where a summary record is missing
        for (i in DKeys) {
            if (SKeys[i] != DKeys[i]) {
                # Add a record to the merged set
                MismatchedKeys[i]++
            } else {
                MatchedKeys[i]++
            }
        }

        # Print a merged set of records
        print "Mismatched summary:"
        printf("%-20s %10s %10s\n", "Fruit", "Summary", "Detail")
        printf("%-20s %10s %10s\n", "====================", "==========", "==========")

        for (i in MismatchedKeys) printf("%-20s %10d %10d\n", i, SKeys[i], DKeys[i])

        print ""
        print "Matched summary:"
        printf("%-20s %10s %10s\n", "Fruit", "Summary", "Detail")
        printf("%-20s %10s %10s\n", "====================", "==========", "==========")
        for (i in MatchedKeys) printf("%-20s %10d %10d\n", i, SKeys[i], DKeys[i])
    }
' S.txt D.txt

results:

Mismatched summary:
Fruit                   Summary     Detail
==================== ========== ==========

Matched summary:
Fruit                   Summary     Detail
==================== ========== ==========
AIRTRAN                       3          3
ORBITZ                        0          0
FRONTIER                      0          0
CAESAR                        2          2
MIDWEST                       4          4

This is my experience also; probably a variable is being overloaded, but I have not analyzed the script to find out for sure.

A more concise report can be achieved as follows:

nawk '
    # Compile summary array
    FILENAME=="S.txt" {
        SKeys[$2]=$3
        GlobalKeys[$2]++
    }

    # Compile details array
    FILENAME=="D.txt" {
        DKeys[$2]++
        GlobalKeys[$2]++
    }

    END {
        printf("%-20s %-10s %-10s %-15s\n", "Partner", "Summary", "Detail", "Count Mismatch")
        printf("%-20s %10s %10s %15s\n", "====================", "==========", "==========", "===============")

        # Look for summary records that do not match detail counts
        for (i in GlobalKeys) {
            printf("%-20s %10d %10d %-15s\n", i, SKeys, DKeys, (DKeys == SKeys ? "" : "<==== Error ==="))
        }
    }
' S.txt D.txt

Note: I added one more "MIDWEST" and a "BACON" record to D.txt to show the mismatch flag.
Report:

Partner              Summary    Detail     Count Mismatch 
==================== ========== ========== ===============
AIRTRAN                       3          3                
ORBITZ                        0          0                
FRONTIER                      0          0                
BACON                         0          1 <==== Error ===
CAESAR                        2          2                
MIDWEST                       4          5 <==== Error ===

Testing the two methods with 100,000 records produced 6.11 seconds for the shell script and 0.80 seconds (subsecond) for the awk script.

That is awesome Thomas.....

Please see the report...

Mismatched summary:
Partner                 Summary     Detail
==================== ========== ==========

Matched summary:
Partner                 Summary     Detail
==================== ========== ==========
AIRTRAN                 3191969    3191969
ORBITZ                  5995609    5995609
FRONTIER                1672209    1672209
CAESAR                        0          0
MIDWEST                 1577373    1577373
BESTWESTERN              582813     582813

real    3m40.28s
user    1m45.91s
sys     0m9.46s

I have added time to see how long it takes. I tested it on the real file and it is perfect.

Please find the code below. I have tried to parameterize it, but I am not sure how to do that in nawk.

#!/usr/bin/ksh

if [ $# -ne 3 ]
then
   echo " "
   echo " Incorrect number of parameters entered..."
   echo " Correct usage: " $0 "<DIR> <SUMMARY FILE> <DETAIL FILE>"
   echo " "
   exit 1
fi


DIR=$1
SUMMARY_FILE=$2
DETAIL_FILE=$3

cd ${DIR}

time {
nawk '
    # Compile summary array
    FILENAME=="$(SUMMARY_FILE}" {
        SKeys[$2]=$3
    }

    # Compile details array
    FILENAME=="$(DETAIL_FILE}" {
        DKeys[$2]++
    }

    END {
        # Look for summary records that do not match detail counts
        for (i in SKeys) {
            if (SKeys[i] != DKeys[i]) {
                # Add a record to the merged set
                MismatchedKeys[i]++
            } else {
                MatchedKeys[i]++
            }
        }
        # Look for detail counts where a summary record is missing
        for (i in DKeys) {
            if (SKeys[i] != DKeys[i]) {
                # Add a record to the merged set
                MismatchedKeys[i]++
            } else {
                MatchedKeys[i]++
            }
        }

        # Print a merged set of records
        print "Mismatched summary:"
        printf("%-20s %10s %10s\n", "Partner", "Summary", "Detail")
        printf("%-20s %10s %10s\n", "====================", "==========", "==========")

        for (i in MismatchedKeys) printf("%-20s %10d %10d\n", i, SKeys[i], DKeys[i])

        print ""
        print "Matched summary:"
        printf("%-20s %10s %10s\n", "Partner", "Summary", "Detail")
        printf("%-20s %10s %10s\n", "====================", "==========", "==========")
        for (i in MatchedKeys) printf("%-20s %10d %10d\n", i, SKeys[i], DKeys[i])
    }' $(SUMMARY_FILE} $(DETAIL_FILE}
}

Does the code abort with status 1 if there is output in the Mismatched summary?

Thank you again for all the help..

To parameterize this correctly, you can add variables to know which file you are looking at:

nawk -v summary_file=${SUMMARY_FILE} -v detail_file=${DETAIL_FILE} '
    # Compile summary array
    FILENAME==summary_file {
        SKeys[$2]=$3
    }

    # Compile details array
    FILENAME==detail_file {
        DKeys[$2]++
    }

To exit with a non-zero result, add a flag to the "END" procedure:

        # Look for summary records that do not match detail counts
        for (i in SKeys) {
            if (SKeys[i] != DKeys[i]) {
                # Add a record to the merged set
                MismatchedKeys[i]++
                exit_code=1
            } else {
                MatchedKeys[i]++
            }
        }
        # Look for detail counts where a summary record is missing
        for (i in DKeys) {
            if (SKeys[i] != DKeys[i]) {
                # Add a record to the merged set
                MismatchedKeys[i]++
                exit_code=1
            } else {
                MatchedKeys[i]++
            }
        }
        exit (exit_code)
        ...

By the way, I like the last awk script that I provided you since it provides a complete report and it's more concise.

Thank You Thomas...

I used your last script, which is more concise, but somehow I am unable to get the parameters passed to the script.

I have also added the exit status as you mentioned. Can you please take a quick glance and see where I am going wrong here?

#!/usr/bin/ksh

if [ $# -ne 3 ]
then
   echo " "
   echo " Incorrect number of parameters entered..."
   echo " Correct usage: " $0 "<DIR> <SUMMARY FILE> <DETAIL FILE>"
   echo " "
   exit 1
fi


DIR=$1
SUMMARY_FILE=$2
DETAIL_FILE=$3

time {
nawk -v directory=${DIR} -v summary_file=${SUMMARY_FILE} -v detail_file=${DETAIL_FILE} '
    # Compile summary array
    FILENAME==summary_file {
        SKeys[$2]=$3
        GlobalKeys[$2]++
    }

    # Compile details array
    FILENAME==detail_file {
        DKeys[$2]++
        GlobalKeys[$2]++
    }

    END {
        printf("%-20s %-10s %-10s %-15s\n", "Partner", "Summary", "Detail", "Count Mismatch")
        printf("%-20s %10s %10s %15s\n", "====================", "==========", "==========", "===============")

        # Look for summary records that do not match detail counts
        for (i in GlobalKeys) {
            printf("%-20s %10d %10d %-15s\n", i, SKeys, DKeys, (DKeys == SKeys ? "" : "<==== Error ==="))
             _ex=1
            }
    }
' directory/summary_file directory/detail_file
}
Here is a corrected version:

#!/usr/bin/ksh

if [ $# -ne 3 ]
then
   echo " "
   echo " Incorrect number of parameters entered..."
   echo " Correct usage: " $0 "<DIR> <SUMMARY FILE> <DETAIL FILE>"
   echo " "
   exit 1
fi


DIR=$1
SUMMARY_FILE=$2
DETAIL_FILE=$3

time {
nawk -v summary_file=${DIR}/${SUMMARY_FILE} -v detail_file=${DIR}/${DETAIL_FILE}  '
    # Compile summary array
    FILENAME==summary_file {
        SKeys[$2]=$3
        GlobalKeys[$2]++
    }

    # Compile details array
    FILENAME==detail_file {
        DKeys[$2]++
        GlobalKeys[$2]++
    }

    END {
        printf("%-20s %-10s %-10s %-15s\n", "Partner", "Summary", "Detail", "Count Mismatch")
        printf("%-20s %10s %10s %15s\n", "====================", "==========", "==========", "===============")

        # Look for summary records that do not match detail counts
        for (i in GlobalKeys) {
            printf("%-20s %10d %10d %-15s\n", i, SKeys[i], DKeys[i], (DKeys[i] == SKeys[i] ? "" : "<==== Error ==="))
            if (DKeys[i] != SKeys[i]) _ex=1
        }
        exit (_ex)
    }
' ${DIR}/${SUMMARY_FILE} ${DIR}/${DETAIL_FILE}
}

${DIR}, if needed, would be added to the -v values and to the file arguments on the last line along with ${SUMMARY_FILE}, e.g. ${DIR}/${SUMMARY_FILE}.

Also, you need to make the last executable line of the "END" procedure an "exit" as follows: exit (_ex)

That was my mistake; I assumed and worked with the first format given for the Detail file.

Try the one below, this would work fine!

#! /usr/bin/ksh

index=1
tmp=1

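# Pass 1: walk d.txt and build the fruit[] / cnt[] arrays of each unique key and its record count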
for val in `awk '{print $2}' d.txt`
do
  if [ $tmp -eq 1 ]
  then
     fruit[$index]=$val
     cnt[$index]=0
     tmp=$(($tmp + 1))
  fi
  temp=1
  while [ $temp -le $index ]
  do
     if [ ${fruit[$temp]} = $val ]
     then
        cnt[$temp]=$((${cnt[$temp]} + 1))
        break
     fi
     temp=$(($temp + 1))
  done
if [ $temp -gt $index ]
  then
     index=$(($index + 1))
     fruit[$index]=$val
     cnt[$index]=$((${cnt[$index]} + 1))
  fi
done
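# Pass 2: check each summary record against the counted details; exit 1 on a mismatch or a missing key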
awk '{print $2, $3}' s.txt | while read first second
do
temp=1
while [ $temp -le $index ]
do
  if [ ${fruit[$temp]} = $first ]
  then
     if [ cnt[$temp] -eq $second ]
     then
        print ${fruit[$temp]} ${cnt[$temp]}
        break
     else
        exit 1
     fi
  fi
  temp=$(($temp + 1))
done
if [ $temp -gt $index ]
then
exit 1
fi
done

exit 0

I accept that complete analysis prior to termination is needed and that it would be preferred. Well, your solution was more generic! :slight_smile:

It might have failed because the parsing of the Detail file was not correct; I have modified it now.
I tried it with 150009 records and it completed in less than 21 seconds; maybe you could try it once and see.

time ./fine.ksh
MIDWEST 150004
AIRTRAN 3
CAESAR 2

real    0m20.84s
user    0m20.91s
sys     0m0.09s

Prior to that, we cannot determine which solution is faster for the requirement!

I always take the requirement as the higher priority, then work out the optimization and speed once the former is done :slight_smile:

For what it's worth, and in defense of thinking through the performance implications of a scripting solution before coding one:

First, sorry, but I couldn't get your modified script to work on my AIX system with 1 million records without changing the first loop as follows:

while read junk val junk junk junk junk junk junk junk junk junk junk
do
    ...
done < D.txt

This at least eliminates one additional process call to awk but still takes nearly two minutes to execute:

time ksh s2.sh

real    1m57.53s
user    1m57.26s
sys     0m0.10s

There are other methods for improving upon this that can significantly increase the shell script's performance, but it will always fall short of the awk approach due to awk's optimization for text processing.
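One such method, offered only as a rough sketch (it is not something from this thread, and the temp file path is just an example), is to let sort and uniq -c aggregate the detail counts in a single pass so the shell loop only has to walk the few summary lines:

# Hypothetical sketch: aggregate the detail counts once, then check only the summary lines.
nawk '{print $2}' D.txt | sort | uniq -c | nawk '{print $2, $1}' > /tmp/partner_counts.$$
while read junk NAME EXPECTED
do
    ACTUAL=`grep "^${NAME} " /tmp/partner_counts.$$ | nawk '{print $2}'`
    if [ "${ACTUAL:-0}" -ne "${EXPECTED}" ]
    then
        echo "${NAME} does not match"
        rm -f /tmp/partner_counts.$$
        exit 1
    fi
done < S.txt
rm -f /tmp/partner_counts.$$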

The awk solution ran in 10 seconds for the same 1 million records:

 time { ksh s.sh ; print $? ;  }
Partner              Summary    Detail     Count Mismatch 
==================== ========== ========== ===============
AIRTRAN                       3     314565 <==== Error ===
ORBITZ                        0          0                
FRONTIER                      0          0                
BACON                         0     104854 <==== Error ===
CAESAR                        2     209710 <==== Error ===
THOMAS                       44          0 <==== Error ===
MIDWEST                       4     419419 <==== Error ===
1

real    0m10.30s
user    0m10.07s
sys     0m0.12s

Running the test again on 150000 records:
Shell script solution:

 time { ksh s2.sh ; print $? ; }
1

real    0m16.72s
user    0m16.67s
sys     0m0.02s

Awk solution:

 time { ksh s.sh ; print $? ; } 
Partner              Summary    Detail     Count Mismatch 
==================== ========== ========== ===============
AIRTRAN                       3      45000 <==== Error ===
ORBITZ                        0          0                
FRONTIER                      0          0                
BACON                         0      15000 <==== Error ===
CAESAR                        2      30000 <==== Error ===
THOMAS                       44          0 <==== Error ===
MIDWEST                       4      60000 <==== Error ===
1

real    0m1.40s
user    0m1.37s
sys     0m0.02s

The shell script operates at between 8,500 and 9,000 records per second, depending on where the first count mismatch is found. The awk solution operates at between 97,000 and 107,000 records per second.

To be fair, the shell script really isn't trimming much from its own potential workload, since the detail file must be processed fully because a mismatch can occur anywhere in it. The summary file only contains 4 or 5 records, so a few seconds is saved at best. A potential benefit is gained when the summary file grows but, again, it is fairly minimal.

Therefore, the letter of the requirement can still be achieved in the awk solution by adding a test in the END procedure, but your performance gain can be measured in milliseconds. I'll leave it out and seek forgiveness instead :stuck_out_tongue:.
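For reference, such a test would be a minimal sketch along these lines, using the same SKeys/DKeys arrays as above (an illustration only, not code from the runs shown here):

    END {
        # Illustration only: stop with a non-zero status on the first mismatch
        for (i in SKeys) if (SKeys[i] != DKeys[i]) exit 1
        for (i in DKeys) if (!(i in SKeys)) exit 1
        exit 0
    }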

I really don't understand why it's not running on AIX; I tried and tested it on both Solaris and HP-UX and it works fine.

Currently I don't have access to an AIX box, so I don't have the reason why it's not working.

#! /usr/bin/ksh

index=1
tmp=1

for val in `awk '{print $2}' d.txt`
do

A line like:
for val in `awk '{print $2}' d.txt`
when run with 15,000,000 lines in d.txt is asking ksh to construct a command line of over 15,000,000 words. Remember that the command extends until the "done". Once ksh has the entire compound command in memory, it will compile it and execute the compiled code. The compiled code will be almost as long as the original compound command in this case. ksh is willing to attempt this feat; whether or not it succeeds depends on how the kernel was tuned, how much virtual memory is available, how much tmp space is available, etc.

Consider switching to:
awk '{print $2}' d.txt | while read val

This is almost the same thing, but you don't need 2 copies of 15,000,000 items in core simultaneously.

The
while read junk val junk junk junk junk junk junk junk junk junk junk
solution is better still. You no longer have an awk process pumping 15,000,000 words through a pipe. But
while read junk val junk
would do just as nicely. The end of the line accumulates in the last variable of a read regardless of IFS setting.
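For example, a quick ksh illustration of that last point (the sample line here is made up):

print "D MIDWEST FIRSTNAME LASTNAME 209 N SANBORN AVE JEFFERSON" | while read junk val rest
do
    print "key=$val rest=$rest"    # prints: key=MIDWEST rest=FIRSTNAME LASTNAME 209 N SANBORN AVE JEFFERSON
done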

Very useful to know regarding ksh, and I succeeded in running a test on my Sun box with 1.1 million records. I have seen this effect while tracing, but I didn't extend it to 15 million in my mind until you pointed it out.

Again, to beat my previous point to death further still and for what it's worth:

Original shell script (which, interestingly enough, runs faster than "while read junk val junk"):

time { ksh s2.sh ; print $? ; }
1

real    4m45.44s
user    4m43.56s
sys     0m1.32s

Records per second: 3,935
Estimated execution time on 15 million records: slightly longer than 1 hour

Awk solution:

time { ksh s.sh ; print $? ; } 
Partner              Summary    Detail     Count Mismatch 
==================== ========== ========== ===============
AIRTRAN                   45000     336963 <==== Error ===
BACON                     15000     112321 <==== Error ===
CAESAR                    30000     224642 <==== Error ===
THOMAS                        1          0 <==== Error ===
MIDWEST                   60000     449304 <==== Error ===
1

real    0m9.25s
user    0m7.78s
sys     0m0.49s

Records per second: 121,430
Estimated execution time on 15 million records: slightly longer than 2 minutes