Challenging Compare and validate question -- plus speed.

madhunk · May 22, 2006, 12:14pm

I have a tab delimited HUGE file (13 million records) with Detail, Metadata and Summary records.

Sample File looks like this

M BESTWESTERN 4 ACTIVITY_CNT_L12 A 3
M AIRTRAN 4 ACTIVITY_CNT_L12 A 3
D BESTWESTERN FIRSTNAME LASTNAME 209 N SANBORN AVE
D BESTWESTERN FIRSTNAME LASTNAME 6997 COUNTY ROAD D
D AIRTRAN FIRSTNAME LASTNAME 6997 COUNTY ROAD D
S BESTWESTERN 2
S AIRTRAN 2

I have split the file into three different files.

Metadata file
Detail file
Summary file
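
For reference, a split like this can be done in a single pass with awk. The following is only a sketch: the output names meta.txt, detail.txt and summary.txt and the sample records are assumptions, and plain awk is used where the thread later uses nawk on Solaris.

```shell
# Small tab-delimited sample that mirrors the layout above (illustrative data).
workdir=$(mktemp -d)
printf 'M\tBESTWESTERN\t4\nD\tBESTWESTERN\tFIRST\tLAST\nD\tAIRTRAN\tFIRST\tLAST\nS\tBESTWESTERN\t2\n' > "$workdir/input.txt"

# One pass over the input: route each record to a file chosen by its first field.
# Records whose type is not M, D or S are silently skipped in this sketch.
awk -F'\t' -v dir="$workdir" '
  BEGIN {
    out["M"] = dir "/meta.txt"      # hypothetical output names
    out["D"] = dir "/detail.txt"
    out["S"] = dir "/summary.txt"
  }
  ($1 in out) { print > (out[$1]) }
' "$workdir/input.txt"

wc -l "$workdir/meta.txt" "$workdir/detail.txt" "$workdir/summary.txt"
```

Reading the big file once and writing three outputs avoids three separate grep passes over 13 million records.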

The challenge is to check whether the information in the Metadata records exists in the Detail record file. The names are not constant and WILL change with every incoming file.

1) The script needs to dynamically check the column in the Metadata record file that contains, for example 'BESTWESTERN' and 'AIRTRAN' and make sure that it also exists in the detail record file.

This is a huge file and I need to know the fastest way to process it.

What is the best way to approach this dynamically changing file?
Please advise...

Thank You,
Madhu

madhunk · May 22, 2006, 1:01pm

The best option that I could find is to do it this way...

#!/usr/bin/ksh

# put the second column into a file,
# make it unique values

awk '{ print $2 } ' file1 | sort -u > patternfile
# loop thru the patterns from file1
# look for them in file2, print patterns if not found
while read pattern1 pattern2
do
if [ ${#pattern2} -eq 0 ]; then # skip when pattern2 isn't there
continue
fi
grep "$pattern1" file2 | grep -q "$pattern2"
if [ $? -ne 0 ]; then
echo "$pattern1" "$pattern2"
fi
done < patternfile

If I change this script so that it aborts when it doesn't find a pattern, would that be fine?

vgersh99 · May 22, 2006, 3:07pm

try something like this:
nawk -f mad.awk file1 file1

mad.awk:

BEGIN {
  outMeta="meta.txt"
  outData="data.txt"
  outSumm="summ.txt"

  stderr="cat 1>&2"
}
FNR==NR{
  if ($1 == "D") data[$2];
  next
}

{
  if ($1 != "D" && ($2 in data) )
     print $0 >> (($1 == "M") ? outMeta : outSumm)
  else if ( $1 != "D" )
        printf("WARNING::[%d]: Meta or Summary is NOT in Data: [%s]\n", FNR, $2) | stderr

  if ($1 == "D" )
     print $0 >> outData
}

madhunk · May 22, 2006, 3:23pm

Thank you for the message. I am not sure if I have communicated correctly. But I am looking to do something like this:

Steps:
1) Put second column from file 1 (tab delimited Metadata file) into a pattern file.
2) Count the number of patterns and print the patterns.
3) Loop through the pattern file from file1 and look for those patterns in file2 (tab delimited Detail records file).
4) If a pattern is not found in file2, print the particular pattern that was not found and abort.

I came up with something like this, but it is going wrong somewhere... Any ideas will be very much appreciated.

#!/usr/bin/ksh

# put the second column into a file,
# make it unique values

awk '{ print $2 } ' file1 | sort -u > patternfile
# loop thru the patterns from file1
# look for them in file2
while read pattern1
do
/usr/xpg4/bin/grep -q "$pattern1" file2
rc=$?
if [ ${rc} -eq 0 ]; then
echo "Pattern found in file2 -- Successful"
else
echo "Pattern "$Pattern" not found in file2, Failed"
fi
done < patternfile
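
Two things stand out in the loop above: the name is read into pattern1 but echoed as $Pattern (which is never set), and nothing stops the loop on a miss. A minimal sketch of a corrected version follows, assuming tab-delimited files and using plain POSIX grep instead of /usr/xpg4/bin/grep. The sample data and the check_patterns function name are illustrative only.

```shell
workdir=$(mktemp -d)
printf 'M\tORBITZ\t8\nM\tMIDWEST\t5\n' > "$workdir/file1"   # metadata sample
printf 'D\tORBITZ\tFIRST\tLAST\n'      > "$workdir/file2"   # detail sample

# Unique names from column 2 of the metadata file.
awk -F'\t' '{ print $2 }' "$workdir/file1" | sort -u > "$workdir/patternfile"

check_patterns() {
  while read -r pattern1
  do
    # -F: treat the name as a fixed string, -q: no output, status only.
    # Note this matches the name anywhere on a line, not just in column 2.
    if grep -F -q -- "$pattern1" "$workdir/file2"; then
      echo "Pattern $pattern1 found in file2 -- Successful"
    else
      echo "Pattern $pattern1 not found in file2, Failed"
      return 1    # stop at the first missing pattern
    fi
  done < "$workdir/patternfile"
}

rc=0
out=$(check_patterns) || rc=$?
echo "$out"
echo "exit status: $rc"
```

This still rescans file2 once per pattern, so on a 13 million record file a single awk pass is likely to be faster.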

vgersh99 · May 22, 2006, 4:25pm

why do you need to count them and why do you need to print them?

hmmm...... I thought there was just ONE file.
Now you're saying there are TWO files?

It might be a good idea to post sample INPUT file(s???) [if there are multiple] and, instead of outlining the algorithm, outline what needs to be done AND a sample end-result given the sample input file(s).

Also pls use vB codes when posting code and/or quotes - it makes reading the posting much easier.

madhunk · May 22, 2006, 4:51pm

Hi vgersh99,

The printing is only for display purposes, to see how many partner names the metadata file has...

I am sorry for the miscommunication.

Sample Input File1 (Metadata File)
Sample Input File2 (Detail File)

The Metadata File has names such as ORBITZ, BESTWESTERN and so on. They should also exist in the Detail File. A comparison needs to be made. In case they don't exist, the script should fail.

The current code cuts the names and puts that into a temporary file. Then it loops and checks the
existence of these names in the Detail file. If any of the names doesn't exist, then the
script should abort.

I am getting confused about the looping process here...Is this the right way to work through the solution?
Moreover, the detail file in reality has 13 million records.

#!/usr/bin/ksh

# put the second column into a file,
# make it unique values

awk '{ print $2 } ' file1 | sort -u > patternfile
# loop thru the patterns from file1
# look for them in file2
while read pattern1
do
grep "$pattern1" file2
rc=$?
if [ ${rc} -eq 0 ]; then
echo "Pattern found in file2 -- Successful"
else
echo "Pattern "$Pattern" not found in file2, Failed"
fi
done < patternfile

Also attached are the sample files...

Sample Metadata file

M ORBITZ 8 LAST_BOOKED_DATE D
M AIRTRAN 8 TRIPS_YTD A 11
M FRONTIER 5 FLT_COUNT N
M CAESAR 7 DAYSPLAYED A 9
M BESTWESTERN 4 ACTIVITY_CNT_L12 A

Sample Detail file

D BESTWESTERN FIRST LAST 10545 WILLOWS RD NE
D ORBITZ FIRST LAST 550 N CENTRAL ROWIE AZ
D AIRTRAN FIRST LAST 6755B WILLOW BROOK PARK # P
D FRONTIER FIRST LASTNAME PO BOX 370
D CAESAR FIRST LAST 2113 CRIMSCENDDR # 10

I hope I am clear this time...

vgersh99 · May 22, 2006, 5:03pm

ok - not tested.

nawk -f mad.awk DetailFile.txt MetadataFile.txt

mad.awk:

FNR==NR{
   detail[$2]
   next
}
{ 
  printf("Meta [%s] %s found in Detail-- %s\n",  $2, ($2 in detail) ? "" : "NOT",  ($2 in detail) ? "Successful" : "Failed")
}

madhunk · May 22, 2006, 5:09pm

Thank you very much...It did work perfectly...

I would like to give the script a .ksh extension since I have #!/usr/bin/ksh

Thanks again for all the help.. Please advise...

vgersh99 · May 22, 2006, 5:22pm

#!/usr/bin/ksh

nawk '
  FNR==NR{
     detail[$2]
     next
  }
  { 
    printf("Meta [%s] %s found in Detail-- %s\n",  $2, ($2 in detail) ? "" : "NOT",  ($2 in detail) ? "Successful" : "Failed")
  }' DetailFile.txt MetadataFile.txt

madhunk · May 22, 2006, 5:53pm

Thanks again....I did follow your code and added it into the .ksh script.

nawk '
  FNR==NR{
     detail[$2]
     next
  }
  {
    printf("Metadata Partner Name [%s] %s found in Detail File-- %s\n",  $2, ($2 in detail) ? "" : "NOT",  ($2 in detail) ? "Successful" : "Failed")
  }' ${DETAIL_FILE} ${METADATA_FILE}

RC=$?

if [ $RC -ne 0 ]
then 
   echo "*** Comparison Failed. Aborting Script... ***"
   exit $RC
else
   echo "*** Comparison Completed ***"
   echo "*** Partner Files compared Successfully ***"
fi

It is not aborting though it did not find the name...

        Detail File file2 to be compared found
Metadata Partner Name [ORBITZ]  found in Detail File-- Successful
Metadata Partner Name [AIRTRAN]  found in Detail File-- Successful
Metadata Partner Name [FRONTIER]  found in Detail File-- Successful
Metadata Partner Name [CAESAR]  found in Detail File-- Successful
Metadata Partner Name [BEST] NOT found in Detail File-- Failed

*** Comparison Completed ***
*** Partner Files compared Successfully ***

Please advise how to abort the flow if it fails to find a name.

vgersh99 · May 22, 2006, 6:19pm

#!/usr/bin/ksh

nawk '
  FNR==NR{
     detail[$2]
     next
  }
  { 
     if ( $2 in detail)
        printf("Meta [%s] found in Detail-- Successful\n",  $2)
     else {
        printf("Meta [%s] NOT found in Detail-- Failed\n",  $2)
        _ex=1
     }
  }
  END { exit(_ex)}' DetailFile.txt MetadataFile.txt

madhunk · May 23, 2006, 10:28am

Thank You vgersh99...

I have tested the script on a file that has 13 million records and it took 2.5 minutes.

Just another quick question...

Is there any way to enhance the script?

For example: The Metadata file has name 'AIRTRAN AIRWAYS' but in the Detail File it is
listed as 'AIRTRAN'. Can we make this a pass rather than failure?

When comparing the Metadata names with Detail record names, it should pass on these
conditions.

Can we create a control file like:

AIRTRAN: AIRTRAN AIRWAYS, AIRTRAN, AIRTRAN AIR
MIDWEST: MIDWEST AIRLINES, MIDWEST AIR

and look this up and pass the script....If it doesn't find anything related, the script should be aborted.

I am not sure how I can do this....If you have any idea, please let me know.
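
A rough sketch of how such a control file lookup might work is shown below. It assumes the control file uses exactly the "NAME: alias, alias" layout above and that the data files are split on tabs (so a name may contain spaces). The file names control.txt, detail.txt and meta.txt and the sample records are made up for illustration.

```shell
workdir=$(mktemp -d)
printf 'AIRTRAN: AIRTRAN AIRWAYS, AIRTRAN, AIRTRAN AIR\nMIDWEST: MIDWEST AIRLINES, MIDWEST AIR\n' > "$workdir/control.txt"
printf 'D\tAIRTRAN\tFIRST\tLAST\nD\tORBITZ\tFIRST\tLAST\n' > "$workdir/detail.txt"
printf 'M\tAIRTRAN AIRWAYS\t8\nM\tORBITZ\t8\nM\tMIDWEST AIR\t5\n' > "$workdir/meta.txt"

rc=0
result=$(awk -F'\t' '
  # File 1: control file. Map every alias to its canonical name.
  FILENAME == ARGV[1] {
    n = index($0, ":")
    if (n == 0) next
    canon = substr($0, 1, n - 1)
    cnt = split(substr($0, n + 1), parts, ",")
    for (i = 1; i <= cnt; i++) {
      a = parts[i]
      gsub(/^[ \t]+|[ \t]+$/, "", a)   # trim blanks around each alias
      alias[a] = canon
    }
    next
  }
  # File 2: detail file. Remember every partner name seen in column 2.
  FILENAME == ARGV[2] { detail[$2]; next }
  # File 3: metadata file. Translate through the alias table, then look up.
  {
    name = ($2 in alias) ? alias[$2] : $2
    if (name in detail)
      printf("Meta [%s] found in Detail as [%s] -- Successful\n", $2, name)
    else {
      printf("Meta [%s] NOT found in Detail -- Failed\n", $2)
      bad = 1
    }
  }
  END { exit bad + 0 }
' "$workdir/control.txt" "$workdir/detail.txt" "$workdir/meta.txt") || rc=$?

echo "$result"
echo "exit status: $rc"
```

Names that are absent from the control file fall back to an exact match, so partners such as ORBITZ need no entry.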

vgersh99 · May 23, 2006, 10:45am

The above works fine as long as the name in the METAfile starts with same name as it appears in the DETAIL file: 'AIRTRAN AIRWAYS' in metaFile; 'AIRTRAN' in detailFile.
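
The reason is presumably awk's default field splitting: without -F, fields are separated by any run of blanks or tabs, so $2 of a record holding 'AIRTRAN AIRWAYS' is just 'AIRTRAN'. A small sketch with one made-up record shows the difference between the default separator and an explicit tab separator.

```shell
# One tab-delimited metadata record whose name contains a space (illustrative).
line=$(printf 'M\tAIRTRAN AIRWAYS\t8\tTRIPS_YTD')

# Default FS: split on any whitespace, so the name is cut at the space.
default_fs=$(printf '%s\n' "$line" | awk '{ print $2 }')

# Tab FS: the whole second column is kept intact.
tab_fs=$(printf '%s\n' "$line" | awk -F'\t' '{ print $2 }')

echo "default FS: [$default_fs]"
echo "tab FS:     [$tab_fs]"
```

With the default separator the comparison is effectively a first-word match, which is why AIRTRANAIRWAYS written as one word would not pass.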

madhunk · May 23, 2006, 11:13am

Thank You again vgersh99...

I did create two test files and it works if we have an example like that. It doesn't work if there is no space between AIRTRAN and AIRWAYS.
Example: AIRTRANAIRWAYS.

The files we get are really really bad. I am a little scared in case I get a Metadata file with AIRTRANAIRWAYS.

Below is the final script that I have...

# Check for correct number of parameters

if [ $# -ne 3 ]
then
   echo " "
   echo " Incorrect number of parameters entered..."
   echo " Correct usage: " $0 "<DIR> <DETAIL FILE> <METADATA FILE>"
   echo " "
   exit 1
fi

#-------------------------------------------------------------
#    Initialize variables
#-------------------------------------------------------------

DIR=$1
DETAIL_FILE=$2
METADATA_FILE=$3

#-------------------------------------------------------------
#    Check for the existence of the Detail and Metadata files
#-------------------------------------------------------------
cd "${DIR}" || { echo "\tError: cannot change to directory ${DIR}, Aborting!"; exit 1; }

if [ -r ${DETAIL_FILE} ]; then
   echo "\tDetail File ${DETAIL_FILE} to be compared found"
else
   echo "\tError: Detail File ${DETAIL_FILE} was not found, Aborting!"
   echo " "
   exit 1
fi

if [ -r ${METADATA_FILE} ]; then
   echo "\tMetadata File ${METADATA_FILE} to be compared found"
else
   echo "\tError: Metadata File ${METADATA_FILE} was not found, Aborting!"
   echo " "
   exit 1
fi

#-------------------------------------------------------------
#    Compare both files for partner names
#-------------------------------------------------------------
time {
nawk '
  FNR==NR{
     detail[$2]
     next
  }
  { 
     if ( $2 in detail)
        printf("Metadata partner name [%s] found in Detail-- Successful\n",  $2)
     else {
        printf("Metadata partner name [%s] NOT found in Detail-- Failed\n",  $2)
        _ex=1
     }
  }
  END { exit(_ex)}' ${DETAIL_FILE} ${METADATA_FILE}  
}

vgersh99 · May 23, 2006, 2:27pm

If you can treat your Metadata file as the 'authoritative' source of metadata definitions, and your 'detailedData' file as the one that can vary....

#!/usr/bin/ksh

nawk '
  FNR==NR{
     detail[$2]
     next
  }
  {
     for( i in detail)
       if ( substr($2, 1, length(i)) == i ) {
          printf("Meta [%s] found in Detail-- Successful\n",  $2)
          next
       }
     printf("Meta [%s] NOT found in Detail-- Failed\n",  $2)
     _ex=1
  }
  END { exit(_ex)}' DetailFile.txt MetadataFile.txt

madhunk · May 24, 2006, 1:26pm

Thank you vgersh99...

I have been working with the script for a couple of days and it was working fine...

A new file came in today and the script did not abort. The reason is

Metadata Records has:

Metadata partner name [ORBITZ] found in Detail-- Successful
Metadata partner name [AIRTRAN] found in Detail-- Successful
Metadata partner name [FRONTIER] found in Detail-- Successful
Metadata partner name [BESTWESTERN] found in Detail-- Successful

But the Detail Records has:

ORBITZ
AIRTRAN
FRONTIER
BESTWESTERN
MIDWEST

There were additional records for MIDWEST. Is there any way that the script can be modified to accommodate this enhancement?

If not present in Metadata records, but present in Detail -- the script should abort..

Please advise...

vgersh99 · May 24, 2006, 2:20pm

OK, but there was no METAdata record for 'MIDWEST'. The task was: find ONLY the METAdata records for which there was a corresponding record in the DETAIL file.

I don't understand what you're asking.....
I suggest you take the most recent version of what's been implemented already, try to understand it and figure out how to adjust it based on your varying input data patterns.

madhunk · May 25, 2006, 1:39pm

I did play with the script and tried to change it...

In your script before, it compares the metadata file with the detail file.

There was a change in the requirement and I wanted to use the detail file as the standard and compare it with the metadata file. I did change the order of the files when calling the script.

Somehow something is going wrong....

vgersh99 · May 25, 2006, 2:29pm

So the 'new' requirement is: if $2 in the 'detail' file does NOT appear as $2 in the 'metadata' file - then abort?

madhunk · May 25, 2006, 2:38pm

Yes....I tried to switch the order of the files in calling -- like this

nawk '
  FNR==NR{
     detail[$2]
     next
  }
  {
     if ( $2 in detail)
        printf("Metadata partner name [%s] found in Detail-- Successful\n",  $2)
     else {
        printf("Metadata partner name [%s] NOT found in Detail-- Failed\n",  $2)
        _ex=1
     }
  }
  END { exit(_ex)}' ${METADATA_FILE} ${DETAIL_FILE}

The problem is: I only have 5 records in Metadata file but I have 13 Million in the Detail file.

If $2 is there in the Detail file but not in the Metadata file, then I am getting this huge output of all the records.
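
One way to avoid the flood of output, sketched here under the assumption that only the distinct missing names matter, is to remember each name that has already been reported and print it once. The sample files and the reported array are illustrative, and the comparison is on the tab-delimited second column.

```shell
workdir=$(mktemp -d)
printf 'M\tORBITZ\t8\nM\tAIRTRAN\t8\n' > "$workdir/meta.txt"
printf 'D\tORBITZ\tA\nD\tMIDWEST\tB\nD\tAIRTRAN\tC\nD\tMIDWEST\tD\n' > "$workdir/detail.txt"

rc=0
report=$(awk -F'\t' '
  # First file (metadata): remember every partner name.
  FNR == NR { meta[$2]; next }
  # Second file (detail): report a name only the first time it is missing.
  !($2 in meta) && !($2 in reported) {
    reported[$2]
    printf("Detail partner name [%s] NOT found in Metadata -- Failed\n", $2)
    bad = 1
  }
  END { exit bad + 0 }
' "$workdir/meta.txt" "$workdir/detail.txt") || rc=$?

echo "$report"
echo "exit status: $rc"
```

Successful matches are not printed at all here, so the output stays at one line per unknown partner no matter how many detail records there are, and the non-zero exit status can drive the abort in the calling ksh script.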