I have a HUGE tab-delimited file (13 million records) containing Detail, Metadata and Summary records.
A sample of the file looks like this:
M BESTWESTERN 4 ACTIVITY_CNT_L12 A 3
M AIRTRAN 4 ACTIVITY_CNT_L12 A 3
D BESTWESTERN FIRSTNAME LASTNAME 209 N SANBORN AVE
D BESTWESTERN FIRSTNAME LASTNAME 6997 COUNTY ROAD D
D AIRTRAN FIRSTNAME LASTNAME 6997 COUNTY ROAD D
S BESTWESTERN 2
S AIRTRAN 2
I have split the file into three different files.
Metadata file
Detail file
Summary file
The challenge is to check that the names in the Metadata records also exist in the Detail record file. The names are not constant and WILL change with every incoming file.
1) The script needs to dynamically read the column in the Metadata record file that contains, for example, 'BESTWESTERN' and 'AIRTRAN', and make sure each name also exists in the Detail record file.
This is a huge file, so I need to know the fastest way to process it.
What is the best way to approach this dynamically changing file?
Please advise...
The best option I could find is to do it this way...
#!/usr/bin/ksh
# put the second column into a file,
# make it unique values
awk '{ print $2 } ' file1 | sort -u > patternfile
# loop thru the patterns from file1
# look for them in file2, print patterns if not found
while read pattern1 pattern2
do
if [ ${#pattern2} -eq 0 ]; then # skip when pattern2 isn't there
continue
fi
grep "$pattern1" file2 | grep -q "$pattern2"
if [ $? -ne 0 ]; then
echo "$pattern1" "$pattern2"
fi
done < patternfile
If I change this script so that it aborts when it doesn't find the pattern, would that be fine?
Thank you for the message. I am not sure if I have communicated correctly. But I am looking to do something like this:
Steps:
1) Put second column from file 1 (tab delimited Metadata file) into a pattern file.
2) Count the number of patterns and print the patterns.
3) Loop through the pattern file from file1 and look for those patterns in file2 (tab delimited Detail records file).
4) If a pattern is not found in file2, print that pattern and abort.
I could do something like this, but I am going wrong somewhere. Any ideas will be very much appreciated.
#!/usr/bin/ksh
# put the second column into a file,
# make it unique values
awk '{ print $2 } ' file1 | sort -u > patternfile
# loop thru the patterns from file1
# look for them in file2
while read pattern1
do
/usr/xpg4/bin/grep -q "$pattern1" file2
rc=$?
if [ ${rc} -eq 0 ]; then
echo "Pattern found in file2 -- Successful"
else
echo "Pattern $pattern1 not found in file2, Failed"
exit 1 # abort on the first missing pattern
fi
done < patternfile
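The steps listed above could be sketched roughly as follows. This is only a minimal sketch under some assumptions: the files are tab delimited (so `cut -f2` picks the name column), `mktemp -d` is available, and the two small sample files are made up so the example runs on its own. It also compares each pattern against the unique names from column 2 of file2 rather than grepping the whole Detail file once per pattern, which is a design choice of this sketch, not something from the original script. The check is wrapped in a hypothetical function `check_patterns` so the status can be inspected; in a real script you would simply `exit 1` at the point of failure.

```shell
# Hypothetical sample files, created only so the sketch can be run as-is.
workdir=$(mktemp -d)
printf 'M\tORBITZ\t8\nM\tAIRTRAN\t8\nM\tMIDWEST\t5\n' > "$workdir/file1"
printf 'D\tORBITZ\tFIRST\tLAST\nD\tAIRTRAN\tFIRST\tLAST\n' > "$workdir/file2"

# Step 1: unique names from column 2 of the Metadata file.
cut -f2 "$workdir/file1" | sort -u > "$workdir/patternfile"

# Step 2: count the patterns and print them.
pattern_count=$(wc -l < "$workdir/patternfile" | tr -d ' ')
echo "Patterns to check: $pattern_count"
cat "$workdir/patternfile"

# Steps 3 and 4: look up each pattern; stop at the first one that is missing.
check_patterns() {
    # $1 = pattern file, $2 = Detail file
    # Read the big Detail file once and keep only its unique names.
    cut -f2 "$2" | sort -u > "$workdir/detailnames"
    while read -r pattern; do
        # -F fixed string, -x whole line, -q no output (exit status only)
        if ! grep -Fxq -- "$pattern" "$workdir/detailnames"; then
            echo "Pattern $pattern not found in Detail file, Failed"
            return 1
        fi
    done < "$1"
    echo "All patterns found in Detail file, Successful"
}

check_output=$(check_patterns "$workdir/patternfile" "$workdir/file2")
check_status=$?
echo "$check_output"
echo "Exit status: $check_status"
rm -rf "$workdir"
```

With these sample files MIDWEST is in the Metadata file but not in the Detail file, so the check stops there with a non-zero status.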
Why do you need to count them and why do you need to print them?
Hmmm... I thought there was just ONE file.
Now you're saying there are TWO files?
It might be a good idea to post sample INPUT file(s) [if there are multiple] and, instead of outlining the algorithm, outline what needs to be done AND a sample end result given the sample input file(s).
Also, please use vB codes when posting code and quotes; it makes reading the post much easier.
The Metadata File has names such as ORBITZ, BESTWESTERN and so on. They should also exist in the Detail File. A comparison needs to be made. If they don't exist, the script should fail.
The current code cuts the names and puts them into a temporary file. Then it loops and checks for the existence of these names in the Detail file. If any of the names doesn't exist, the script should abort.
I am getting confused about the looping process here. Is this the right way to work through the solution?
Moreover, the detail file in reality has 13 million records.
#!/usr/bin/ksh
# put the second column into a file,
# make it unique values
awk '{ print $2 } ' file1 | sort -u > patternfile
# loop thru the patterns from file1
# look for them in file2
while read pattern1
do
grep "$pattern1" file2 > /dev/null # discard the matching lines; only the exit status matters
rc=$?
if [ ${rc} -eq 0 ]; then
echo "Pattern found in file2 -- Successful"
else
echo "Pattern $pattern1 not found in file2, Failed"
exit 1 # abort on the first missing name
fi
done < patternfile
Also attached are the sample files...
Sample Metadata file
M ORBITZ 8 LAST_BOOKED_DATE D
M AIRTRAN 8 TRIPS_YTD A 11
M FRONTIER 5 FLT_COUNT N
M CAESAR 7 DAYSPLAYED A 9
M BESTWESTERN 4 ACTIVITY_CNT_L12 A
Sample Detail file
D BESTWESTERN FIRST LAST 10545 WILLOWS RD NE
D ORBITZ FIRST LAST 550 N CENTRAL ROWIE AZ
D AIRTRAN FIRST LAST 6755B WILLOW BROOK PARK # P
D FRONTIER FIRST LASTNAME PO BOX 370
D CAESAR FIRST LAST 2113 CRIMSCENDDR # 10
Thanks again....I did follow your code and added it into the .ksh script.
nawk '
FNR==NR{
detail[$2]
next
}
{
printf("Metadata Partner Name [%s] %s found in Detail File-- %s\n", $2, ($2 in detail) ? "" : "NOT", ($2 in detail) ? "Successful" : "Failed")
}' ${DETAIL_FILE} ${METADATA_FILE}
RC=$?
if [ $RC -ne 0 ]
then
echo "*** Comparison Failed. Aborting Script... ***"
exit $RC
else
echo "*** Comparison Completed ***"
echo "*** Partner Files compared Successfully ***"
fi
It does not abort even though it did not find the name...
Detail File file2 to be compared found
Metadata Partner Name [ORBITZ] found in Detail File-- Successful
Metadata Partner Name [AIRTRAN] found in Detail File-- Successful
Metadata Partner Name [FRONTIER] found in Detail File-- Successful
Metadata Partner Name [CAESAR] found in Detail File-- Successful
Metadata Partner Name [BEST] NOT found in Detail File-- Failed
*** Comparison Completed ***
*** Partner Files compared Successfully ***
Please advise how to abort the flow if it fails to find a name.
#!/usr/bin/ksh
nawk '
FNR==NR{
detail[$2]
next
}
{
if ( $2 in detail)
printf("Meta [%s] found in Detail-- Successful\n", $2)
else {
printf("Meta [%s] NOT found in Detail-- Failed\n", $2)
_ex=1
}
}
END { exit(_ex)}' DetailFile.txt MetadataFile.txt
I have tested the script on a file that has 13 million records and it took 2.5 minutes.
Just another quick question...
Is there any way to enhance the script?
For example: The Metadata file has name 'AIRTRAN AIRWAYS' but in the Detail File it is
listed as 'AIRTRAN'. Can we make this a pass rather than failure?
When comparing the Metadata names with Detail record names, it should pass on these
conditions.
Can we create a control file like:
AIRTRAN: AIRTRAN AIRWAYS, AIRTRAN, AIRTRAN AIR
MIDWEST: MIDWEST AIRLINES, MIDWEST AIR
and look this up and pass the script....If it doesn't find anything related, the script should be aborted.
I am not sure how I can do this....If you have any idea, please let me know.
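One way the control file idea could look is sketched below. Treat it as a rough sketch, not a tested production script: it assumes the control file uses exactly the `CANONICAL: alias1, alias2, ...` layout shown above, that the data files are tab delimited, and that `awk` on your system behaves like `nawk` (on Solaris you would swap in `nawk`). The file names, the `filenum` counter and the `alias` array are made up for the example. Each alias is mapped to its canonical name, and the Metadata name is translated through that map before the lookup in the Detail names.

```shell
# Hypothetical sample files, created only so the sketch can be run as-is.
workdir=$(mktemp -d)
cat > "$workdir/control.txt" <<'EOF'
AIRTRAN: AIRTRAN AIRWAYS, AIRTRAN, AIRTRAN AIR
MIDWEST: MIDWEST AIRLINES, MIDWEST AIR
EOF
printf 'D\tAIRTRAN\tFIRST\tLAST\nD\tORBITZ\tFIRST\tLAST\n' > "$workdir/detail.txt"
printf 'M\tAIRTRAN AIRWAYS\t8\nM\tORBITZ\t8\nM\tMIDWEST AIR\t5\n' > "$workdir/meta.txt"

alias_output=$(awk -F'\t' '
FNR == 1 { filenum++ }            # count which input file we are in
filenum == 1 {                    # control file: CANONICAL: alias, alias, ...
    colon = index($0, ":")
    if (colon == 0) next
    canonical = substr($0, 1, colon - 1)
    n = split(substr($0, colon + 1), names, ",")
    for (k = 1; k <= n; k++) {
        gsub(/^[ \t]+|[ \t]+$/, "", names[k])   # trim blanks around each alias
        alias[names[k]] = canonical
    }
    next
}
filenum == 2 { detail[$2]; next } # Detail file: remember every name seen
{                                 # Metadata file: translate, then look up
    name = ($2 in alias) ? alias[$2] : $2
    if (name in detail)
        printf("Meta [%s] found in Detail as [%s]-- Successful\n", $2, name)
    else {
        printf("Meta [%s] NOT found in Detail-- Failed\n", $2)
        failed = 1
    }
}
END { exit failed + 0 }' "$workdir/control.txt" "$workdir/detail.txt" "$workdir/meta.txt")
alias_status=$?
echo "$alias_output"
echo "Exit status: $alias_status"
rm -rf "$workdir"
```

In this sample, 'AIRTRAN AIRWAYS' passes because the control file maps it to AIRTRAN, while 'MIDWEST AIR' maps to MIDWEST, which has no Detail records, so the exit status is non-zero. A name like AIRTRANAIRWAYS would pass only if it is listed as an alias in the control file.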
The above works fine as long as the name in the METAfile starts with same name as it appears in the DETAIL file: 'AIRTRAN AIRWAYS' in metaFile; 'AIRTRAN' in detailFile.
I did create two test files and it works if we have an example like that. It doesn't work if there is no space between AIRTRAN and AIRWAYS.
Example: AIRTRANAIRWAYS.
The files we get are really, really bad. I am a little scared in case I get a Metadata file with AIRTRANAIRWAYS.
Below is the final script that I have...
# Check for correct number of parameters
if [ $# -ne 3 ]
then
echo " "
echo " Incorrect number of parameters entered..."
echo " Correct usage: " $0 "<DIR> <DETAIL FILE> <METADATA FILE>"
echo " "
exit 1
fi
#-------------------------------------------------------------
# Initialize variables
#-------------------------------------------------------------
DIR=$1
DETAIL_FILE=$2
METADATA_FILE=$3
#-------------------------------------------------------------
# Check for the existence of the Detail and Metadata files
#-------------------------------------------------------------
cd "${DIR}" || exit 1 # stop here if the directory cannot be entered
if [ -r ${DETAIL_FILE} ]; then
echo "\tDetail File ${DETAIL_FILE} to be compared found"
else
echo "\tError: Detail File ${DETAIL_FILE} was not found, Aborting!"
echo " "
exit 1
fi
if [ -r ${METADATA_FILE} ]; then
echo "\tMetadata File ${METADATA_FILE} to be compared found"
else
echo "\tError: Metadata File ${METADATA_FILE} was not found, Aborting!"
echo " "
exit 1
fi
#-------------------------------------------------------------
# Compare both files for partner names
#-------------------------------------------------------------
time {
nawk '
FNR==NR{
detail[$2]
next
}
{
if ( $2 in detail)
printf("Metadata partner name [%s] found in Detail-- Successful\n", $2)
else {
printf("Metadata partner name [%s] NOT found in Detail-- Failed\n", $2)
_ex=1
}
}
END { exit(_ex)}' ${DETAIL_FILE} ${METADATA_FILE}
}
If you can treat your Metadata file as the 'authoritative' source of metadata definitions AND your Detail file as the one that can vary...
#!/usr/bin/ksh
nawk '
FNR==NR{
detail[$2]
next
}
{
for( i in detail)
if ( substr($2, 1, length(i)) == i ) {
printf("Meta [%s] found in Detail-- Successful\n", $2)
next
}
printf("Meta [%s] NOT found in Detail-- Failed\n", $2)
_ex=1
}
END { exit(_ex)}' DetailFile.txt MetadataFile.txt
I have been working with the script for a couple of days and it was working fine...
A new file came in today and the script did not abort. The reason is:
The Metadata records have:
Metadata partner name [ORBITZ] found in Detail-- Successful
Metadata partner name [AIRTRAN] found in Detail-- Successful
Metadata partner name [FRONTIER] found in Detail-- Successful
Metadata partner name [BESTWESTERN] found in Detail-- Successful
But the Detail records have:
ORBITZ
AIRTRAN
FRONTIER
BESTWESTERN
MIDWEST
There were additional records for MIDWEST. Is there any way that the script can be modified to accommodate this enhancement?
If a name is not present in the Metadata records but is present in Detail, the script should abort.
OK, but there was no METAdata record for 'MIDWEST'. The task was: find ONLY the METAdata records for which there was a corresponding record in the DETAIL file.
I don't understand what you're asking.....
I suggest you take the most recent version of what's been implemented already, try to understand it and figure out how to adjust it based on your varying input data patterns.
I did play with the script and tried to change it...
Your earlier script compares the metadata file with the detail file.
There was a change in the requirement and I wanted to use the detail file as the standard and compare it with the metadata file. I did change the order of the files when calling the script.
Yes, I tried switching the order of the files in the call, like this:
nawk '
FNR==NR{
detail[$2]
next
}
{
if ( $2 in detail)
printf("Metadata partner name [%s] found in Detail-- Successful\n", $2)
else {
printf("Metadata partner name [%s] NOT found in Detail-- Failed\n", $2)
_ex=1
}
}
END { exit(_ex)}' ${METADATA_FILE} ${DETAIL_FILE}
The problem is: I only have 5 records in the Metadata file but I have 13 million in the Detail file.
If $2 is in the Detail file but not in the Metadata file, I get a huge output with one line for every such record.
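One possible way to keep the output small is to remember which missing names have already been reported and print each one only once. The sketch below assumes tab-delimited files and an `awk` that behaves like `nawk`; the sample files and the `meta`, `reported` and `failed` names are made up for illustration. The Metadata file is read first (it is the small one), then every Detail record is checked against it.

```shell
# Hypothetical sample files, created only so the sketch can be run as-is.
workdir=$(mktemp -d)
printf 'M\tORBITZ\t8\nM\tAIRTRAN\t8\n' > "$workdir/meta.txt"
printf 'D\tORBITZ\tA\nD\tMIDWEST\tB\nD\tAIRTRAN\tC\nD\tMIDWEST\tD\n' > "$workdir/detail.txt"

reverse_output=$(awk -F'\t' '
FNR == NR { meta[$2]; next }      # first file: remember the Metadata names
!($2 in meta) && !($2 in reported) {
    reported[$2]                  # mark the name so it is printed only once
    printf("Detail partner name [%s] NOT found in Metadata-- Failed\n", $2)
    failed = 1
}
END { exit failed + 0 }' "$workdir/meta.txt" "$workdir/detail.txt")
reverse_status=$?
echo "$reverse_output"
echo "Exit status: $reverse_status"
rm -rf "$workdir"
```

Here MIDWEST appears twice in the sample Detail file but is reported once, and the non-zero exit status can be used to abort the calling script. If both directions need checking, this could be run alongside the original Metadata-against-Detail comparison.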