Difference of two data files & writing to an outfile.

filter · June 13, 2011, 11:31am

Hi Everyone,
I have two files i.e. one file2 contains today's data and the other file1 contains Yesterday's data.
The data in the files contains 226 columns and the data for the coulums separated by a Pipe "|" delimiter.
Now, I have 4 Primary keys (coulumns) by which I have to compare file2 and file1 and generate a diff file that is the data that is not present in file2 but present in file1 and vice versa.
The sample of the data that contains in file1 is ( 1 record):

START-OF-FILE
FILENAME=fixedincome_bo_namr.dif
DATA=bo
REGION=namr
TYPE=dif
PROGRAMNAME=getdata
DATEFORMAT=yyyymmdd
START-OF-FIELDS
# Security description
TICKER
CPN
MATURITY
SERIES
NAME
SHORT_NAME
ISSUER_INDUSTRY
MARKET_SECTOR_DES
CPN_FREQ
CPN_TYP
MTY_TYP
CALC_TYP_DES
DAY_CNT
MARKET_ISSUE
COUNTRY
CRNCY
COLLAT_TYP
AMT_ISSUED
AMT_OUTSTANDING
MIN_PIECE
MIN_INCREMENT
PAR_AMT
LEAD_MGR
EXCH_CODE
REDEMP_VAL
# Issuance information
ANNOUNCE_DT
FIRST_SETTLE_DT
FIRST_CPN_DT
INT_ACC_DT
ISSUE_DT
ISSUE_PX
# Identifiers
ID_EUROCLEAR
ID_XTRAKTER
ID_SEDOL1
ID_SEDOL2
ID_CEDEL
ID_WERTPAPIER
ID_ISIN
ID_DUTCH
ID_VALOREN
ID_FRENCH
ID_COMMON
ID_JAPAN
ID_BELGIUM
ID_DANISH
ID_AUSTRIAN
ID_LUXEMBOURG
ID_SWEDISH
ID_NORWAY
ID_JAPAN_COMPANY
ID_SPAIN
ID_ITALY
ID_BB_COMPANY
ID_BB_SECURITY
ID_CUSIP
# Schedules
NXT_CALL_DT
NXT_CALL_PX
NXT_PAR_CALL_DT
NXT_PUT_DT
NXT_PUT_PX
NXT_PAR_PUT_DT
NXT_CPN_DT
NXT_SINK_DT
NXT_REFUND_DT
# Ratings
RTG_JCR
# Floaters
REFIX_FREQ
NXT_REFIX_DT
RESET_IDX
# Inflation-indexed bonds
# Preferreds
PFD_DVD_PAY_DT
PFD_EX_DVD_DT
PFD_RST_DVD
# Convertibles
CV_COMMON_TICKER
CV_COMMON_TICKER_EXCH
CV_CNVS_RATIO
CV_UNTIL
CV_CNVS_FEXCH_RT
CV_PROV_PX
CALLABLE
SINKABLE
PUTABLE
ID_BB_PARENT_CO
PARENT_COMP_NAME
PARENT_COMP_TICKER
CNTRY_OF_INCORPORATION
BASIC_SPREAD
INDUSTRY_SECTOR
INDUSTRY_GROUP
INDUSTRY_SUBGROUP
COUNTRY_GUARANTOR
CNTRY_OF_DOMICILE
SECURITY_DES
144A_FLAG
FLT_BENCH_MULTIPLIER
SECURITY_TYP
CV_START_DT
GUARANTOR_NAME
PREV_CPN_DT
NXT_SINK_AMT
CALL_DISCRETE
PUT_DISCRETE
MAKE_WHOLE_CALL
ID_BB_UNIQUE
LONG_COMP_NAME
REDEMP_CRNCY
CPN_CRNCY
DTC_ELIGIBLE
STRUCTURED_NOTE
PCT_PAR_QUOTED
PCS_QUOTE_TYP
IS_UNIT_TRADED
IS_REVERSE_CONVERTIBLE
TRADE_CRNCY
BEARER
REGISTERED
CALLED
CALLED_DT
ISSUER
CALL_FEATURE
PUT_FEATURE
PENULTIMATE_CPN_DT
FLT_CPN_CONVENTION
CUR_CPN
FLOATER
TRADE_STATUS
CDR_COUNTRY_CODE
CDR_SETTLE_CODE
SEASONING_STATUS
FINAL_MATURITY
PRVT_PLACE
CALC_TYP
REMOVAL_REASON
IS_PERPETUAL
IS_REG_S
CALLED_PX
DEFAULTED
GILTS_EX_DVD_DT
MOST_RECENT_REPORTED_FACTOR
NXT_FACTOR_DT
OID_BOND
DELIVERY_TYP
ID_SEDOL3
ID_SEDOL4
ID_SEDOL5
SEDOL1_COUNTRY_ISO
SEDOL2_COUNTRY_ISO
SEDOL3_COUNTRY_ISO
SEDOL4_COUNTRY_ISO
SEDOL5_COUNTRY_ISO
ID_MIC1
ID_MIC2
ID_MIC3
ID_MIC4
ID_MIC5
CV_SH_PAR
DUAL_CRNCY
EXTENDIBLE
EXCHANGEABLE
IS_SOFT_CALL
CV_MANDATORY_CNVS
EU_SAVINGS_DIRECTIVE
ID_CUSIP_REAL
INDUSTRY_SUBGROUP_NUM
SECURITY_TYP2
LAST_REFIX_DT
ISO_COUNTRY_GUARANTOR
DTC_REGISTERED
CALL_PARTIAL
CV_CNVS_PX
IS_CURRENT_GOVT
FIRST_CALL_DT_ISSUANCE
UNDL_ID_BB_UNIQUE
EST_CPN_FLAG
ID_BB_GUARANTOR
IS_DAY_PAYER
STEPUP_CPN
STEPUP_DT
CALC_MATURITY
ID_EXCH_SYMBOL
CREDIT_ENHANCEMENTS
INSURANCE_STATUS
JUNIOR
SENIOR
FLT_PAY_DAY
FLT_DAYS_PRIOR
INFLATION_LINKED_INDICATOR
DAYS_TO_SETTLE
TYPE_OF_BOND
REFERENCE_INDEX
BASE_CPI
CFI_CODE
CPN_FREQ_YLD_CNV
DAY_PAYER_FREQ
EX_DIV_DAYS
EX_DIV_CALENDAR
CONTINGENT_CONVERSION
CONTRIB_DATA_INDICATOR
SECURITY_FACTORABLE
ID_BB_GLOBAL
END-OF-FIELDS
TIMESTARTED=Wed Jun  8 18:33:51 EDT 2011
  230 START-OF-DATA
  231 CP5101987 Corp|-1|198|C|7.300000|20110607|EAB|EUROPEAN AMERICAN BANK|LNB-CALL12/97|BANK|Corp|2|STEP CPN|CALLABLE|MULTI-STEP BOND|1|US DOMESTIC|US|USD|DE      POSIT NOTES|760000.00|.00|10000.0000|1000.0000|1000.00|LAS-sole|NOT LISTED|100.00000|19960607|19960607|19961207|19960607|19960607|100.000000| | | | | |       |US29874AZZ55| | | | | | | | | | | | | | |225433|500231|29874AZZ5| | | | | | | | | |N.A.| | | | | | | | | | | | |Y|N|N|197879|Citibank NA|8156Z|US| |Fin      ancial|Banks|Money Center Banks|N.A.|US|C 7.3 06/07/11|N| |US DOMESTIC| |N.A.| | |Y| |N|COCP5101987|European American Bank|USD|USD|Y|Y|Y|1|N|N|USD|N|N|Y      |19971207|EUROPEAN AMERICAN BANK|Semi-Annual| |19970607| | |N|N|US|US|Does Not Apply|20110607|N|421|MATURED|N|N|100.000000|N| |.000000000| |N|DTC| | | |      N.A.|N.A.|N.A.|N.A.|N.A.| | | | | | |N|N|N|N| |Grandfathered|29874AZZ5|145| | |N.A.|N|FULL (ONLY)| |N|19970607| | | |N| | |20110607| | |N|N|N| | |N|3| |       | |N.A.|2| | | | |N|N|BBG00048KQJ7|

The sample of the data that contains in file2 (Todays file) is ( 1 record):

START-OF-FILE
FILENAME=fixedincome_bo_namr.dif
DATA=bo
REGION=namr
TYPE=dif
PROGRAMNAME=getdata
DATEFORMAT=yyyymmdd
START-OF-FIELDS
# Security description
TICKER
CPN
MATURITY
SERIES
NAME
SHORT_NAME
ISSUER_INDUSTRY
MARKET_SECTOR_DES
CPN_FREQ
CPN_TYP
MTY_TYP
CALC_TYP_DES
DAY_CNT
MARKET_ISSUE
COUNTRY
CRNCY
COLLAT_TYP
AMT_ISSUED
AMT_OUTSTANDING
MIN_PIECE
MIN_INCREMENT
PAR_AMT
LEAD_MGR
EXCH_CODE
REDEMP_VAL
# Issuance information
ANNOUNCE_DT
FIRST_SETTLE_DT
FIRST_CPN_DT
INT_ACC_DT
ISSUE_DT
ISSUE_PX
# Identifiers
ID_EUROCLEAR
ID_XTRAKTER
ID_SEDOL1
ID_SEDOL2
ID_CEDEL
ID_WERTPAPIER
ID_ISIN
ID_DUTCH
ID_VALOREN
ID_FRENCH
ID_COMMON
ID_JAPAN
ID_BELGIUM
ID_DANISH
ID_AUSTRIAN
ID_LUXEMBOURG
ID_SWEDISH
ID_NORWAY
ID_JAPAN_COMPANY
ID_SPAIN
ID_ITALY
ID_BB_COMPANY
ID_BB_SECURITY
ID_CUSIP
# Schedules
NXT_CALL_DT
NXT_CALL_PX
NXT_PAR_CALL_DT
NXT_PUT_DT
NXT_PUT_PX
NXT_PAR_PUT_DT
NXT_CPN_DT
NXT_SINK_DT
NXT_REFUND_DT
# Ratings
RTG_JCR
# Floaters
REFIX_FREQ
NXT_REFIX_DT
RESET_IDX
# Inflation-indexed bonds
# Preferreds
PFD_DVD_PAY_DT
PFD_EX_DVD_DT
PFD_RST_DVD
# Convertibles
CV_COMMON_TICKER
CV_COMMON_TICKER_EXCH
CV_CNVS_RATIO
CV_UNTIL
CV_CNVS_FEXCH_RT
CV_PROV_PX
CALLABLE
SINKABLE
PUTABLE
ID_BB_PARENT_CO
PARENT_COMP_NAME
PARENT_COMP_TICKER
CNTRY_OF_INCORPORATION
BASIC_SPREAD
INDUSTRY_SECTOR
INDUSTRY_GROUP
INDUSTRY_SUBGROUP
COUNTRY_GUARANTOR
CNTRY_OF_DOMICILE
SECURITY_DES
144A_FLAG
FLT_BENCH_MULTIPLIER
SECURITY_TYP
CV_START_DT
GUARANTOR_NAME
PREV_CPN_DT
NXT_SINK_AMT
CALL_DISCRETE
PUT_DISCRETE
MAKE_WHOLE_CALL
ID_BB_UNIQUE
LONG_COMP_NAME
REDEMP_CRNCY
CPN_CRNCY
DTC_ELIGIBLE
STRUCTURED_NOTE
PCT_PAR_QUOTED
PCS_QUOTE_TYP
IS_UNIT_TRADED
IS_REVERSE_CONVERTIBLE
TRADE_CRNCY
BEARER
REGISTERED
CALLED
CALLED_DT
ISSUER
CALL_FEATURE
PUT_FEATURE
PENULTIMATE_CPN_DT
FLT_CPN_CONVENTION
CUR_CPN
FLOATER
TRADE_STATUS
CDR_COUNTRY_CODE
CDR_SETTLE_CODE
SEASONING_STATUS
FINAL_MATURITY
PRVT_PLACE
CALC_TYP
REMOVAL_REASON
IS_PERPETUAL
IS_REG_S
CALLED_PX
DEFAULTED
GILTS_EX_DVD_DT
MOST_RECENT_REPORTED_FACTOR
NXT_FACTOR_DT
OID_BOND
DELIVERY_TYP
ID_SEDOL3
ID_SEDOL4
ID_SEDOL5
SEDOL1_COUNTRY_ISO
SEDOL2_COUNTRY_ISO
SEDOL3_COUNTRY_ISO
SEDOL4_COUNTRY_ISO
SEDOL5_COUNTRY_ISO
ID_MIC1
ID_MIC2
ID_MIC3
ID_MIC4
ID_MIC5
CV_SH_PAR
DUAL_CRNCY
EXTENDIBLE
EXCHANGEABLE
IS_SOFT_CALL
CV_MANDATORY_CNVS
EU_SAVINGS_DIRECTIVE
ID_CUSIP_REAL
INDUSTRY_SUBGROUP_NUM
SECURITY_TYP2
LAST_REFIX_DT
ISO_COUNTRY_GUARANTOR
DTC_REGISTERED
CALL_PARTIAL
CV_CNVS_PX
IS_CURRENT_GOVT
FIRST_CALL_DT_ISSUANCE
UNDL_ID_BB_UNIQUE
EST_CPN_FLAG
ID_BB_GUARANTOR
IS_DAY_PAYER
STEPUP_CPN
STEPUP_DT
CALC_MATURITY
ID_EXCH_SYMBOL
CREDIT_ENHANCEMENTS
INSURANCE_STATUS
JUNIOR
SENIOR
FLT_PAY_DAY
FLT_DAYS_PRIOR
INFLATION_LINKED_INDICATOR
DAYS_TO_SETTLE
TYPE_OF_BOND
REFERENCE_INDEX
BASE_CPI
CFI_CODE
CPN_FREQ_YLD_CNV
DAY_PAYER_FREQ
EX_DIV_DAYS
EX_DIV_CALENDAR
CONTINGENT_CONVERSION
CONTRIB_DATA_INDICATOR
SECURITY_FACTORABLE
ID_BB_GLOBAL
END-OF-FIELDS
 TIMESTARTED=Thu Jun  9 18:34:19 EDT 2011
 230 START-OF-DATA
 231 9999X01M9 Govt|-1|198|XIB|0|20110707| |WI TSY BILL|WI TSY BILL|USGN|Govt| |NONE|NORMAL|DISCOUNT|2|US GOVT|US|USD| |31782000000|31782000000|100|100| | | |     100.00000| | |20110106| |20110609|0.005000| | | | | | | | | | | | | | | | | | | | | |349057|13714872|9127952X8| | | | | | | | | | | | | | | | | | | | | |      |N|N|N|218252|United States of America|3352Z|US| |Government|Sovereign|Sovereign| |US|XIB 07/07/11| | |US GOVERNMENT| | | | | | |N.A.|GV9999X01M9|United      States Treasury Bill - WI Post Auction|USD|USD| | |Y|2|N| |USD|N.A.|N.A.|N| |WI TSY BILL| | | | | |N|Y|US|US| |20110707|N.A.|5| | | | | | | | |Y| | | |      |N.A.|N.A.|N.A.|N.A.|N.A.| | | | | | | | |N.A.|N| |Non-Grandfathered|9999X01M9|459|Bill| | | | | |N| | | | |N| | |20110707| | | | | | | |N|1| | | | | | |      | | | |N|BBG001CSH9Y7|

Since I have 4 primary keys, I though of doing the code using Hashes.
Could someone please help me out. I am looking for a desperate help on this.
Really appreciated if someone could help out with some ideas.

getmmg · June 13, 2011, 11:45am

Do you want to find the differences file1 and file2 for the line starting with 231?
If yes, what are the keys?

filter · June 13, 2011, 12:14pm

Hi getmmg,

Yes. Thats correct I have to start finding the differences from line 231.

The primary keys are as follows:

ID_ISIN
ID_CUSIP
ID_BB_UNIQUE
SECURITY DES

Thanks a lot for your reply.

Shell_Life · June 13, 2011, 12:43pm

Your primary keys have the following sequence in the record:

49:ID_ISIN
66:ID_CUSIP
119:ID_BB_UNIQUE
188:UNDL_ID_BB_UNIQUE

To extract the primary key values from the file:

egrep '|' input_file | cut -d'|' -f49,66,119,188

Then you can compare the extracted values.

filter · June 13, 2011, 12:49pm

Hi Shell_life,

Thanks for your reply.

So once I extract the primary keys using your command, then I have to push them to a hash ( keys & values) and then start comparing ?

 
egrep '|' input_file | cut -d'|' -f49,66,119,188

Like The Outfile should contain, the records that exits in file1 but doesn't exist in file2 and vice versa.

I am sorry Could you please explain me a little bit more.

Shell_Life · June 13, 2011, 12:57pm

The result should be output to a file:

egrep '|' Input_Yesterday | cut -d'|' -f49,66,119,188 > File_Yesterday
egrep '|' Input_Today | cut -d'|' -f49,66,119,188 > File_Today

After having these two files, you can use either "diff" or "cmp" commands.

Choose which command works best for you.

filter · June 13, 2011, 1:14pm

Hi Shell_life,

Thanks a lot for your reply.

Once I get the two files after running your commands and then I would use the diff() command to write to an outfile.

But the outfile contains only four columns.

For example:

The below record exists in file1.

 231 CP5101987 Corp|-1|198|C|7.300000|20110607|EAB|EUROPEAN AMERICAN BANK|LNB-CALL12/97|BANK|Corp|2|STEP CPN|CALLABLE|MULTI-STEP BOND|1|US DOMESTIC|US|USD|DE      POSIT NOTES|760000.00|.00|10000.0000|1000.0000|1000.00|LAS-sole|NOT LISTED|100.00000|19960607|19960607|19961207|19960607|19960607|100.000000| | | | | |       |US29874AZZ55| | | | | | | | | | | | | | |225433|500231|29874AZZ5| | | | | | | | | |N.A.| | | | | | | | | | | | |Y|N|N|197879|Citibank NA|8156Z|US| |Fin      ancial|Banks|Money Center Banks|N.A.|US|C 7.3 06/07/11|N| |US DOMESTIC| |N.A.| | |Y| |N|COCP5101987|European American Bank|USD|USD|Y|Y|Y|1|N|N|USD|N|N|Y      |19971207|EUROPEAN AMERICAN BANK|Semi-Annual| |19970607| | |N|N|US|US|Does Not Apply|20110607|N|421|MATURED|N|N|100.000000|N| |.000000000| |N|DTC| | | |      N.A.|N.A.|N.A.|N.A.|N.A.| | | | | | |N|N|N|N| |Grandfathered|29874AZZ5|145| | |N.A.|N|FULL (ONLY)| |N|19970607| | | |N| | |20110607| | |N|N|N| | |N|3| |       | |N.A.|2| | | | |N|N|BBG00048KQJ7|

If the above record doesn't exists in file 2 ( by comapring the Primary keys ) then the complete record should write in the outfile.

So, after running the below script:

egrep '|' Input_Yesterday | cut -d'|' -f49,66,119,188 > File_Yesterday
egrep '|' Input_Today | cut -d'|' -f49,66,119,188 > File_Today

and after using the diff(), the put file contains only the four columns where as we need the entire record in the outfile.

Could you please suggest me how can I proceed further and whether this would be easier using Shell or perl.

rdcwayx · June 13, 2011, 8:31pm

Guess one file has only one record.

awk -F "|" 'NR==FNR&&/\|/ {A=$49 FS $66 FS $119 FS $188;B=$0;C=FILENAME}
            NR>FNR &&/\|/ {if (A==$49 FS $66 FS $119 FS $188) 
                             { print "4 primary keys are same in "  C " and " FILENAME }
                          else {print "4 primary keys are NOT same in "  C " and " FILENAME}
                          print ""
                          print "Record in file " C " : \n" B
                          print ""
                          print "Record in file " FILENAME " : \n" $0
                          } ' file1 file2

4 primary keys are NOT same in file1 and file2

Record in file file1 :
  231 CP5101987 Corp|-1|198|C|7.300000|20110607|EAB|EUROPEAN AMERICAN BANK|LNB-CALL12/97|BANK|Corp|2|STEP CPN|CALLABLE|MULTI-STEP BOND|1|US DOMESTIC|US|USD|DE      POSIT NOTES|760000.00|.00|10000.0000|1000.0000|1000.00|LAS-sole|NOT LISTED|100.00000|19960607|19960607|19961207|19960607|19960607|100.000000| | | | | |       |US29874AZZ55| | | | | | | | | | | | | | |225433|500231|29874AZZ5| | | | | | | | | |N.A.| | | | | | | | | | | | |Y|N|N|197879|Citibank NA|8156Z|US| |Fin      ancial|Banks|Money Center Banks|N.A.|US|C 7.3 06/07/11|N| |US DOMESTIC| |N.A.| | |Y| |N|COCP5101987|European American Bank|USD|USD|Y|Y|Y|1|N|N|USD|N|N|Y      |19971207|EUROPEAN AMERICAN BANK|Semi-Annual| |19970607| | |N|N|US|US|Does Not Apply|20110607|N|421|MATURED|N|N|100.000000|N| |.000000000| |N|DTC| | | |      N.A.|N.A.|N.A.|N.A.|N.A.| | | | | | |N|N|N|N| |Grandfathered|29874AZZ5|145| | |N.A.|N|FULL (ONLY)| |N|19970607| | | |N| | |20110607| | |N|N|N| | |N|3| |       | |N.A.|2| | | | |N|N|BBG00048KQJ7|

Record in file file2 :
 231 9999X01M9 Govt|-1|198|XIB|0|20110707| |WI TSY BILL|WI TSY BILL|USGN|Govt| |NONE|NORMAL|DISCOUNT|2|US GOVT|US|USD| |31782000000|31782000000|100|100| | | |     100.00000| | |20110106| |20110609|0.005000| | | | | | | | | | | | | | | | | | | | | |349057|13714872|9127952X8| | | | | | | | | | | | | | | | | | | | | |      |N|N|N|218252|United States of America|3352Z|US| |Government|Sovereign|Sovereign| |US|XIB 07/07/11| | |US GOVERNMENT| | | | | | |N.A.|GV9999X01M9|United      States Treasury Bill - WI Post Auction|USD|USD| | |Y|2|N| |USD|N.A.|N.A.|N| |WI TSY BILL| | | | | |N|Y|US|US| |20110707|N.A.|5| | | | | | | | |Y| | | |      |N.A.|N.A.|N.A.|N.A.|N.A.| | | | | | | | |N.A.|N| |Non-Grandfathered|9999X01M9|459|Bill| | | | | |N| | | | |N| | |20110707| | | | | | | |N|1| | | | | | |      | | | |N|BBG001CSH9Y7|

filter · June 14, 2011, 5:15pm

Hi rdcwayx,

Really appreciate your time in writing the awk script.

But each of these files have 500K records.

But I have managed to write a perl script for the diff file. i.e.

#!/usr/local/bin/perl
$self = $0;
$self =~ s!^.*/!!;
#
$[ = 1; # = number of first index into arrays and strings
#
$FIELD_SEPARATOR = '\t';
$FIELD_NUMBER_LIST =('38','82');
$field_separator = $FIELD_SEPARATOR;
$field_number_list = $FIELD_NUMBER_LIST;
#
while (@ARGV)
{
    $_ = shift;
    if    (/^-F$/)    { $field_separator = shift; }
    elsif (/^-L$/)    { $field_number_list = shift; }
    elsif (/^-F.+$/)  { $field_separator = substr($_,$[+2); }
    elsif (/^-L.+$/)  { $field_number_list = substr($_,$[+2); }
    #else              { push(@filename, $_); }
}
#
$file_a = 'file1';
$file_b = 'file2';
#
unless (($file_a ne "") && (-f $file_a))
{
    die "Error: Can't find file '$file_a'!\n";
}
unless (($file_b ne "") && (-f $file_b))
{
    die "Error: Can't find file '$file_b'!\n";
}
#
@index_list = split(/,/, $field_number_list);
#
# Scan first file, Pass 1:
open(FILE_A, "<$file_a") || die "Can't open '$file_a': $!\n";
#
while (<FILE_A>)
{
    chop if /\n$/;
    undef $key;
    undef @field;
    @field = split(/$field_separator/o);
    foreach $index (@index_list)
    {
        if (defined $key)
        {
            $key .= "\n" . $field[$index];
        }
        else
        {
            $key = $field[$index];
        }
    }
     $intersection{$key} = 1;
}
#
close(FILE_A);
# Scan second file, Pass 1:
#
$empty_intersection = 1;
#
open(FILE_B, "<$file_b") || die "Can't open '$file_b': $!\n";
#
while (<FILE_B>)
{
    chop if /\n$/;
    undef $key;
    undef @field;
    @field = split(/$field_separator/o);
    foreach $index (@index_list)
    {
        if (defined $key)
        {
            $key .= "\n" . $field[$index];
        }
        else
        {
            $key = $field[$index];
        }
    }
 $code = $intersection{$key};
if ($code == 1)
    {
        $intersection{$key} = 3;
        $empty_intersection = 0;
    }
    else
    {
        if ($code != 3) { $intersection{$key} = 2; }
    }
}
#
close(FILE_B);
#
# Prepare output file names:
$file_a_1 = $file_a . '.1';
#
# Scan first file, Pass 2:
#
open(FILE_A, "<$file_a")     || die "Can't open '$file_a': $!\n";
open(FILE_A_1, ">$file_a_1") || die "Can't write '$file_a_1': $!\n";
#
while (<FILE_A>)
{
    chop if /\n$/;
    undef $key;
    undef @field;
    @field = split(/$field_separator/o);
    foreach $index (@index_list)
    {
        if (defined $key)
        {
            $key .= "\n" . $field[$index];
        }
else
        {
            $key = $field[$index];
        }
    }
    if ($intersection{$key} == 3)
    {
       # 
    }
    else
    {
        print FILE_A_1 $_, "\n";
    }
}
#
close(FILE_A);
close(FILE_A_1);
#
# Scan second file, Pass 2:
#
open(FILE_B, "<$file_b")     || die "Can't open '$file_b': $!\n";
open(FILE_A_1, ">>$file_a_1") || die "Can't write '$file_a_1': $!\n";
#
while (<FILE_B>)
{
    chop if /\n$/;
    undef $key;
    undef @field;
    @field = split(/$field_separator/o);
    foreach $index (@index_list)
    {
        if (defined $key)
        {
            $key .= "\n" . $field[$index];
        }
        else
        {
            $key = $field[$index];
        }
    }
    if ($intersection{$key} == 3)
    {
      #      }
    else
    {
        print FILE_A_1 $_, "\n";
    }
}
#
close(FILE_B);
close(FILE_A_1);
#
# Display results:
#
printf("The Diff file created '%s'\n\n", $file_a_1);
#

The above code works perfectly for generating the diff file i.e. depending upon the primary keys (here taken 2) the outfile contains the records that exists in file1 but not in file2 and the records that exists in file2 but not in file1.

Now,

I need to compare the whole record(line) if the primary keys in file1 matches with the primary keys in file2. If both the lines are equal then discard else write to the outfile.

Could someone please help me out in order to the above step.

Really appreciate your thoughts on this.

rdcwayx · June 14, 2011, 9:06pm

With your new request, awk code can be more shorter.

awk -F \| 'NR==FNR && /\|/ {a[$49 FS $66 FS $119 FS $188]=$0} 
    NR>FNR && /\|/ {if (a[$49 FS $66 FS $119 FS $188]=="") {print > FILENAME ".diff"} else {print > "same.txt"}}' file1 file2

awk -F \| 'NR==FNR && /\|/ {a[$49 FS $66 FS $119 FS $188]=$0} 
    NR>FNR && /\|/ {if (a[$49 FS $66 FS $119 FS $188]=="") {print > FILENAME ".diff"} else {print > "same.txt"}}' file2 file1

After run the awk commands, you will get three files:

file1.diff                          # exist in file1, but not in file2
file2.diff                          # exist in file2, but not in file2
same.txt                         # exist in both files.

filter · June 15, 2011, 2:29am

Hi rdcwayx,

Thanks a lot for providing the awk script.
As you said the awk looks simpler and shorter.

I have ran the two awk scripts that you have provided to me for the two different files and have generated the two same.txt and same1.txt files.

awk '-F\t' 'NR==FNR && /\t/ {a[$38  FS $82]=$0} NR>FNR && /\t/ {if (a[$38 FS $82]=="") {print  > FILENAME ".diff"} else {print > "same.txt"}}' File1 File2

File1.diff --> Records that exists in File1 but not in File2
same.txt --> Records that exists in both the files

awk '-F\t' 'NR==FNR && /\t/ {a[$38  FS $82]=$0} NR>FNR && /\t/ {if (a[$38 FS $82]=="") {print  > FILENAME ".diff"} else {print > "same1.txt"}}' File1 File2

File2.diff --> Records that exists in File2 but not in File1
same1.txt --> Records that exists in both the files'

Here,
The file size of the same.txt and same1.txt is different.

--> 506108009 Jun 15 01:50 same.txt

--> 505878904 Jun 15 01:52 same1.txt

So, is there anyway that we can capture the records if the Primary Keys ($38 and $82) of both the files matches but the data in the other columns is not the same.

To solve the above issue, I thought once we have the files same.txt and same1.txt we should compare the number of characters in the a line in File1 with number of characters in a line in file2 , if there is a change then write to a file else discard (do nothing or leave it)

Could you please share any ideas to solve the above issue. It would be really grateful.