I want to find the difference between two files, only for the header (column names)

Praveen_Pandit · September 10, 2014, 2:43am

Hi All,

I want to find the difference between two files, by checking only the headers (column names) and report if any new column is added in the latest file.

For Ex: If the file "declartion.txt has these columns

url;image;id;showcase_id;showcase_name

and the actual file "feed.txt" has a newly added column either in the middle or in the last like

url;image;id;showcase_id;colour;showcase_name

The declaration file which will have only the header (column names), but the "feed.txt" will have column names and the values for those columns under each column. Comparison is required only for the column names (header) and not any values.

After the comparison the shell script should should report the newly added column For Ex: "colour" in a file called "report_column_mis-match.txt". Once the report file is created I the shell script should further remove the new column which is added by the data provider along with it's values.

The data provider is adding columns each now and then not informing us in advance which is creating the problem while loading the file in to mysql database. since the values get loaded haywire when the new column added in the data file but the same is not added in the mysql table.

I finding this bit complicated since I am not a shell script expert. Please help.

Thanks in advance.

Regards
Praveen

Don_Cragun · September 10, 2014, 2:57am

What have you tried so far?

Praveen_Pandit · September 10, 2014, 5:46am

Hi Don Cragun,

I am still doing some research on this to get the clue.

Regards
Praveen

---------- Post updated at 04:46 AM ---------- Previous update was at 03:33 AM ----------

I am able to read the first line from feed.txt and compare it with declaration.txt and print the difference.

But I am still not able to find the position of the column and its values and remove it. Not sure how to handle multiple new columns and remove it, if found.

Please help.

Thanks & regards
Praveen

gull04 · September 10, 2014, 5:48am

Hi Praveen,

Just a couple of quick questions, does the column always appear in the same place along with it's data.

Can you show us an example of the data?

Regards

Dave

RudiC · September 10, 2014, 6:38am

Take this as a starting point:

awk -F\; 'NR==FNR               {split ($0, HD)
                                 DEC=$0
                                 next
                                }
          FNR==1 && $0==DEC     {exit}
          FNR==1                {for (i=1; i<=NF && $i==HD; i++);
                                 print "Column " i ": " $i > "mismatch.txt"}
                                {$i=""; sub (/;;/,";"); sub (/^;|;$/, "")
                                }
          1
        ' OFS=";" declartion.txt feed.txt

Praveen_Pandit · September 10, 2014, 7:11am

Hi Rudic,

But it is not listing the newly added column. If the above script is run, it is giving the output as

Column 1: url,dealerid,dealer_name,location,zip,make

in mismatch.txt

As explained above, my need is to find the new columns (in this case two new columns url and make) and remove them from the feed.txt along with it's respective values.

Thanks & Regards
Praveen

RudiC · September 10, 2014, 7:27am

Well, it does for the files you gave in your spec in post #1. With declartion.txt

url;image;id;showcase_id;showcase_name

and feed.txt

url;image;id;showcase_id;colour;showcase_name
url1;image1;id1;showcase_id1;colour1;showcase_name1
url2;image2;id2;showcase_id2;colour2;showcase_name2
url3;image3;id3;showcase_id3;colour3;showcase_name3
url4;image4;id4;showcase_id4;colour4;showcase_name4

it has stdout as

url;image;id;showcase_id;showcase_name
url1;image1;id1;showcase_id1;showcase_name1
url2;image2;id2;showcase_id2;showcase_name2
url3;image3;id3;showcase_id3;showcase_name3
url4;image4;id4;showcase_id4;showcase_name4

and mismatch.txt as

Column 5: colour

As explained above, it does what was specified, and you obviously changed the input file structure.

Praveen_Pandit · September 10, 2014, 7:36am

Hi Rudic,

Sorry for the confusion. I apologize for the same. Yes, I was trying with a different input file and what I specified above initially was an example.

But I still have a difficulty in deleting the column with its values after identifying the new column.

Please help.

Thanks & regards
Praveen

RudiC · September 10, 2014, 7:53am

Help with what?

Praveen_Pandit · September 10, 2014, 8:09am

Hi Rudic,

Basically I want to identify all the new columns in the feed.txt and delete them along with its values (which means finding column position for all the values) and deleting them so that the data can be loaded for the day without any issues.

Thanks & regards
Praveen

RudiC · September 10, 2014, 8:16am

For which I presented a proposal that works with the data you presented.

Praveen_Pandit · September 10, 2014, 9:07am

Hi Dave,

Columns can be added anywhere in the data. It may be one or more columns.

Regards
Praveen

---------- Post updated at 08:07 AM ---------- Previous update was at 08:02 AM ----------

Hi RudiC,

Thanks for your proposal and help. Please correct me if I am wrong. When I tried to execute your proposal. It reported only the new column name but did not delete the new column and the associated values.

Best Regards
Praveen

RudiC · September 10, 2014, 9:22am

OK. Please show your input file, output file and any error message that may occur.

gull04 · September 10, 2014, 11:41am

Hi,

If you want to delete the additional data from the new input file, what is the reason for capturing what the additional column headings were?

Regards

Dave

---------- Post updated at 04:41 PM ---------- Previous update was at 02:34 PM ----------

Hi,

You could try this, using awk - a and b are the file names I used for the test.

awk -F, '
FILENAME == "a" && FNR == 1 {
        col[$1] = 1
        col[$2] = 1
        col[$3] = 1
        col[$4] = 1
# print "saved col1 = " col[$1], $1
# print "saved col2 = " col[$2], $2
# print "saved col3 = " col[$3], $3
# print "saved col4 = " col[$4], $4
}
FILENAME != "a" && FNR == 1 {
        for ( i=1; i<=NF; i++)
        {
                if ($i in col)
                {
                        # print i " saving to col1"
                        col1 = i
                        i++
                        break
                }
        }
        for ( ;i<=NF;i++)
        {
                if ($i in col)
                {
                        # print $i " saving to col2"
                        col2 = i
                        i++
                        break
                }
        }
        for ( ;i<=NF;i++)
        {
                if ($i in col)
                {
                        # print $i " saving to col3"
                        col3 = i
                        i++
                        break
                }
        }
        for ( ;i<=NF;i++)
        {
                if ($i in col)
                {
                        # print $i " saving to col4"
                        col4 = i
                        i++
                        break
                }
        }
}
FILENAME != "a" && FNR > 1 {
        printf "%s,%s,%s,%s\n", $col1, $col2, $col3, $col4
}
' a b

If you need any further help I'll have to see your input files.

Regards

Dave

Praveen_Pandit · September 10, 2014, 1:07pm

Here is my input file "feed.txt" (first row is a header with column names. I have pasted dummy column values for 3 rows)

url;image;id;showcase_id;showcase_name;showcase_zip_code;vehicle_id;carmodel_id;make_id;external_mpn;carmodel_name;make_name;colour;colour_alias;colour_secondary_name;colour_secondary_alias;trim;catalog_price;customer_bonus;reprise_bonus;legal_notice;rebate_legal_notice;eco_bonus;price;customer_benefit_percent;options;code;name;matriculation_on;status;mileage;vehicle_acceleration_0_100kph;vehicle_air_conditioning;vehicle_body_type;vehicle_co2_emission_level;vehicle_combined_fuel_economy;vehicle_cylinders_count;vehicle_doors_count;vehicle_driven_wheels;vehicle_emission_standard;vehicle_engine_capacity;vehicle_extra_urban_fuel_economy;vehicle_fiscal_horse_power;vehicle_height;vehicle_kerbweight;vehicle_length;vehicle_max_load_capacity;vehicle_name;vehicle_primary_fuel_tank_capacity;vehicle_primary_fuel_type;vehicle_published_hp_metric;vehicle_seats_count;vehicle_secondary_fuel_type;vehicle_type;vehicle_luggage_capacity;vehicle_tvs_tax;vehicle_transmission_shiftable;vehicle_trim_name;vehicle_trim_code;vehicle_urban_fuel_economy;vehicle_vehicle_warranty_kms;vehicle_vehicle_warranty_months;vehicle_width_excluding_mirrors;vehicle_transmission_type;vehicle_engine_name;vehicle_engine_code;vehicle_rsi_id;image_1;image_2;image_3;image_4;image_5
url_1;image_1;id_1;showcase_id_1;showcase_name_1;showcase_zip_code_1;vehicle_id_1;carmodel_id_1;make_id_1;external_mpn_1;carmodel_name_1;make_name_1;colour_1;colour_alias_1;colour_secondary_name_1;colour_secondary_alias_1;trim_1;catalog_price_1;customer_bonus_1;reprise_bonus_1;legal_notice_1;rebate_legal_notice_1;eco_bonus_1;price_1;customer_benefit_percent_1;options_1;code_1;name_1;matriculation_on_1;status_1;updated_at_1;created_at_1;mileage_1;vehicle_acceleration_0_100kph_1;vehicle_air_conditioning_1;vehicle_body_type_1;vehicle_co2_emission_level_1;vehicle_combined_fuel_economy_1;vehicle_cylinders_count_1;vehicle_doors_count_1;vehicle_driven_wheels_1;vehicle_emission_standard_1;vehicle_engine_capacity_1;vehicle_extra_urban_fuel_economy_1;vehicle_fiscal_horse_power_1;vehicle_height_1;vehicle_kerbweight_1;vehicle_length_1;vehicle_max_load_capacity_1;vehicle_name_1;vehicle_primary_fuel_tank_capacity_1;vehicle_primary_fuel_type_1;vehicle_published_hp_metric_1;vehicle_seats_count_1;vehicle_secondary_fuel_type_1;vehicle_type_1;vehicle_luggage_capacity_1;vehicle_tvs_tax_1;vehicle_transmission_shiftable_1;vehicle_trim_name_1;vehicle_trim_code_1;vehicle_urban_fuel_economy_1;vehicle_vehicle_warranty_kms_1;vehicle_vehicle_warranty_months_1;vehicle_width_excluding_mirrors_1;vehicle_transmission_type_1;vehicle_engine_name_1;vehicle_engine_code_1;vehicle_rsi_id_1;image_1_1;image_2_1;image_3_1;image_4_1;image_5_1
url_2;image_2;id_2;showcase_id_2;showcase_name_2;showcase_zip_code_2;vehicle_id_2;carmodel_id_2;make_id_2;external_mpn_2;carmodel_name_2;make_name_2;colour_2;colour_alias_2;colour_secondary_name_2;colour_secondary_alias_2;trim_2;catalog_price_2;customer_bonus_2;reprise_bonus_2;legal_notice_2;rebate_legal_notice_2;eco_bonus_2;price_2;customer_benefit_percent_2;options_2;code_2;name_2;matriculation_on_2;status_2;updated_at_2;created_at_2;mileage_2;vehicle_acceleration_0_100kph_2;vehicle_air_conditioning_2;vehicle_body_type_2;vehicle_co2_emission_level_2;vehicle_combined_fuel_economy_2;vehicle_cylinders_count_2;vehicle_doors_count_2;vehicle_driven_wheels_2;vehicle_emission_standard_2;vehicle_engine_capacity_2;vehicle_extra_urban_fuel_economy_2;vehicle_fiscal_horse_power_2;vehicle_height_2;vehicle_kerbweight_2;vehicle_length_2;vehicle_max_load_capacity_2;vehicle_name_2;vehicle_primary_fuel_tank_capacity_2;vehicle_primary_fuel_type_2;vehicle_published_hp_metric_2;vehicle_seats_count_2;vehicle_secondary_fuel_type_2;vehicle_type_2;vehicle_luggage_capacity_2;vehicle_tvs_tax_2;vehicle_transmission_shiftable_2;vehicle_trim_name_2;vehicle_trim_code_2;vehicle_urban_fuel_economy_2;vehicle_vehicle_warranty_kms_2;vehicle_vehicle_warranty_months_2;vehicle_width_excluding_mirrors_2;vehicle_transmission_type_2;vehicle_engine_name_2;vehicle_engine_code_2;vehicle_rsi_id_2;image_1_2;image_2_2;image_3_2;image_4_2;image_5_2
url_3;image_3;id_3;showcase_id_3;showcase_name_3;showcase_zip_code_3;vehicle_id_3;carmodel_id_3;make_id_3;external_mpn_3;carmodel_name_3;make_name_3;colour_3;colour_alias_3;colour_secondary_name_3;colour_secondary_alias_3;trim_3;catalog_price_3;customer_bonus_3;reprise_bonus_3;legal_notice_3;rebate_legal_notice_3;eco_bonus_3;price_3;customer_benefit_percent_3;options_3;code_3;name_3;matriculation_on_3;status_3;updated_at_3;created_at_3;mileage_3;vehicle_acceleration_0_100kph_3;vehicle_air_conditioning_3;vehicle_body_type_3;vehicle_co2_emission_level_3;vehicle_combined_fuel_economy_3;vehicle_cylinders_count_3;vehicle_doors_count_3;vehicle_driven_wheels_3;vehicle_emission_standard_3;vehicle_engine_capacity_3;vehicle_extra_urban_fuel_economy_3;vehicle_fiscal_horse_power_3;vehicle_height_3;vehicle_kerbweight_3;vehicle_length_3;vehicle_max_load_capacity_3;vehicle_name_3;vehicle_primary_fuel_tank_capacity_3;vehicle_primary_fuel_type_3;vehicle_published_hp_metric_3;vehicle_seats_count_3;vehicle_secondary_fuel_type_3;vehicle_type_3;vehicle_luggage_capacity_3;vehicle_tvs_tax_3;vehicle_transmission_shiftable_3;vehicle_trim_name_3;vehicle_trim_code_3;vehicle_urban_fuel_economy_3;vehicle_vehicle_warranty_kms_3;vehicle_vehicle_warranty_months_3;vehicle_width_excluding_mirrors_3;vehicle_transmission_type_3;vehicle_engine_name_3;vehicle_engine_code_3;vehicle_rsi_id_3;image_1_3;image_2_3;image_3_3;image_4_3;image_5_3

Here is my declaration file containing the predefined header (columns)

url;image;id;showcase_id;showcase_name;showcase_zip_code;vehicle_id;carmodel_id;make_id;external_mpn;carmodel_name;make_name;colour;colour_alias;colour_secondary_name;colour_secondary_alias;trim;catalog_price;customer_bonus;reprise_bonus;legal_notice;rebate_legal_notice;eco_bonus;price;customer_benefit_percent;options;code;name;matriculation_on;status;mileage;vehicle_acceleration_0_100kph;vehicle_air_conditioning;vehicle_body_type;vehicle_co2_emission_level;vehicle_combined_fuel_economy;vehicle_cylinders_count;vehicle_doors_count;vehicle_driven_wheels;vehicle_emission_standard;vehicle_engine_capacity;vehicle_extra_urban_fuel_economy;vehicle_fiscal_horse_power;vehicle_height;vehicle_kerbweight;vehicle_length;vehicle_max_load_capacity;vehicle_name;vehicle_primary_fuel_tank_capacity;vehicle_primary_fuel_type;vehicle_published_hp_metric;vehicle_seats_count;vehicle_secondary_fuel_type;vehicle_type;vehicle_luggage_capacity;vehicle_tvs_tax;vehicle_transmission_shiftable;vehicle_trim_name;vehicle_trim_code;vehicle_urban_fuel_economy;vehicle_vehicle_warranty_kms;vehicle_vehicle_warranty_months;vehicle_width_excluding_mirrors;vehicle_transmission_type;vehicle_engine_name;vehicle_engine_code;vehicle_rsi_id;image_1;image_2;image_3;image_4;image_5

between two files the difference was, "two new columns "updated_at" and "created_at" (updated_at;created_at) were added in between by the data provider, without giving any notice", due to which the data loading got messed up one day. Similar problem is happening frequently.

So i have planned to alert when the new columns get added and at the same time load the data without any issues by removing the newly added columns and it's relevant values in each row. So that we can add the new columns in the database and also redefine in the declaration file for considering the newly added columns in a planned manner.

Thanks & regards
Praveen

MadeInGermany · September 10, 2014, 2:11pm

awk '
BEGIN { FS=OFS=";" }
NR==FNR {
  # 1st file: build hd[]
  for (i=1; i<=NF; i++) { hd[$i] }
  next
}
FNR==1 {
  # header of 2nd file: if in hd[] note that column in col[]
  for (i=1; i<=NF; i++) {
    if ($i in hd) { col } else { print "Column " i ": " $i > "mismatch.txt" }
  }
}
{
  # print the columns that are in col[]
  sep=""
  for (i=1; i<=NF; i++) {
    if (i in col) { printf "%s%s", sep, $i; sep=OFS }
  }
  print ""
}
' declare.txt feed.txt

Praveen_Pandit · September 11, 2014, 12:42am

Hi MadeInGermany,

Thanks for your help. I tried your script as well.

It is doing the job like identifying the new column names and giving it in an output file "mismatch.txt" but it is not removing those new columns and its values from the feed.txt, which is still an issue.

Regards
Praveen

gull04 · September 11, 2014, 3:10am

Hi,

To utilise the method that I've used you'll have to increase the number of column variables and loops to match your first input file. Which is more than the four or so that I was expecting, I have tested this and it will remove the data as required, so you can just increase the initial declarations and for loop count. You'll also have to change the field delimiter to a semicolon.

Not very elegant, but it will work.

Regards

Dave

Praveen_Pandit · September 11, 2014, 3:18am

Hi Dave,

Thanks for your help. Can you please let me know what is a & b ?

My input file is feed.txt and the columns I defined for comparison was in the file "declaration.txt" .

If I need to declare all the columns in the code itself then can I ignore "declaration.txt"?

Please advice.

Regards
Praveen

gull04 · September 11, 2014, 3:53am

Hi Praveen,

The files "a" and "b" should be substituted with your file names, the file "a" is your target file - this file determines the number of declared variables (1 for each column) it also determines the number of "for" loops (1 for each column). The file "b" is the file that has the extra fields/columns.

Please remember to change the delimiter and let me know how you get on.

Regards

Dave