I want to find the difference between two files, by checking only the headers (column names) and report if any new column is added in the latest file.
For Ex: If the file "declartion.txt has these columns
url;image;id;showcase_id;showcase_name
and the actual file "feed.txt" has a newly added column either in the middle or in the last like
url;image;id;showcase_id;colour;showcase_name
The declaration file which will have only the header (column names), but the "feed.txt" will have column names and the values for those columns under each column. Comparison is required only for the column names (header) and not any values.
After the comparison the shell script should should report the newly added column For Ex: "colour" in a file called "report_column_mis-match.txt". Once the report file is created I the shell script should further remove the new column which is added by the data provider along with it's values.
The data provider is adding columns each now and then not informing us in advance which is creating the problem while loading the file in to mysql database. since the values get loaded haywire when the new column added in the data file but the same is not added in the mysql table.
I finding this bit complicated since I am not a shell script expert. Please help.
I am still doing some research on this to get the clue.
Regards
Praveen
---------- Post updated at 04:46 AM ---------- Previous update was at 03:33 AM ----------
I am able to read the first line from feed.txt and compare it with declaration.txt and print the difference.
But I am still not able to find the position of the column and its values and remove it. Not sure how to handle multiple new columns and remove it, if found.
As explained above, my need is to find the new columns (in this case two new columns url and make) and remove them from the feed.txt along with it's respective values.
Basically I want to identify all the new columns in the feed.txt and delete them along with its values (which means finding column position for all the values) and deleting them so that the data can be loaded for the day without any issues.
Columns can be added anywhere in the data. It may be one or more columns.
Regards
Praveen
---------- Post updated at 08:07 AM ---------- Previous update was at 08:02 AM ----------
Hi RudiC,
Thanks for your proposal and help. Please correct me if I am wrong. When I tried to execute your proposal. It reported only the new column name but did not delete the new column and the associated values.
between two files the difference was, "two new columns "updated_at" and "created_at" (updated_at;created_at) were added in between by the data provider, without giving any notice", due to which the data loading got messed up one day. Similar problem is happening frequently.
So i have planned to alert when the new columns get added and at the same time load the data without any issues by removing the newly added columns and it's relevant values in each row. So that we can add the new columns in the database and also redefine in the declaration file for considering the newly added columns in a planned manner.
awk '
BEGIN { FS=OFS=";" }
NR==FNR {
# 1st file: build hd[]
for (i=1; i<=NF; i++) { hd[$i] }
next
}
FNR==1 {
# header of 2nd file: if in hd[] note that column in col[]
for (i=1; i<=NF; i++) {
if ($i in hd) { col } else { print "Column " i ": " $i > "mismatch.txt" }
}
}
{
# print the columns that are in col[]
sep=""
for (i=1; i<=NF; i++) {
if (i in col) { printf "%s%s", sep, $i; sep=OFS }
}
print ""
}
' declare.txt feed.txt
Thanks for your help. I tried your script as well.
It is doing the job like identifying the new column names and giving it in an output file "mismatch.txt" but it is not removing those new columns and its values from the feed.txt, which is still an issue.
To utilise the method that I've used you'll have to increase the number of column variables and loops to match your first input file. Which is more than the four or so that I was expecting, I have tested this and it will remove the data as required, so you can just increase the initial declarations and for loop count. You'll also have to change the field delimiter to a semicolon.
The files "a" and "b" should be substituted with your file names, the file "a" is your target file - this file determines the number of declared variables (1 for each column) it also determines the number of "for" loops (1 for each column). The file "b" is the file that has the extra fields/columns.
Please remember to change the delimiter and let me know how you get on.