awk compare files

rakesh_411 · January 20, 2015, 5:59pm

I have a below requirement and trying to compare the files using awk

File 1 - Already stored on a prev day

 id   | text       | email id 
---------------------------------
89564|this is line 1 | xyz@sample.txt
985384|this is line 2 | abc@sample.txt
657342|this is line 3 | def@sample.txt

File 2 - Arrived today

 id   | text       | email id 
---------------------------------
89564|this is line 4 | xyz@sample.txt
657342|this is line 3 | def@sample.txt
985384|this is line 2 | abc@sample.txt

Requested output

id   | text        | email id       | operation 
-------------------------------------------------
89564|this is line 4  | xyz@sample.txt | modified 
985384|this is line 2 | abc@sample.txt | deleted
657342|this is line 3 | def@sample.txt | inserted

We are trying to use hadoop but however because with pseudo database hive we don't have insert or update operation , hence we are trying to generate the operation code at the shell scripting .

I tried a code as below but it does not give me a desired ouput

 
 awk -F"|" 'NR==FNR{a[$1]=$2;next}{if (a[$1])print a[$1],$0;else print "inserted", $0;}' OFS="|" file1 file2

Don_Cragun · January 21, 2015, 2:25am

I don't understand the logic behind your requested output.

For id 89564, the output line doesn't match either input line. (The text field in the output has an extra space that does not appear in File 2.)

For id 657342, both entries in the input files are identical. So why do you want to say it should be deleted???

For id 985384, both entries in the input files are identical. So why do you want to say it should be inserted???

rakesh_411 · January 21, 2015, 1:17pm

I am sorry for not stating my requirement properly , We are trying to accomplish incremental load of data by comparing two files

File 1

 id   | text       | email id 
---------------------------------
89564|this is line 1 | xyz@sample.txt
985384|this is line 14 | abc@sample.txt
657342|this is line 3 | def@sample.txt

File 2

 id   | text       | email id 
---------------------------------
89564|this is line 1 | xyz@sample.txt
985384|this is line 2 | abc@sample.txt

I would like output by comparing above two files which should display below results
Desired Output:
-------------------

 id   | text       | email id 
---------------------------------
89564|this is line 1 | xyz@sample.txt
985384|this is line 2 | abc@sample.txt | modified (Note :  second column got modified ) 
657342|this is line 3 | def@sample.txt | deleted

Here we are trying to compare two files and if we don't find a record in file 2 based on column 1 ( primary key ) we treat that entry as deleted and we generate a new file with its operational code(in this case deleted). Similarly we if we find record in both the files ( file 1 and file2) we would still have that record in file 3 ,and for modified if we see any column modifies in file 2 we add the entry as modified .

RudiC · January 21, 2015, 1:37pm

Please use code tags as required by forum rules!

Try

awk     'NR==FNR        {T[$1]=$0; next}
                        {printf "%s", $0}
         $1 in T        {if ($0 != T[$1]) printf "|modified."  
                         delete T[$1]
                         printf "\n"
                         next
                        }
                        {print  "|inserted"}  
         END            {for (t in T) print T[t], "deleted."} 
        ' FS="|" OFS="|" file1 file2
id | text | email id
---------------------------------
89564|this is line 1 | xyz@sample.txt
985384|this is line 2 | abc@sample.txt|modified.
885384|this is line 7 | abc@sample.txt|inserted
657342|this is line 3 | def@sample.txt|deleted.