I have a below requirement and trying to compare the files using awk
File 1 - Already stored on a prev day
id | text | email id
---------------------------------
89564|this is line 1 | xyz@sample.txt
985384|this is line 2 | abc@sample.txt
657342|this is line 3 | def@sample.txt
File 2 - Arrived today
id | text | email id
---------------------------------
89564|this is line 4 | xyz@sample.txt
657342|this is line 3 | def@sample.txt
985384|this is line 2 | abc@sample.txt
Requested output
id | text | email id | operation
-------------------------------------------------
89564|this is line 4 | xyz@sample.txt | modified
985384|this is line 2 | abc@sample.txt | deleted
657342|this is line 3 | def@sample.txt | inserted
We are trying to use hadoop but however because with pseudo database hive we don't have insert or update operation , hence we are trying to generate the operation code at the shell scripting .
I tried a code as below but it does not give me a desired ouput
I am sorry for not stating my requirement properly , We are trying to accomplish incremental load of data by comparing two files
File 1
id | text | email id
---------------------------------
89564|this is line 1 | xyz@sample.txt
985384|this is line 14 | abc@sample.txt
657342|this is line 3 | def@sample.txt
File 2
id | text | email id
---------------------------------
89564|this is line 1 | xyz@sample.txt
985384|this is line 2 | abc@sample.txt
I would like output by comparing above two files which should display below results
Desired Output:
-------------------
id | text | email id
---------------------------------
89564|this is line 1 | xyz@sample.txt
985384|this is line 2 | abc@sample.txt | modified (Note : second column got modified )
657342|this is line 3 | def@sample.txt | deleted
Here we are trying to compare two files and if we don't find a record in file 2 based on column 1 ( primary key ) we treat that entry as deleted and we generate a new file with its operational code(in this case deleted). Similarly we if we find record in both the files ( file 1 and file2) we would still have that record in file 3 ,and for modified if we see any column modifies in file 2 we add the entry as modified .
awk 'NR==FNR {T[$1]=$0; next}
{printf "%s", $0}
$1 in T {if ($0 != T[$1]) printf "|modified."
delete T[$1]
printf "\n"
next
}
{print "|inserted"}
END {for (t in T) print T[t], "deleted."}
' FS="|" OFS="|" file1 file2
id | text | email id
---------------------------------
89564|this is line 1 | xyz@sample.txt
985384|this is line 2 | abc@sample.txt|modified.
885384|this is line 7 | abc@sample.txt|inserted
657342|this is line 3 | def@sample.txt|deleted.