Remove duplicates in flat file

Hi all,
I have an issue while loading a flat file into the DB: it is taking too long.
When I analyzed it, I found that there are duplicate entries in the flat file.
There are two types of duplicate entry:
1) The entire row is duplicated. (I can use sort | uniq to remove these.)
2) The columns forming the composite primary key are the same for two records, but the other columns differ; these are also rejected and only one gets loaded. Please find below an example.
My PK columns are 1, 4, 6 and 8 in the flat file that is going to be loaded into the DB.
Column names: 1 2 3 4 5 6 7 8 9 10
Record 1:     a b c d e f g h i j
Record 2:     a k l d m f n h o p
Since only the PK columns are the same and the rest differ, the loader is omitting these records. Can you tell me a script with which I can omit record 2?
Please help. We have issues that must be fixed before tomorrow evening.
Thanks in advance
Sam

Try

sort -u -k1,1 -k4,4 -k6,6 -k8,8 file
a b c d e f g h i j

or use an awk solution of which many have been posted in here in the recent past.
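For reference, a common awk idiom for this (the sample records below are hypothetical, with key columns 1, 4, 6 and 8 as in the question) keeps only the first record seen for each composite key:

```shell
# Hypothetical whitespace-separated sample: both records share
# key columns 1, 4, 6, 8 (values a, d, f, h)
printf 'a b c d e f g h i j\na k l d m f n h o p\n' > sample_ws.txt

# !seen[key]++ is true only the first time a key appears,
# so only the first record per composite key is printed
awk '!seen[$1 FS $4 FS $6 FS $8]++' sample_ws.txt
# prints: a b c d e f g h i j
```

Unlike sort -u, this preserves the original input order.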


Thanks, RudiC. I will give it a try.
If there are any other options, please let me know.

---------- Post updated 12-11-13 at 07:33 AM ---------- Previous update was 12-10-13 at 08:44 AM ----------

RudiC,

I have tried it out, but there is one more complication with my flat file: the fields are not separated by " " but by "|".
For example:

cat sample.txt
a|b|c|d|e|f|g|h|i|j
k|l|m|n|o|p|q|r|i|t
a|l|c|n|e|p|g|r|i|t

Your solution was effective for the space-separated case, but since this file is separated by "|", what do I need to do?
Please suggest a solution for this.

Thanks
Sam

Try

# sort-based: keeps one record per composite key, output in key order
sort -u -t "|" -k1,1 -k4,4 -k6,6 -k8,8 file

# awk-based: keeps the first record per composite key, preserving input order
awk -F'|' '{k=$1"|"$4"|"$6"|"$8} !(k in keys) {print; keys[k]++}' file
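One practical difference worth noting: sort -u reorders the file and keeps one record per key in sort order, whereas the awk version preserves the original input order and always keeps the first occurrence of each key. A sketch on hypothetical "|"-separated data where both records share key columns 1, 4, 6 and 8:

```shell
# Two records with identical key columns 1, 4, 6, 8 ("a|d|f|h")
printf 'a|b|c|d|e|f|g|h|i|j\na|k|l|d|m|f|n|h|o|p\n' > sample_pipe.txt

# With -F'|' the field separator FS is "|", so the concatenated
# key is "a|d|f|h"; only the first record carrying it is printed
awk -F'|' '!seen[$1 FS $4 FS $6 FS $8]++' sample_pipe.txt
# prints: a|b|c|d|e|f|g|h|i|j
```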