Delete near-duplicate lines (same key field) from a file

I need help figuring out how to delete lines in a data file. The data file is huge. I am currently using "vi" to search for and delete the lines, which is cumbersome because saving the file takes a long time (due to its size).

Here is the issue. I have a data file with the following data, separated by "|":

fld1|xxx|yyy|zzz|aaa|bbb|ccc|
fld2|qqq|www|eee|rrr|ttt|yyy|
fld3|aaa|sss|ddd|fff|ggg|hhh|
fld4|zzz|xxx|ccc|vvv|bbb|nnn|
fld2|qqq|www|eee|rrr|ooo|yyy|

I want to remove the line that is almost a duplicate, which is line #5. Line #2 and line #5 are almost duplicates, but the sixth field is different. I need to search only on the 1st field of the record (which in this case is "fld2") and then delete the 2nd occurrence of the same 1st field.

Can this be done? If yes, how? The file contains around 500K rows.

awk -F\| '{if(!y[$1]) print y[$1]=$0}' file

Thanks ShamRock.

It works on the test file I posted.

However, when I tried it on my actual file of around 5.3 million rows, it stripped out 600K rows, which seems wrong because when I load the file into my database, it complains about only 3 rows. So ideally the difference between the original file and the new file (created by redirecting the awk output) should be 3 rows. Those 3 rows are stripped out, but I am not sure why the other rows were stripped out. I checked a few of them and they had no duplicates in the original file.

I might be missing something, which I am investigating now. But can you explain your awk script? And if I have to add one more field to the check, how do I do that in the awk script?

Do you get the same issue with this?

awk -F\| '!y[$1]++' file

To just see lines it would remove:

awk -F\| 'y[$1]++' file
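
In case it helps with the "explain the script" and "one more field" questions, here is a rough sketch run against the sample data from the first post. The file names (sample.txt and the two output files) and the array name "seen" are made up for illustration, and using fields 1 and 2 as the combined key is only an assumption about what the extra check might look like.

```shell
# Sample data from the first post, written to a scratch file.
cat > sample.txt <<'EOF'
fld1|xxx|yyy|zzz|aaa|bbb|ccc|
fld2|qqq|www|eee|rrr|ttt|yyy|
fld3|aaa|sss|ddd|fff|ggg|hhh|
fld4|zzz|xxx|ccc|vvv|bbb|nnn|
fld2|qqq|www|eee|rrr|ooo|yyy|
EOF

# seen[$1]++ yields the count *before* the increment: 0 the first time a
# value of field 1 appears, 1 or more afterwards. Negating it makes the
# pattern true only on the first occurrence, and a pattern with no action
# prints the line.
awk -F'|' '!seen[$1]++' sample.txt > dedup_by_field1.txt

# Combined key (assumption: fields 1 and 2): a line is dropped only when
# the same pair of values has already been seen together.
awk -F'|' '!seen[$1 FS $2]++' sample.txt > dedup_by_field1_and_2.txt
```

Joining the fields with FS keeps the combined key unambiguous, since the separator cannot occur inside a field.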

ShamRock's suggestion worked fine.
I figured it out - it was not the first field that needed to be checked for duplicates but the second field. I changed ShamRock's awk script appropriately and it worked like a charm.

awk -F\| '{if(!y[$2]) print y[$2]=$0}' old_file > new_file
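
For anyone hitting the same surprise (far more rows removed than expected), a quick count of repeated key values before deleting anything may show whether the chosen field is really the key. This is only a sketch against the sample data from the first post; the file names (sample.txt, dup_keys.txt, removed.txt) and the array names are invented for the example, and field 2 is used because that is what turned out to matter here.

```shell
# Sample data from the first post, written to a scratch file.
cat > sample.txt <<'EOF'
fld1|xxx|yyy|zzz|aaa|bbb|ccc|
fld2|qqq|www|eee|rrr|ttt|yyy|
fld3|aaa|sss|ddd|fff|ggg|hhh|
fld4|zzz|xxx|ccc|vvv|bbb|nnn|
fld2|qqq|www|eee|rrr|ooo|yyy|
EOF

# Count lines per value of field 2, then report only the values that
# occur more than once, as "<count> <value>".
awk -F'|' '{ count[$2]++ }
           END { for (k in count) if (count[k] > 1) print count[k], k }' \
    sample.txt > dup_keys.txt

# Save the lines a de-duplication on field 2 would drop, so they can be
# inspected and counted before the original file is touched.
awk -F'|' 'seen[$2]++' sample.txt > removed.txt
wc -l < removed.txt
```

On a file with millions of rows, the counter form also keeps only a small number per distinct key in memory, whereas storing $0 in the array (as in y[$2]=$0) keeps a full line per key.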