How to delete a duplicate line and original with sed.

I am completely new to shell scripting but have been assigned the task of creating several batch files to manipulate data. My final task requires me to find lines that have duplicates and then delete not only the duplicate but the original as well. The script will be used in a Windows environment, so I am using GNU sed. Below is a sample of the data:

180222,1,7.3,1Z0E947E0353634,9.49,UPAC
180223,1,7.3,1Z0E947E0373254,9.49,UPAC
180224,1,7.3,1Z0E947E0371556,8.33,UPAC
180222,1,7.3,1Z0E947E0353634,9.49,UPAC

In this example the first and last lines are duplicates, and I would like to delete them both. I have been searching for several days and have not been able to figure out how to achieve this. Unfortunately I am short on time and would greatly appreciate any help. Thanks.

You need awk for this:

awk 'NR==FNR{a[$0]++;next} a[$0]<2' infile infile

If you can sort the data, then to show the duplicated lines:

sort | uniq -d < YOURFILE

To remove them:

sort | uniq -u < YOURFILE

sort and uniq are available in MSYS, Cygwin, or the GNU utilities for Windows.

Thanks for responding so quickly. rdcwayx, I downloaded and installed awk; however, my output file comes up blank. I used the target file in place of the first "infile" and the destination file in place of the second "infile". Did I misunderstand that part?

No, both arguments are the same input file; it is read twice.

I tested the code on Solaris and still get the right output.

Did you download the latest gawk version?
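
To make the two-pass idea concrete, here is a minimal sketch run against the sample data from the first post. The scratch file and the array name count are only for illustration, and it assumes a POSIX shell (MSYS or Cygwin on Windows) rather than the Windows command prompt.

```shell
# Recreate the sample data from the first post in a scratch file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
180222,1,7.3,1Z0E947E0353634,9.49,UPAC
180223,1,7.3,1Z0E947E0373254,9.49,UPAC
180224,1,7.3,1Z0E947E0371556,8.33,UPAC
180222,1,7.3,1Z0E947E0353634,9.49,UPAC
EOF

# Pass 1: NR==FNR is true only while the first file argument is being read,
#         so each whole line ($0) is counted and nothing is printed.
# Pass 2: the same file is read again and a line is printed only if its
#         count is below 2, i.e. it never repeated.
result=$(awk 'NR==FNR { count[$0]++; next } count[$0] < 2' "$tmp" "$tmp")
printf '%s\n' "$result"

rm -f "$tmp"
```

Only the 180223 and 180224 lines should be printed, in their original order. The output goes to standard output, so it is redirected to a separate file rather than passed as the second argument.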

This looks simpler, but I had to change the command to:

sort YOURFILE |uniq -u
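
As a rough illustration of what the two uniq options do with the sample data from the first post (the scratch file and variable names are made up for the example, and a POSIX shell such as MSYS or Cygwin is assumed):

```shell
# Recreate the sample data from the first post in a scratch file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
180222,1,7.3,1Z0E947E0353634,9.49,UPAC
180223,1,7.3,1Z0E947E0373254,9.49,UPAC
180224,1,7.3,1Z0E947E0371556,8.33,UPAC
180222,1,7.3,1Z0E947E0353634,9.49,UPAC
EOF

# uniq only compares adjacent lines, so the input is sorted first.
dups=$(sort "$tmp" | uniq -d)   # one copy of each line that occurs more than once
kept=$(sort "$tmp" | uniq -u)   # only the lines that occur exactly once

printf 'duplicated:\n%s\n' "$dups"
printf 'kept:\n%s\n' "$kept"

rm -f "$tmp"
```

One difference from the awk version: the result comes out in sorted order, not in the order of the input file.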

Thank you both. I will try again in the morning and let you know how it goes.

Yes, my mistake. I tested it like this:

cat YOURFILE | sort | uniq -u

but then changed it when posting.

Sorry it took so long. I have tried both techniques but am unable to get them to work. I put the GNUWin directory in the Windows PATH, and the version of gawk is from October 2003. However, I have also been informed of 2 changes in the file. Apparently there will be instances where a duplicate is acceptable, so an additional column has been added. If the last column in a line is Y, then that line and the original are to be removed.

"180222","1","8.5","1Z0E947E0355363450","9.49","UPAC","N"
"180223","1","7.3","1Z0E947E0357325461","6.53","UPAC","N"
"180225","1","3.9","1Z0E947E0355178080","4.77","UPAC","N"
"180223","1","7.3","1Z0E947E0357325461","6.53","UPAC","N"
"180222","1","8.5","1Z0E947E0355363450","9.49","UPAC","Y"

As you can see, because the last line has Y, it and the first line are to be removed, while lines 2 and 4 are to remain. Thanks again.

I forgot to add that when I run either technique I don't receive any errors, only blank output in the file. Below are examples of how I set it up.

sort C:\Users\username\Desktop\TestingFile\myfile1.csv |uniq -u > C:\Users\username\Desktop\TestingFile\myfile2.csv
gawk "NR==FNR{a[$0]++;next} a[$0]<2" C:\Users\username\Desktop\TestingFile\myfile1.csv C:\Users\username\Desktop\TestingFile\myfile1.csv >> C:\Users\username\Desktop\TestingFile\myfile2.csv

The awk command should be in single quotes only:

gawk 'NR==FNR{a[$0]++;next} a[$0]<2' 

And you need to provide your expected output. I still don't really understand the requirement.

Sorry about that. I forgot to post the expected output:

"180223","1","7.3","1Z0E947E0357325461","6.53","UPAC","N"
"180225","1","3.9","1Z0E947E0355178080","4.77","UPAC","N"
"180223","1","7.3","1Z0E947E0357325461","6.53","UPAC","N"

Regarding the single quotes, I found that MS-DOS requires double quotes; otherwise I get the following error:

"The system cannot find the file specified."

I thought it was my path; however, it was correct.
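
For the revised requirement, one possible approach is the same two-pass awk idea, keyed on everything except the last column. This is only a sketch under some assumptions: no field contains an embedded comma, the flag is exactly "Y" including the quotes, the file has Unix line endings, and "the original" means every line that matches in all columns but the flag. The file names data.csv and dropflagged.awk are made up for the example. Keeping the program in its own file and running it with -f (a standard awk option) also sidesteps the single versus double quote question, since nothing has to be quoted on the command line.

```shell
# Scratch directory holding the sample data and the awk program.
work=$(mktemp -d)
cat > "$work/data.csv" <<'EOF'
"180222","1","8.5","1Z0E947E0355363450","9.49","UPAC","N"
"180223","1","7.3","1Z0E947E0357325461","6.53","UPAC","N"
"180225","1","3.9","1Z0E947E0355178080","4.77","UPAC","N"
"180223","1","7.3","1Z0E947E0357325461","6.53","UPAC","N"
"180222","1","8.5","1Z0E947E0355363450","9.49","UPAC","Y"
EOF

cat > "$work/dropflagged.awk" <<'EOF'
# For every line: key = the line with its last (flag) column stripped.
{ key = $0; sub(/,[^,]*$/, "", key) }
# Pass 1 (first file argument): remember every key that has a "Y" row.
NR == FNR { if ($NF == "\"Y\"") flagged[key] = 1; next }
# Pass 2 (second file argument): print rows whose key was never flagged.
!(key in flagged)
EOF

# The data file is given twice, once for each pass.
result=$(awk -F, -f "$work/dropflagged.awk" "$work/data.csv" "$work/data.csv")
printf '%s\n' "$result"

rm -rf "$work"
```

With the sample above this should print the two 180223 lines and the 180225 line, which matches the expected output posted earlier. Duplicates flagged N are left alone because only keys that have a Y row are dropped.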