How to delete a duplicate line and original with sed.

chino_1 · June 14, 2011, 10:47pm

I am completely new to shell scripting but have been assigned the task of creating several batch files to manipulate data. My final task requires me to find lines that have duplicates and then delete not only the duplicate but the original as well. The script will be used in a Windows environment, so I am using GNU sed. Below is a sample of the data:

180222,1,7.3,1Z0E947E0353634,9.49,UPAC
180223,1,7.3,1Z0E947E0373254,9.49,UPAC
180224,1,7.3,1Z0E947E0371556,8.33,UPAC
180222,1,7.3,1Z0E947E0353634,9.49,UPAC

In this example the first and last lines are duplicates and I would like to delete them both. I have been searching for several days and have not been able to figure out how to achieve this. Unfortunately I am short on time and would greatly appreciate any help possible. Thanks.

rdcwayx · June 14, 2011, 11:07pm

You need awk for this:

awk 'NR==FNR{a[$0]++;next} a[$0]<2' infile infile
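
To see what the one-liner does, here is a minimal sketch run against the sample data from the first post, assuming a POSIX shell. The scratch directory and the file name `sample.csv` are placeholders made up for illustration. The file is passed twice on purpose: while `NR==FNR` (the first read) each whole line is counted in the array `a`, and on the second read only lines with a count below 2 are printed.

```shell
#!/bin/sh
# Scratch directory and file name are illustrative assumptions.
workdir=$(mktemp -d)
cat > "$workdir/sample.csv" <<'EOF'
180222,1,7.3,1Z0E947E0353634,9.49,UPAC
180223,1,7.3,1Z0E947E0373254,9.49,UPAC
180224,1,7.3,1Z0E947E0371556,8.33,UPAC
180222,1,7.3,1Z0E947E0353634,9.49,UPAC
EOF

# Pass 1 (NR==FNR): count every whole line.
# Pass 2: print only the lines that were seen exactly once.
unique_lines=$(awk 'NR==FNR{a[$0]++;next} a[$0]<2' \
    "$workdir/sample.csv" "$workdir/sample.csv")
printf '%s\n' "$unique_lines"
```

With this input, both copies of the 180222 line are dropped and the 180223 and 180224 lines are kept in their original order.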

yazu · June 14, 2011, 11:12pm

If you can sort the data, then to show the duplicated lines:

sort | uniq -d < YOURFILE

To remove:

sort | uniq -u < YOURFILE

sort and uniq are available in MSYS, Cygwin, or the GNU utilities for Windows.

chino_1 · June 14, 2011, 11:51pm

Thanks for responding so quickly. rdcwayx, I downloaded and installed awk; however, my output file comes up blank. I used the target file in place of the first "infile" and the destination file in place of the second "infile". Did I misunderstand that part?

rdcwayx · June 15, 2011, 12:00am

No, the same file is read twice, so both "infile" arguments should be your input file.

I tested the code on Solaris and it still gives the right output.

Did you download the latest gawk version?

The sort and uniq approach looks simpler, but I had to change the command to:

sort YOURFILE |uniq -u
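
A small sketch of this corrected pipeline on the sample data from the first post, assuming a POSIX shell with `sort` and `uniq` on the PATH (for example under MSYS or Cygwin). The file name `sample.csv` is a placeholder. `uniq` only compares adjacent lines, which is why the file is sorted first; as a side effect the result comes back in sorted order rather than in the original order.

```shell
#!/bin/sh
# Scratch directory and file name are illustrative assumptions.
workdir=$(mktemp -d)
cat > "$workdir/sample.csv" <<'EOF'
180222,1,7.3,1Z0E947E0353634,9.49,UPAC
180223,1,7.3,1Z0E947E0373254,9.49,UPAC
180224,1,7.3,1Z0E947E0371556,8.33,UPAC
180222,1,7.3,1Z0E947E0353634,9.49,UPAC
EOF

# -d prints one copy of each line that occurs more than once.
duplicated=$(sort "$workdir/sample.csv" | uniq -d)
# -u prints only the lines that occur exactly once.
kept=$(sort "$workdir/sample.csv" | uniq -u)

printf 'duplicated:\n%s\n' "$duplicated"
printf 'kept:\n%s\n' "$kept"
```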

chino_1 · June 15, 2011, 12:06am

Thank you both. I will try again in the am and let you know how it goes.

yazu · June 15, 2011, 12:13am

Yes. I'm stupid. I tested it like this:

cat YOURFILE | sort | uniq -u

but then changed it when posting.

chino_1 · June 18, 2011, 12:19pm

Sorry it took so long. I have tried both techniques but I am unable to get them to work. I put the GNUWin directory path in the Windows PATH, and the version of gawk is from 10/2003. However, I have been informed of 2 changes in the file. Apparently there will be instances where a duplicate is acceptable, so an additional column has been added. If the last column in the line is Y, then that line and the original are to be removed.

"180222","1","8.5","1Z0E947E0355363450","9.49","UPAC","N"
"180223","1","7.3","1Z0E947E0357325461","6.53","UPAC","N"
"180225","1","3.9","1Z0E947E0355178080","4.77","UPAC","N"
"180223","1","7.3","1Z0E947E0357325461","6.53","UPAC","N"
"180222","1","8.5","1Z0E947E0355363450","9.49","UPAC","Y"

As you can see, because the last line has Y, it and the first line are to be removed, while lines 2 and 4 are to remain even though they duplicate each other. Thanks again.

I forgot to add that when I run either technique I don't receive errors of any type, only blank output in the file. Below are examples of how I set it up.

sort C:\Users\username\Desktop\TestingFile\myfile1.csv |uniq -u > C:\Users\username\Desktop\TestingFile\myfile2.csv

gawk "NR==FNR{a[$0]++;next} a[$0]<2" C:\Users\username\Desktop\TestingFile\myfile1.csv C:\Users\username\Desktop\TestingFile\myfile1.csv >> C:\Users\username\Desktop\TestingFile\myfile2.csv

rdcwayx · June 20, 2011, 12:26am

Use single quotes only around the awk command:

gawk 'NR==FNR{a[$0]++;next} a[$0]<2'

And you need to provide your expected output. I still don't really understand the requirement.
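
Quoting depends on the shell: a POSIX shell accepts the single-quoted form above, whereas the Windows command prompt does not treat single quotes as quoting characters. One way to sidestep the difference is to keep the awk program in its own file and load it with `-f`, so the program never has to be quoted on the command line. The sketch below shows the idea in a POSIX shell; the file names `dedupe.awk`, `sample.csv`, and `out.csv` are made up for illustration, and at a Windows prompt the script file would be saved from a text editor instead of a here-document.

```shell
#!/bin/sh
# Scratch directory and all file names are illustrative assumptions.
workdir=$(mktemp -d)

# The awk program lives in a file, so no shell quoting is involved.
cat > "$workdir/dedupe.awk" <<'EOF'
# Pass 1: count each whole line.
NR == FNR { a[$0]++; next }
# Pass 2: print the lines that were seen only once.
a[$0] < 2
EOF

cat > "$workdir/sample.csv" <<'EOF'
180222,1,7.3,1Z0E947E0353634,9.49,UPAC
180223,1,7.3,1Z0E947E0373254,9.49,UPAC
180224,1,7.3,1Z0E947E0371556,8.33,UPAC
180222,1,7.3,1Z0E947E0353634,9.49,UPAC
EOF

# Same input file twice; output goes to a different file.
awk -f "$workdir/dedupe.awk" "$workdir/sample.csv" "$workdir/sample.csv" \
    > "$workdir/out.csv"
cat "$workdir/out.csv"
```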

chino_1 · June 20, 2011, 9:07am

Sorry about that. I forgot to post the expected output:

"180223","1","7.3","1Z0E947E0357325461","6.53","UPAC","N"
"180225","1","3.9","1Z0E947E0355178080","4.77","UPAC","N"
"180223","1","7.3","1Z0E947E0357325461","6.53","UPAC","N"

Regarding the single quotes, I found that the MS-DOS prompt requires double quotes; otherwise I get the following error:

"The system cannot find the file specified."

I thought it was my path; however, it was correct.
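
For the updated requirement (remove any line whose last column is Y, together with the matching original), a two-pass awk sketch along the same lines as the earlier one-liner might look like this. It rests on several assumptions that the thread does not confirm: two lines "match" when every column except the last is identical, no field contains an embedded comma, and lines carry no trailing carriage return. The file name `flagged.csv` and the helper function `key` are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Scratch directory and file name are illustrative assumptions.
workdir=$(mktemp -d)
cat > "$workdir/flagged.csv" <<'EOF'
"180222","1","8.5","1Z0E947E0355363450","9.49","UPAC","N"
"180223","1","7.3","1Z0E947E0357325461","6.53","UPAC","N"
"180225","1","3.9","1Z0E947E0355178080","4.77","UPAC","N"
"180223","1","7.3","1Z0E947E0357325461","6.53","UPAC","N"
"180222","1","8.5","1Z0E947E0355363450","9.49","UPAC","Y"
EOF

remaining=$(awk -F, '
# Hypothetical helper: return the line without its last field.
function key(line) { sub(/,[^,]*$/, "", line); return line }
# Pass 1: remember the key of every line whose last field is "Y".
NR == FNR { if ($NF == "\"Y\"") flagged[key($0)] = 1; next }
# Pass 2: print only the lines whose key was never flagged.
{ k = key($0); if (!(k in flagged)) print }
' "$workdir/flagged.csv" "$workdir/flagged.csv")
printf '%s\n' "$remaining"
```

On the five sample lines this prints the three lines given as the expected output: the two 180223 lines and the 180225 line, in their original order.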