formatting a file and removing duplicates

kylle345 · October 5, 2011, 2:42pm

Hi,

I have a file that I want to change the format of. It is a large file laid out in rows, but I want it to be comma separated (a comma, then a space).

The current file looks like this:

HI, Joe, Bob, Jack, Jack

After that, I want to remove any duplicates so it looks like this:

HI, Joe, Bob, Jack

Thanks

P.S. I want to thank corona688 and scrutinizer for the help. I got as far as the comma-space part, but now I am having trouble with repeats. I want to remove anything that appears more than once, keeping only one copy.

Corona688 · October 5, 2011, 3:12pm

$ awk 'BEGIN { ORS=", " } !A[$1]++' < data
HI, Joe, Bob, Jack, 
$

First, we set ORS=", " so it prints ", " instead of \n for each "line".

Next is an expression telling awk when to print. If it were 1, it would print an item for each and every line of input. If it were 0, it would print nothing at all.
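To see ORS and the constant patterns in isolation, here is a small sketch. The three-name input is made up for illustration, and the variable names (input, all, none) are hypothetical.

```shell
# Hypothetical sample input: one name per line.
input=$(printf '%s\n' HI Joe Bob)

# Pattern 1 is always true, so every line is printed, each followed by ORS.
all=$(printf '%s\n' "$input" | awk 'BEGIN { ORS=", " } 1')

# Pattern 0 is always false, so nothing is printed.
none=$(printf '%s\n' "$input" | awk 'BEGIN { ORS=", " } 0')

# Brackets make the trailing ", " and the empty result visible.
printf '[%s]\n' "$all" "$none"
```

The first result keeps a trailing ", " because ORS is written after every record, including the last one.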

The expression !A[$1]++ tells it to print the entire line only when the line's first field ($1) hasn't been seen before. So the first time it looks up Jack, A["Jack"] is unset, which counts as zero, so it prints. The ++ then adds 1 to that entry. Next time, it sees A["Jack"] is 1 and skips printing it.
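Written out long-hand, the same logic might look like the sketch below. It should behave the same as !A[$1]++ on this input; the sample data and the variable names are assumptions for illustration.

```shell
# Hypothetical sample input: one name per line, with "Jack" repeated.
input=$(printf '%s\n' HI Joe Bob Jack Jack)

# Long-hand equivalent of '!A[$1]++': print a line only the first time
# its first field is seen, then bump the counter for that field.
result=$(printf '%s\n' "$input" | awk '{
    if (A[$1] == 0) print   # an unset array entry compares equal to 0
    A[$1]++                 # remember that this field has now been seen
}')

printf '%s\n' "$result"
```

The one-liner works because ++ is a post-increment: the old value is tested by ! first, and the count is raised afterwards.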

Scrutinizer · October 5, 2011, 3:17pm

Hi, try this:

awk '!A[$1]++' ORS=', ' infile | sed 's/, $/\n/'

(Same solution as above; the sed step replaces the trailing ", " with a newline.)
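If you would rather avoid the trailing ", " without a second tool, one possible sketch is to print the separator before each item instead of after it. A second variant covers the case where the data is already on one comma-separated line, as in the example in the question. Both are illustrations only: the sample data and the names sep, seen, out, result and deduped are hypothetical.

```shell
# Case 1: one name per line (hypothetical sample data).
input=$(printf '%s\n' HI Joe Bob Jack Jack)

# Print the separator BEFORE every item except the first,
# then finish with a single newline in the END block.
result=$(printf '%s\n' "$input" | awk '
    !A[$1]++ { printf "%s%s", sep, $0; sep = ", " }
    END      { print "" }
')
printf '%s\n' "$result"

# Case 2: the data is already one comma-separated line.
line='HI, Joe, Bob, Jack, Jack'

# Split on ", " and rebuild the line from fields not seen before.
# Note: seen[] is shared across all input lines, not reset per line.
deduped=$(printf '%s\n' "$line" | awk -F', ' '{
    out = ""
    for (i = 1; i <= NF; i++)
        if (!seen[$i]++)
            out = out (out == "" ? "" : ", ") $i
    print out
}')
printf '%s\n' "$deduped"
```

Either way, the first occurrence of each name is kept and the original order is preserved.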