Help with removing duplicate content and keeping only the first occurrence

Input

data_10 SSA
data_2 TYUE
data_3 PEOCV
data_6 SSAT
data_21 SSA
data_19 TYUEC
data_14 TYUE
data_15 SSA
data_32 PEOCV
.
.

Desired Output

data_10 SSA
data_2 TYUE
data_3 PEOCV
data_6 SSAT
data_19 TYUEC
.
.

From the above data, if the value in column two is the same (e.g. data_10, data_21, and data_15 all have SSA), I would like to keep only the line that appears first (e.g. keep data_10 SSA, and remove data_21 SSA and data_15 SSA).
Thanks.

cat input_file | cut -f2 | uniq | while read line
do
    grep "$line" input_file | head -1 >> output_file
done

awk '{if (!a[$2]) print; a[$2]++}' input_file
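For reference, a commented sketch of how that one-liner behaves (assuming the space-separated sample above): a[$2] is unset, hence false, the first time a column-two value is seen, so the line prints; the increment then makes it non-zero, so later lines with the same value are skipped.

# keep only the first line carrying each column-two value
awk '{ if (!a[$2]) print; a[$2]++ }' input_file

On the sample data this keeps data_10 SSA, data_2 TYUE, data_3 PEOCV, data_6 SSAT and data_19 TYUEC, matching the desired output.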


Hi ROHON,
I just tried it out.
It seems I can't get the desired output?
Thanks.

---------- Post updated at 05:14 AM ---------- Previous update was at 05:05 AM ----------

Thanks for your awk command.
It is able to remove the duplicate lines in column two successfully.
Unfortunately, the column-one details of those duplicates still seem to be kept in the data?

I didn't get you. If you are looking for a different output, please post the expected output.

cat input_file | cut -f2 | uniq | while read line
do
   grep " ${line}$" input_file | head -1 >> output_file
done
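A note on this version, assuming the space-separated sample above: anchoring the pattern with a leading space and a trailing $ stops a key like SSA from also matching SSAT, but cut still splits on its default tab delimiter (it would need -d' ' here), and uniq only collapses adjacent duplicates, so a key such as SSA that reappears further down is looked up again and the same first match is appended more than once. A minimal pure-shell sketch that avoids both issues while preserving first-occurrence order (assuming two space-separated fields per line and keys without embedded spaces):

while read -r name key
do
    case " $seen " in
        *" $key "*) ;;                        # key already seen: drop this line
        *)  printf '%s %s\n' "$name" "$key"   # first occurrence: keep it
            seen="$seen $key" ;;
    esac
done < input_file > output_file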

Hi singh,

I just edited my question.
Hopefully it is clearer now.
Thanks for your advice.

Or even:

awk '!_[$2]++' infile

To the OP: please elaborate more on how the output from anurag.singh's command is wrong.


I believe the command in post #3 is doing the same thing.
@radoulov, that's a shorter/better command.


Yes,
I just wanted to show that you can post-increment and test in a single expression :)
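For anyone reading along, a sketch of how that single expression evaluates (using a plain array name instead of the underscore; both behave identically, and infile stands for the sample input):

# the pattern is the whole program: when it is true, the default action prints the line
awk '!seen[$2]++' infile
# first line with a given column-two value: seen[$2] is 0, the post-increment returns the
# old value 0, !0 is true, and the line prints; every later line with that value sees a
# non-zero count, the negation is false, and the line is skipped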
