Check to identify duplicate values in first column of CSV file

Hello experts,

I have a requirement where I have to implement two checks on a CSV file:

  1. Check whether any value in the first column is duplicated; if a duplicate is found, the script should exit.

  2. Check that the value in the second column is either "yes" or "no"; if it is anything else, the script should exit.

My input file looks like:
hiring,no
system,yes
hiring,yes
quota,no

OS is Solaris.

I have been trying to implement my first requirement using awk but without any success. I tried this, but there is no output:

awk 'x[$1]++ == 1 { print $1 " is duplicated"}' FILENAME

awk'x[$1]++FS="," 

is not working either, since above file has hiring at two places script should come out.

Please advise.

You need to exit after the print.

$ cat input
hiring,no
system,yes
hiring,yes
quota,no
quota,maybe
$ sort input | awk 'NR == 1 {p=$1; next} p == $1 { print $1 " is duplicated"} {p=$1}' FS=","
hiring is duplicated
quota is duplicated
$ awk '$2 != "yes" && $2 != "no" { print $2 " on line " NR " is not yes/no"}' FS="," input
maybe on line 5 is not yes/no

In a shell script, save the output from each awk command to a file, and use [ -s file ] to determine whether to exit the script or not.
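
For example, a minimal sketch of that wrapper script, assuming the CSV is named input and that temporary files dups.txt and badvals.txt can be written to the current directory (both names are placeholders, adjust to suit):

#!/bin/sh
# Validate the CSV: exit on duplicate keys in column 1 or bad values in column 2.
INPUT=input

# Duplicate check on column 1: sort so repeated keys are adjacent, then compare neighbours.
sort "$INPUT" | awk 'NR == 1 {p=$1; next} p == $1 { print $1 " is duplicated"} {p=$1}' FS="," > dups.txt
if [ -s dups.txt ]; then
    cat dups.txt
    exit 1
fi

# yes/no check on column 2.
awk '$2 != "yes" && $2 != "no" { print $2 " on line " NR " is not yes/no"}' FS="," "$INPUT" > badvals.txt
if [ -s badvals.txt ]; then
    cat badvals.txt
    exit 1
fi

echo "All checks passed"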


Perform a pre-increment and check whether the count is greater than 1 to identify duplicates:

awk -F, ' ++A[$1] > 1 { print $1 "is duplicate"; exit 1 } ' file

Thank you to both of you, hanson44 and Yoda. I used both solutions in my script and they are working absolutely fine. Thank you again.