I have a .csv file with about 1000 rows and roughly 7 columns.
Before I insert this data into a table I have to parse and clean it based on the value of the first column, which is a phone-number string.
Here (111)222-3333 is considered a duplicate, and 2000 takes precedence over 1000,
so I have to remove the row with the values (111)222-3333 1000. How do I achieve this?
Any help is greatly appreciated.
I cannot manipulate the Excel file.
It comes from a third party, and we have to run a batch file to handle the data they send before inserting it into our DB.
---------- Post updated at 09:01 AM ---------- Previous update was at 08:57 AM ----------
I tried something like this:
awk '
{ s[$1]++ }
END {
    for (i in s) {
        if (s[i] > 1) {
            print i
        }
    }
}'
It wouldn't work: it treats only "(111)" as the duplicate key, not the whole number.
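For reference, here is my assumption about why $1 alone fails: awk splits on whitespace by default, so if the file writes the number as "(111) 222-3333" (with a space after the area code), it lands in two fields and the amount becomes the third:

```shell
# default FS is whitespace, so one "value" can span several fields
echo '(111) 222-3333 1000' | awk '{print "f1=" $1; print "f2=" $2; print "f3=" $3}'
# prints:
# f1=(111)
# f2=222-3333
# f3=1000
```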
So I changed it to this:
awk '
{ s[$1 $2 "-" $3]++ }
END {
    for (i in s) {
        if (s[i] > 1) {
            print i
        }
    }
}'
Still no help; it behaves as if I had written:
awk '
{ s[$0]++ }
END {
    for (i in s) {
        if (s[i] > 1) {
            print i
        }
    }
}'
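A sketch of a one-pass approach that might do the whole job, assuming the phone number spans the first two whitespace-separated fields (e.g. "(111) 222-3333") and the value to compare (1000 vs 2000) is the third field. The sample.txt file and the field positions are my assumptions and may need adjusting for the real file:

```shell
# hypothetical sample rows in the assumed layout:
# phone in fields 1-2, amount in field 3
printf '%s\n' \
  '(111) 222-3333 1000' \
  '(111) 222-3333 2000' \
  '(444) 555-6666 1500' > sample.txt

# one pass: for each phone number, keep the row with the largest amount
awk '
{
    key = $1 " " $2                        # phone number spans fields 1-2
    if (!(key in best) || $3 + 0 > best[key] + 0) {
        best[key] = $3                     # highest amount seen so far
        row[key]  = $0                     # full row that carried it
    }
}
END {
    for (k in row) print row[k]
}' sample.txt
```

The `$3 + 0` coercion forces a numeric comparison. Note that `for (k in row)` does not guarantee output order; pipe the result through `sort` if the load step needs the rows ordered.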