Uniq code in sorted order

irfanmemon · November 19, 2012, 10:07pm

Hi All,
A small question:

I have a file check.txt which looks like below:

KRPROD2,2012-11-20 08:46:50:408,ODK3325688102
KRPROD2,2012-11-20 08:47:35:289,ODK3325688102
KRPROD2,2012-11-20 08:47:35:446,ODK3325688102
KRPROD2,2012-11-20 08:48:32:973,ODK3325689120
KRPROD2,2012-11-20 09:29:17:833,ODK3325689120
KRPROD2,2012-11-20 09:29:17:912,ODK3325689120

I want to get below output

KRPROD2,2012-11-20 08:46:50:408,ODK3325688102
KRPROD2,2012-11-20 08:48:32:973,ODK3325689120

which is the first instance of each field 3 value in the file. Please help.

elixir_sinari · November 19, 2012, 10:09pm

awk -F, '!c[$3]++' file

irfanmemon · November 19, 2012, 10:16pm

Thanks so much...it works.
Can you please explain it, as I am a newbie in Unix?

Don_Cragun · November 19, 2012, 10:45pm

awk -F, '!c[$3]++' file

The -F, sets the field separator to a <comma>.
The c[$3] is an element of an array indexed by the contents of field 3. An uninitialized variable in awk has the value 0 or the empty string, depending on context.
The first time it is evaluated, !c[$3], i.e. (NOT)(value of c[$3]), is (NOT)zero, which evaluates to TRUE. When the value is TRUE, the action associated with this awk statement is performed. Since there is no action given, the default action (print the line) is performed.
Then the ++ increments the value of c[$3] so that when this test is performed again on lines with the same 3rd field, the test !c[$3] will evaluate to FALSE because (NOT)(positive integer value) evaluates to zero. When the test evaluates to FALSE, the action associated with this line is not performed so subsequent lines with the same value in the 3rd field are not printed.
This test is performed once for each line in the file in order from beginning to end.
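
To make the steps above concrete, here is a long-hand sketch of what the one-liner does, run against the sample data from the first post. The function name sample_data and the variable result are only for illustration; the awk logic is intended to behave the same as '!c[$3]++'.

```shell
# Hypothetical helper that prints the sample lines from the first post.
sample_data() {
cat <<'EOF'
KRPROD2,2012-11-20 08:46:50:408,ODK3325688102
KRPROD2,2012-11-20 08:47:35:289,ODK3325688102
KRPROD2,2012-11-20 08:47:35:446,ODK3325688102
KRPROD2,2012-11-20 08:48:32:973,ODK3325689120
KRPROD2,2012-11-20 09:29:17:833,ODK3325689120
KRPROD2,2012-11-20 09:29:17:912,ODK3325689120
EOF
}

# Long-hand version of: awk -F, '!c[$3]++'
result=$(sample_data | awk -F, '
{
    if (c[$3] == 0) {   # uninitialized element: first time this value is seen
        print           # same as the default action, print the whole line
    }
    c[$3]++             # count it, so later lines with this value are skipped
}')

printf '%s\n' "$result"
```

This should print only the 08:46:50:408 and 08:48:32:973 lines, matching the output requested in the first post.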

irfanmemon · November 19, 2012, 10:48pm

Thanks for detailed explanation

But when I use the command below, it gives me an error:

ssh ${fmsServerUserName}@${fmsServerName} "awk -F, '!c[$3]++' 10_FMS_CRXtoFMS.csv" >> 10_FMS_CRXtoFMS.csv

Error:
awk: !c[]++
awk:    ^ syntax error
awk: fatal: invalid subscript expression

Please help

subramanian · November 19, 2012, 11:13pm

Hi
escape ! and $ chars. it will work.

ssh ${fmsServerUserName}@${fmsServerName} "awk -F, '\!c[\$3]++' 10_FMS_CRXtoFMS.csv" >> 10_FMS_CRXtoFMS.csv

Don_Cragun · November 19, 2012, 11:27pm

By putting the !c[$3]++ in a double-quoted string (i.e., "awk -F, '!c[$3]++' 10_FMS_CRXtoFMS.csv" ), you had the shell expand $3 instead of awk. Given the error message, it would appear that the shell you were executing when you invoked this command either didn't have three positional parameters or the third positional parameter at that time was an empty string.

Then when you invoked awk, the script that it was given was '!c[]++' and you got the syntax error from awk.

I don't use ssh much, but adding a backslash escape before the $3, as in the following, should get rid of the syntax error for you:

ssh ${fmsServerUserName}@${fmsServerName} "awk -F, '!c[\$3]++' 10_FMS_CRXtoFMS.csv" >> 10_FMS_CRXtoFMS.csv

It isn't clear to me whether the two references to 10_FMS_CRXtoFMS.csv are two references to the same file or references to different files with the same name on different servers. If you are appending output to a file that you're using for input, bad things are very likely to happen.
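
The quoting effect described above can be checked locally without ssh. This is only a sketch: the variable names unescaped and escaped are made up for the demonstration, and set -- is used to mimic a shell whose third positional parameter is empty.

```shell
# Mimic a shell whose third positional parameter is an empty string.
set -- one two ''

# Inside double quotes the local shell expands $3 before awk ever sees it;
# the single quotes are just ordinary characters here.
unescaped="awk -F, '!c[$3]++' file"

# A backslash before the $ keeps $3 literal, leaving it for awk to interpret.
escaped="awk -F, '!c[\$3]++' file"

printf '%s\n' "$unescaped"
printf '%s\n' "$escaped"
```

When run as a script, the first line printed should contain '!c[]++' (the broken program from the error message) and the second should contain '!c[$3]++'.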

irfanmemon · November 19, 2012, 11:41pm

Thanks Subramanian

But I am still getting an error:

awk: \!c[$3]++
awk: ^ backslash not last character on line

---------- Post updated at 11:41 PM ---------- Previous update was at 11:28 PM ----------

Hi Don,

It's the same file. I want to remove the duplicates from the file and keep the result in the same file, 10_FMS_CRXtoFMS.csv.

Is there something wrong, or how can I do this?

Don_Cragun · November 20, 2012, 2:20am

OK. Let's forget about the ssh complications and go back to basics. Essentially, you have the command:

awk -F, '!c[$3]++' 10_FMS_CRXtoFMS.csv >> 10_FMS_CRXtoFMS.csv

This reads the file 10_FMS_CRXtoFMS.csv and adds the lines in that file that had different values in the 3rd field to the end of the file. It does not throw away the original contents of the file.

If you change the command to:

awk -F, '!c[$3]++' 10_FMS_CRXtoFMS.csv > 10_FMS_CRXtoFMS.csv

you will empty the file named 10_FMS_CRXtoFMS.csv and then add any unique 3rd column values to the file (but since you emptied the file before calling awk, there aren't any lines in the file and you end up with an empty file).
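
The truncation can be seen safely in a scratch directory. This is a small sketch with made-up data; it assumes mktemp is available, and the names scratch, data.csv and size_after are only for the demonstration.

```shell
# Work in a throwaway directory so no real file is harmed.
scratch=$(mktemp -d)
printf '%s\n' 'a,1,X' 'a,2,X' 'a,3,Y' > "$scratch/data.csv"

# The shell sets up the > redirection (truncating data.csv) before awk starts,
# so awk opens an already empty file and has nothing to print.
awk -F, '!c[$3]++' "$scratch/data.csv" > "$scratch/data.csv"

# Count the bytes left in the file.
size_after=$(wc -c < "$scratch/data.csv" | tr -d ' ')
echo "bytes left in data.csv: $size_after"

rm -rf "$scratch"
```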

Even if it did do what you thought it was doing, you still wouldn't want to do that. If your awk script fails for some reason, you will destroy your input file and have no backup. The safer way to handle something like this is:

awk -F, '!c[$3]++' 10_FMS_CRXtoFMS.csv > tmp$$.csv && mv tmp$$.csv 10_FMS_CRXtoFMS.csv

This writes the results to a temporary file and then moves the temporary file back to your original file's name if and only if awk completed successfully. (If awk fails, you will have the diagnostic messages awk prints, your unchanged input file, and the results awk produced before it failed in the temp file to debug the problem and fix it without losing any data. Using $$ in the file name allows you to use the script to concurrently process other files without them interfering with each other. In POSIX-conforming shells, $$ expands to the process ID of the shell creating the file.)
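
The same write-then-rename idea can be wrapped in a small function. This is a sketch rather than a drop-in replacement: the function name dedupe_by_field3 is hypothetical, and it assumes mktemp is installed (it is widely available but not required by POSIX). It uses mktemp instead of $$ to pick the temporary name, and unlike the command above it removes the temporary file when awk fails instead of keeping it for debugging.

```shell
# Hypothetical helper: keep the first line for each field 3 value and
# replace the input file only if awk succeeded.
dedupe_by_field3() {
    file=$1
    # Assumption: mktemp exists. The temp file is created next to the input
    # so that mv is a rename within the same filesystem.
    tmp=$(mktemp "$file.XXXXXX") || return 1
    if awk -F, '!c[$3]++' "$file" > "$tmp"; then
        mv "$tmp" "$file"
    else
        rm -f "$tmp"    # awk failed: leave the original file untouched
        return 1
    fi
}
```

It would be called as dedupe_by_field3 10_FMS_CRXtoFMS.csv. Note that the resulting file is the one mktemp created, so its permissions and ownership may differ from those of the original.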

There are other issues to consider (and other ways to do this safely) if your input file has multiple hard links, but I'm assuming that isn't an issue for now.