Remove duplicate lines, sort, and save back to the file itself

Hi, all

I have a csv file from which I would like to remove duplicate lines based on the 1st field, and then sort the result by the 1st field. If more than one line has the same value in the 1st field, I want to keep the first of those lines and remove the rest. I think I have to use uniq or something similar, but I still have no idea how to do it. And when I tried to use head and tail to sort, it didn't work in my script; I just don't know why.

Here is my file:

SourceFile,Airspeed,GPSLatitude,GPSLongitude,Temperature,Pressure,Altitude,Roll,Pitch,Yaw
/home/intannf/foto5/2015_0313_090651_219.JPG,0.,-7.77223,110.37310,30.75,996.46,148.75,180.94,182.00,63.92
/home/intannf/foto5/2015_0313_085929_083.JPG,0.,-7.77224,110.37312,30.73,996.46,148.76,181.00,181.95,63.96
/home/intannf/foto5/2015_0313_090323_155.JPG,0.,-7.77224,110.37312,30.73,996.46,148.76,181.01,181.92,63.82
/home/intannf/foto5/2015_0313_085929_083.JPG,0.,-7.77224,110.37312,30.73,996.46,148.76,181.03,181.98,63.73 -->remove this duplicate
/home/intannf/foto5/2015_0313_085929_083.JPG,0.,-7.77224,110.37312,30.73,996.46,148.75,181.06,182.09,63.64 -->remove this duplicate
/home/intannf/foto5/2015_0313_085929_083.JPG,0.,-7.77224,110.37312,30.73,996.46,148.75,181.14,182.08,63.63 -->remove this duplicate
/home/intannf/foto5/2015_0313_090142_124.JPG,0.,-7.77224,110.37312,30.73,996.46,148.75,181.13,182.06,63.87
/home/intannf/foto5/2015_0313_085929_083.JPG,0.,-7.77224,110.37312,30.72,996.46,148.75,181.20,182.08,63.91 -->remove this duplicate
/home/intannf/foto5/2015_0313_090710_225.JPG,0.,-7.77224,110.37312,30.72,996.46,148.75,181.19,182.10,63.68
/home/intannf/foto5/2015_0313_090710_225.JPG,0.,-7.77224,110.37312,30.72,996.46,148.76,181.25,182.09,63.36 -->remove this duplicate
/home/intannf/foto5/2015_0313_090628_212.JPG,0.,-7.77223,110.37310,30.72,996.47,148.67,181.09,181.91,63.87
/home/intannf/foto5/2015_0313_085942_087.JPG,0.,-7.77219,110.37317,30.76,996.47,148.71,181.12,182.17,63.78
/home/intannf/foto5/2015_0313_090717_227.JPG,0.,-7.77217,110.37315,30.77,996.48,148.66,181.06,182.21,63.87
And this is the output I want:

SourceFile,Airspeed,GPSLatitude,GPSLongitude,Temperature,Pressure,Altitude,Roll,Pitch,Yaw
/home/intannf/foto5/2015_0313_085929_083.JPG,0.,-7.77224,110.37312,30.73,996.46,148.76,181.00,181.95,63.96
/home/intannf/foto5/2015_0313_085942_087.JPG,0.,-7.77219,110.37317,30.76,996.47,148.71,181.12,182.17,63.78
/home/intannf/foto5/2015_0313_090142_124.JPG,0.,-7.77224,110.37312,30.73,996.46,148.75,181.13,182.06,63.87
/home/intannf/foto5/2015_0313_090323_155.JPG,0.,-7.77224,110.37312,30.73,996.46,148.76,181.01,181.92,63.82
/home/intannf/foto5/2015_0313_090628_212.JPG,0.,-7.77223,110.37310,30.72,996.47,148.67,181.09,181.91,63.87
/home/intannf/foto5/2015_0313_090651_219.JPG,0.,-7.77223,110.37310,30.75,996.46,148.75,180.94,182.00,63.92
/home/intannf/foto5/2015_0313_090710_225.JPG,0.,-7.77224,110.37312,30.72,996.46,148.75,181.19,182.10,63.68
/home/intannf/foto5/2015_0313_090717_227.JPG,0.,-7.77217,110.37315,30.77,996.48,148.66,181.06,182.21,63.87

Please help me to figure it out. Thanks in advance.

Regards,
Intan

We would know the failure reasons even less than you do, as we don't see your results as you see them. Why shouldn't head and tail work for you? Did you run them individually, trying to figure out how they work and how they cooperate to give the results you need?

man sort would show you the -u option to keep only unique key values, although it is not guaranteed that those will be the respective first line that occurred. You'd need a compound statement with awk and sort to get what you want.
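Spelled out, such a compound might look like this (a sketch, not tested against your data; file.csv stands in for your input, with the header on line 1):

```shell
# Keep the header; in the body, keep only the first line for each
# value of field 1 (input order decides "first"), then sort by field 1.
(head -n 1 file.csv && tail -n +2 file.csv | awk -F, '!seen[$1]++' | sort -t, -k1,1) > sorted.csv
```

Because the awk filter runs before sort, the surviving line per key is the first one from the input, which sort -u cannot promise.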

Hi RudiC,

When I run those scripts individually, as Don Cragun wrote them, they work well. But when I try to put them in my whole script, like this:

..... (matching fields script using awk)
}' $first.csv $second.csv > $result.csv
(head -1 $result.csv && tail -n+2 $result.csv | sort) > debug.csv && cp debug.csv result.csv; rm -f debug.csv
..... 

Assume that before the code above, I have input the files for $first and $second and defined a filename for $result.
Do you have another way to figure this out?

How about removing the duplicate lines? I have tried using this code, but I think something's missing.

sort -u -t, -k1 file

Thanks in advance.

Regards,
Intan

Did you try to export the result variable, since you are running the head ... sort in a subshell that does not inherit variables by default?
The key -k1 uses field 1 through the end of the line as the key. Try -k1,1 instead.
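The difference shows up readily on a made-up three-line input:

```shell
# With -k1 the key runs from field 1 to the end of the line, so all three
# lines have distinct keys and -u removes nothing.
printf 'a,2\na,1\nb,3\n' | sort -t, -u -k1     # all 3 lines survive
# With -k1,1 the key is field 1 alone, so the two "a" lines share a key
# and -u keeps only one of them.
printf 'a,2\na,1\nb,3\n' | sort -t, -u -k1,1   # 2 lines survive
```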

Note: sort -t, -u -k1,1 will work with some sort implementations, but not every sort works that way.

An alternative is to use awk together with sort (without the -u option), as RudiC suggested:

awk -F, '!A[$1]++' | sort ...
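For reference, !A[$1]++ prints a line only the first time its field 1 value is seen: A[$1] starts at 0 (false), so the negation is true and the line prints; the post-increment then marks that key as seen. On made-up input:

```shell
printf 'a,1\na,2\nb,3\n' | awk -F, '!A[$1]++'
# a,1
# b,3
```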

In addition to what Scrutinizer and RudiC have already pointed out...

There is a HUGE difference between $result.csv and result.csv unless somewhere in your script you also had a shell assignment statement like:
result=result

If you want our help debugging your script, you need to show us your script! (Not just bits and pieces that work fine when you run them separately, but don't work in your whole script.) Most of us aren't very good at guessing what:

..... (matching fields script using awk)

and:

..... 

actually expand to in your script, but it obviously makes a huge difference in what your script will do.

Are you trying to remove all but the 1st line in your file for each unique field one value, and hoping that sort -u will do that for you? Or, are you trying to remove all but the 1st line in your file for each unique field one value and also need to sort the output?

What operating system (including release numbers) and shell are you using? Do you need to be able to run this script only on that operating system, or are you trying to write portable code that will work on any UNIX or Linux system?

If you have a problem to solve, stop and think about what the problem is. Describe the entire problem. Describe your inputs. Describe your desired outputs.

Piecewise refinement is great when you've got a big problem to solve, but if you don't know your end target when you start, a lot of those pieces may be wasted since they won't lead to your final goal.

Please help us help you. Tell us in detail, about your inputs, your outputs, and the code you've tried to get to your goal.

Hi, all

Finally I have figured out how to deal with this problem. I have edited Don Cragun's script. This is my script, and it works well within my whole script.

(head -1 $result && tail -n+2 $result | sort) > $$.csv && cp $$.csv $result.csv; rm -f $$.csv; rm -f $result

After sorting on the field, I remove the duplicate lines. I used the script Scrutinizer suggested to me. Here's my script.

awk -F, '!A[$1]++' $result.csv > $$.csv && cp $$.csv $result.csv; rm -f $$.csv

Both scripts work well with my whole script. Thank you so much for helping me!

But I need your suggestion: can I use both of those scripts as one (merge them into a single script)? How would I do that? Thanks in advance.

Regards,
Intan

As we said before, if it is important to choose the 1st line in your input file from among lines with the same 1st field, sort -u is not guaranteed to do that. And sort and sort -k1,1 are not guaranteed to keep lines with the same first field in the same order in the output file as they appeared in the input file. So, sorting first and then using awk to choose the first of the lines with the same 1st field won't work either.

And, in your earlier statements, you said you wanted the output stored in your input file; but the code you now say works doesn't do that. Instead, it uses an input file named by the expansion of the shell variable $result, stores the sorted results in a file with the extension .csv added to the end of the input file's name, and, whether or not processing was successful, removes the input file.

Assuming that your input file is specified by $result and you want the output stored in that same file if processing is successful (and the original file left unchanged if there is an error), you might try something like:

#!/bin/ksh
result="file"
awk -F, '
NR == 1 {
	print
	next
}
!A[$1]++ {
	print | "sort"
}' "$result" > "$result.$$" && cp "$result.$$" "$result"; rm -f "$result.$$"

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk . Although written and tested using the Korn shell, this script should work with any shell that uses basic Bourne shell syntax.
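As a quick smoke test of the same pattern on throwaway inline data (not your real file):

```shell
printf 'HDR\nb,2\na,1\nb,3\n' | awk -F, '
NR == 1 { print; next }      # the header bypasses the sort
!A[$1]++ { print | "sort" }  # first line per key goes to the child sort
'
```

The duplicate b line is dropped, and the remaining body lines come back sorted by the child sort process.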

If you want awk to do the parsing and sorting, try the script below:

awk -F\, '{
    if (NR == 1) print
    else if (!f[$1]++) x[++i] = $0
} END {
    for (j=1; j<i; j++)
        for (k=1; k<(i-j+1); k++)
            if (x[k] > x[k+1]) {
                t = x[k]
                x[k] = x[k+1]
                x[k+1] = t
            }
    for (k=1; k<=i; k++)
        print x[k]
}' file
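If GNU awk happens to be available (an assumption; asort() is a gawk extension, not POSIX awk), a built-in can replace the hand-rolled bubble sort:

```shell
gawk -F, '
NR == 1  { print; next }    # pass the header through untouched
!f[$1]++ { x[++i] = $0 }    # keep the first line per field-1 value
END {
    n = asort(x)            # gawk built-in: sorts the saved lines in place
    for (k = 1; k <= n; k++)
        print x[k]
}' file
```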