Sort, sed, and zero padding date column csv bash scripting

sean1357 · July 20, 2013, 11:43am

Hello people,

I am having problem to sort, sed and zero padding of column in csv file.

7th column only.

Input of csv file:

1,2,3,4,5,6,4/1/2010 12:00 AM,8
1,2,3,4,5,6,3/11/2010 9:39 AM,8
1,2,3,4,5,6,5/12/2011 3:43 PM,8
1,2,3,4,5,6,12/20/2009 7:23 PM,8

Output:

1,2,3,4,5,6,12/20/2009,8
1,2,3,4,5,6,03/11/2010,8
1,2,3,4,5,6,04/01/2010,8
1,2,3,4,5,6,05/12/2011,8

Very thank you for you help.

Yoda · July 20, 2013, 1:07pm

How about using awk and sort?

awk -F, '
        {
                sub( / .*/, X, $7 )
                split ( $7, V, "/" )
                $7 = sprintf ( "%02d/%02d/%d", V[1], V[2], V[3] )
                print
        }
' OFS=, file | sort -t"," -k 7.9 -k 7.5 -k 7

sean1357 · July 20, 2013, 1:43pm

It works. Thank you very muh for your help. Now, I having another problem how to remove row from range date.

range from 04/01/2010 - 05/04/2013, but the problem is that 04/01/2010 or 05/04/2013 maybe doesn't existed in csv file. Is the a way that you just compare date of input?

Example:

1,2,3,4,5,6,03/02/2009,8   -> need to remove column
1,2,3,4,5,6,08/03/2009,8   -> need to remove column
1,2,3,4,5,6,04/01/2010,8   -> keep from here
....
....
1,2,3,4,5,6,05/01/2013,8   -> keep to end
1,2,3,4,5,6,05/11/2013,9  -> need to remove from this point

Yoda · July 20, 2013, 3:01pm

What is your SHELL and OS ?

sean1357 · July 20, 2013, 3:03pm

Thanks. I'm using Cygwin bash script on Windows 7. Is this will be a problem?

Don_Cragun · July 20, 2013, 3:08pm

yoda:

How about using awk and sort?

awk -F, '
   {
   sub( / .*/, X, $7 )
   split ( $7, V, "/" )
   $7 = sprintf ( "%02d/%02d/%d", V[1], V[2], V[3] )
   print
   }
' OFS=, file | sort -t"," -k 7.9 -k 7.5 -k 7

I'm a little bit confused by this. Yoda's code is usually right on, but the sort keys used here make the last two digits of the year to the end of the line the primary sort key, the last digit of the day to the end of the line the secondary sort key, and the start of the date field to the end of the line the tertiary sort key. (This seems to have worked with the sample data only because the 8th field in all of the input lines was identical and the last digit of the day was a constant on the two input lines with the same year.) I think the sort he intended to perform on the awk output was:

sort -t"," -k 7.7,7 -k 7.1,7.5

which makes the entire year the primary sort key and the month and day together as the secondary sort key.

With the new requirement to exclude a date range from the output, you could try something like:

awk -F, -v sd="04/01/2010" -v ed="05/04/2013" '
BEGIN { ss = substr(sd, 7) substr(sd,1,2) substr(sd,4,2)
        es = substr(ed, 7) substr(ed,1,2) substr(ed,4,2)
}
{       sub(/ .*/, X, $7)
        split($7, V, "/")
        cs = sprintf("%d%02d%02d", V[3], V[1], V[2])
        if(cs >= ss && cs <= es) next
        $7 = sprintf("%02d/%02d/%d", V[1], V[2], V[3])
        print
}' OFS=, input.csv | sort -t"," -k 7.7,7 -k 7.1,7.5

If you are using a Solaris/SunOS system, use /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk instead of /usr/bin/awk or /bin/awk .

With the input file:

1,2,3,4,5,6,4/1/2010 12:00 AM,1
1,2,3,4,5,6,3/11/2010 9:39 AM,2
1,2,3,4,5,6,5/12/2011 3:43 PM,3
1,2,3,4,5,6,12/20/2009 7:23 PM,4
1,2,3,4,5,6,03/02/2009 1:23 AM,5
1,2,3,4,5,6,08/03/2009 2:34 AM,6
1,2,3,4,5,6,04/01/2010 3:45 AM,7
1,2,3,4,5,6,05/01/2013 4:56 AM,8
1,2,3,4,5,6,05/11/2013 12:03 PM,9

the output produced is:

1,2,3,4,5,6,03/02/2009,5
1,2,3,4,5,6,08/03/2009,6
1,2,3,4,5,6,12/20/2009,4
1,2,3,4,5,6,03/11/2010,2
1,2,3,4,5,6,05/11/2013,9

which drops the lines shown in red in the input because they are in the date exclusion range.