Generate tabular data based on a column value from an existing data file

himanish · January 17, 2013, 10:27pm

Hi,

I have a data file with :

01/28/2012,1,1,98995
01/28/2012,1,2,7195
01/29/2012,1,1,98995
01/29/2012,1,2,7195
01/30/2012,1,1,98896
01/30/2012,1,2,7083
01/31/2012,1,1,98896
01/31/2012,1,2,7083
02/01/2012,1,1,98896
02/01/2012,1,2,7083
02/02/2012,1,1,98899
02/02/2012,1,2,7083

I need to find the first occurring date and the last occurring date for a particular value in column 4 for each combination of column 2 and 3

For example the output from above data should be:

1,1,01/28/2012,01/29/2012,98955
1,1,01/30/2012,02/01/2012,98896
1,1,02/02/2012,99/99/9999,98899
1,2,01/28/2012,01/29/2012,7195
1,2,01/30/2012,99/99/9999,7083

because the value (column 4)=98955 for (column 2)=1 and (column 3)=1 first occurs on date 01/28/2012 and last occurs on date 01/29/2012, after which on 01/30/2012 it changes to 98896,so a new row is added.When a value is the last value in the file, the last occurance date is 99/99/9999.

Not sure whether i could make it clear but is unix shell scripting(with awk maybe) the best way to achieve this or does this need something like perl ?

Don_Cragun · January 18, 2013, 6:51pm

himanish:

Hi,

I have a data file with :
01/28/2012,1,1,98995
01/28/2012,1,2,7195
01/29/2012,1,1,98995
01/29/2012,1,2,7195
01/30/2012,1,1,98896
01/30/2012,1,2,7083
01/31/2012,1,1,98896
01/31/2012,1,2,7083
02/01/2012,1,1,98896
02/01/2012,1,2,7083
02/02/2012,1,1,98899
02/02/2012,1,2,7083
I need to find the first occurring date and the last occurring date for a particular value in column 4 for each combination of column 2 and 3

For example the output from above data should be:
1,1,01/28/2012,01/29/2012,98955
1,1,01/30/2012,02/01/2012,98896
1,1,02/02/2012,99/99/9999,98899
1,2,01/28/2012,01/29/2012,7195
1,2,01/30/2012,99/99/9999,7083
because the value (column 4)=98955 for (column 2)=1 and (column 3)=1 first occurs on date 01/28/2012 and last occurs on date 01/29/2012, after which on 01/30/2012 it changes to 98896,so a new row is added.When a value is the last value in the file, the last occurance date is 99/99/9999.

Not sure whether i could make it clear but is unix shell scripting(with awk maybe) the best way to achieve this or does this need something like perl ?

The first line of your expected output ends with 98955 , but that value never appears in your data file. I assume this was intended to be 98995 which does appear in your data file with corresponding dates.

I don't understand how you determined the output order of lines in your example output. The following script produces the same data as your expected output (except with 98955 replaced by 98995 ) but the lines are in a different order:

awk '
# cd -- Compare dates
# Usage: cd(date1, date2)
# DESCRIPTION:
# Compare two dates that are strings of the form "MM/DD/YYYY".
# RETURN VALUES:
#       < 0     date1 comes before date2
#       0       date1 == date2
#       > 0     date1 comes after date2
function cd(d1, d2, LOCAL, v1, v2) {
        if(d1 == d2) return 0
        split(d1, v1, "/")
        split(d2, v2, "/")
        return v1[3] v1[1] v1[2] < v2[3] v2[1] v2[2] ? -1 : 1
}
BEGIN { FS = OFS = ","
        ed = "99/99/9999"
}
{       k = $2 FS $3 FS $4
        if(c[k]++ == 0) { 
                d1[k] = d2[k] = $1
                f[k] = $2 FS $3
                e[k] = $4
        } else {if(cd($1, d1[k]) < 0) d1[k] = $1
                if(cd($1, d2[k]) > 0) d2[k] = $1
        }
}
END {   d2[k] = ed
        for(i in c) {
                if(c == 1) d2 = ed
                print f, d1, d2, e
        }
}' data

Note that your input data was sorted by date, but this script will correctly process data presented in any order. The only entry that matters is that the end date of the last entry input is treated specially in the output. If the input data is always sorted by date, this script can be greatly simplified. If you need a specific output order, please describe the order and I'll try again.

As always, if you are using a Solaris/Sun OS system, use /usr/xpg4/bin/awk or nawk, instead of awk.