Yearly Grouping of Data

I need some logic that would help to group up some records that fall between two dates:

Input Data

COL_1 COL_2 COL_3 COL_4 COL_5 COL_6 COL_7      COL_8      COL_9    COL_10 COL_11 COL_12
C     ABC   ABCD  3     ZZ    WLOA  2015-12-01 2015-12-15 975.73   ZZZ    P      147018.64
C     ABC   ABCD  3     ZZ    WLOA  2015-12-01 2016-01-31 898.86   ZZZ    P      129018.66
C     ABC   ABCD  3     ZZ    WLOA  2015-12-01 2016-02-29 788.81   ZZZ    P      110912.18
C     ABC   ABCD  3     ZZ    WLOA  2016-12-10 2017-02-29 18106.48 ZZZ    P      110912.18
C     ABC   ABCD  3     ZZ    WLOA  2016-12-10 2017-05-31 1652.2   ZZZ    P      55947.43
C     ABC   ABCD  3     ZZ    WLOA  2016-12-10 2017-08-31 650.05   ZZZ    P      45500.00
C     ABC   ABCD  3     ZZ    WLOA  2016-12-10 2017-09-20 500.15   ZZZ    P      37525.00
C     ABC   ABCD  3     ZZ    WLOA  2016-12-10 2017-10-01 357.05   ZZZ    P      12385.00

We will pass VAR_DATE as a parameter to the script, which we will then use to work out the grouping from this value.

For each value in COL_4 (Grouping Column) we need to group up the records that are within a certain date range.

In the above example therefore, we would like to group records with the below logic:

COL_7 > VAR_DATE AND COL_8 <= VAR_DATE + 1 Year

Example Output (Year 1)

C  ABC  ABCD 3  ZZ  WLOA VAR_DATE   VAR_DATE + 1 YEAR SUM ALL VALUES ZZZ  P FINAL RECORD IN COL_12 BALANCE FOR GIVEN ID (COL_4)
C  ABC  ABCD 3  ZZ  WLOA 2015-12-01 2016-12-01   2663.40             ZZZ  P 110912.18

We then need to group up the values for the next year (Year 2)

We would then like to repeat the same, but for year 2, using the below logic:

COL_7 > VAR_DATE + 1 Year AND COL_8 <= VAR_DATE + 2 Year

Example Output (Year 2)

C  ABC  ABCD 3  ZZ  WLOA VAR_DATE + 1 YEAR VAR_DATE + 2 YEAR SUM ALL VALUES ZZZ  P FINAL RECORD IN COL_12 BALANCE FOR GIVEN ID (COL_4)
C  ABC  ABCD 3  ZZ  WLOA 2016-12-01   2017-12-01             21265.93       ZZZ  P 12385.00

We will need to keep doing this logic up to 5 years / groups but this amount of years could change so ideally the loop/amount of groups required needs to be dynamic/parameterised.

The final output would look like below, with the two new records generated from the above logic, appended to the end of the file:

COL_1 COL_2 COL_3 COL_4 COL_5 COL_6 COL_7 COL_8 COL_9 COL_10 COL_11 COL_12
C ABC ABCD 3 ZZ WLOA 2015-12-01 2015-12-15 975.73 ZZZ P 147018.64
C ABC ABCD 3 ZZ WLOA 2015-12-01 2016-01-31 898.86 ZZZ P 129018.66
C ABC ABCD 3 ZZ WLOA 2015-12-01 2016-02-29 788.81 ZZZ P 110912.18
C ABC ABCD 3 ZZ WLOA 2016-12-10 2017-02-29 18106.48 ZZZ P 110912.18
C ABC ABCD 3 ZZ WLOA 2016-12-10 2017-05-31 1652.2 ZZZ P 55947.43
C ABC ABCD 3 ZZ WLOA 2016-12-10 2017-08-31 650.05 ZZZ P 45500.00
C ABC ABCD 3 ZZ WLOA 2016-12-10 2017-09-20 500.15 ZZZ P 37525.00
C ABC ABCD 3 ZZ WLOA 2016-12-10 2017-10-01 357.05 ZZZ P 12385.00
C ABC ABCD 3 ZZ WLOA 2015-12-01 2016-12-01 2663.40 ZZZ P 110912.18 ** (New record from above logic - Year 1 ) **
C ABC ABCD 3 ZZ WLOA 2016-12-01 2017-12-01 21265.93 ZZZ P 12385.00 ** (New record from above logic - Year 2 ) **

As the above logic will be running over a large amount of records, I would assume AWK will be the most efficient solution to the above, however my experience of AWK is extremely limited, therefore I am unsure as to how to proceed with starting the above logic.

Here are ways to do it in shell and in awk :

#!/bin/ksh
VAR_DATE=${1:-2016-02-15}

year=${VAR_DATE%%-*}
mon_day=${VAR_DATE#*-}
VAR_DATE_PLUS_1YEAR="$((year + 1))-$mon_day"
VAR_DATE_PLUS_2YEAR="$((year + 2))-$mon_day"
VAR_DATE_PLUS_3YEAR="$((year + 3))-$mon_day"
VAR_DATE_PLUS_4YEAR="$((year + 4))-$mon_day"
VAR_DATE_PLUS_5YEAR="$((year + 5))-$mon_day"

echo 'From shell calculations:'
printf 'VAR_DATE is %s\n' "$VAR_DATE"
printf 'VAR_DATE_PLUS_1YEAR is %s\n' "$VAR_DATE_PLUS_1YEAR"
printf 'VAR_DATE_PLUS_2YEAR is %s\n' "$VAR_DATE_PLUS_2YEAR"
printf 'VAR_DATE_PLUS_3YEAR is %s\n' "$VAR_DATE_PLUS_3YEAR"
printf 'VAR_DATE_PLUS_4YEAR is %s\n' "$VAR_DATE_PLUS_4YEAR"
printf 'VAR_DATE_PLUS_5YEAR is %s\n' "$VAR_DATE_PLUS_5YEAR"

awk -v VAR_DATE="$VAR_DATE" '
BEGIN {	y = substr(VAR_DATE, 1, 4)
	VAR_DATE_PLUS_1YEAR = y + 1 substr(VAR_DATE, 5)
	VAR_DATE_PLUS_2YEAR = y + 2 substr(VAR_DATE, 5)
	VAR_DATE_PLUS_3YEAR = y + 3 substr(VAR_DATE, 5)
	VAR_DATE_PLUS_4YEAR = y + 4 substr(VAR_DATE, 5)
	VAR_DATE_PLUS_5YEAR = y + 5 substr(VAR_DATE, 5)
	print "From awk calculations:"
	printf "VAR_DATE_PLUS_1YEAR is %s\n", VAR_DATE_PLUS_1YEAR
	printf "VAR_DATE_PLUS_2YEAR is %s\n", VAR_DATE_PLUS_2YEAR
	printf "VAR_DATE_PLUS_3YEAR is %s\n", VAR_DATE_PLUS_3YEAR
	printf "VAR_DATE_PLUS_4YEAR is %s\n", VAR_DATE_PLUS_4YEAR
	printf "VAR_DATE_PLUS_5YEAR is %s\n", VAR_DATE_PLUS_5YEAR
}'

Hi Don,

Thanks for the above, that solution works well for generating the different years I require, but I also need a way to group these records as mentioned above.

We need to group records that fall between VAR_DATE and VAR_DATE +1 year and do some calculations on some of the columns i.e. Sum of all records (COL_9) and Final record value in COL_12

Any help would be appreciated.

What have you tried to solve this problem?

What are the names of the input and output files you want to process?

Give us more detail about your input file:

  1. What is the sort order for your input file?
  2. What is supposed to happen if the input date is something like 2015-01-01 with a line like the 3rd line in your sample input:
    C ABC ABCD 3 ZZ WLOA 2015-12-01 2016-01-31 898.86 ZZZ P 129018.66
    where the apparent start date in COL_7 is within the range of 1 year, but the apparent end date in COL_8 is not in that same range?
  3. Your stated logic isn't clear about what constitutes the final record that determines how COL_12 is supposed to be set. Is it the last line found in the file, the line with the highest COL_7 value, the line with the highest COL_8 value, or something else?