Changing CSV files with dates: subtracting values from the dates

Hi All,

I have a CSV file as shown below. Basically I need to take the year column in it and check whether the year is >= 20152. If it is, I should subtract 6 from all the values. In the example below the description contains numbers in YYWW form, so I need to subtract from those as well. Wherever I find the year I have to subtract 5; if the year is > 201601 I have to subtract 6. The year is represented with 52 weeks, so if the week is 03, for example 201403, then subtracting 6 yields 201349. I am planning to do this in C++; I am not sure whether it is possible with awk or sed.

The year representation goes like this:
201101
201103
..
..
..
201151
201152
201201
201202
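
A minimal sketch of the week arithmetic described above, assuming every year has exactly 52 weeks as in this representation. The function name sub_weeks is made up for illustration. It converts YYYYWW to a running week count, subtracts, and converts back, so the wrap into the previous year needs no special case.

```shell
# sub_weeks YYYYWW N: step N weeks back, assuming every year has 52 weeks
sub_weeks() {
    awk -v d="$1" -v n="$2" 'BEGIN {
        year = substr(d, 1, 4) + 0
        week = substr(d, 5, 2) + 0
        idx = year * 52 + (week - 1) - n    # weeks counted from year 0
        printf "%d%02d\n", int(idx / 52), idx % 52 + 1
    }'
}

sub_weeks 201403 6
sub_weeks 201835 6
sub_weeks 201527 5
```

With these inputs it should print 201349, 201829 and 201522.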

Original

id,description,type,year,obj
994475,1832 +TRANS     1835 10/17/18,S,201835,P
994477,1836 + NOTAPP 1839 10/17/18,S,201839,1
828058,CONTROL 1452-1527,1552-1627,S,201627,OP
828059,1452-1527,1552-1627,S,201627,UU

Modified

id,description,type,year
994475,1820 +TRANS     1829 10/17/18,S,201829,P         ---------------------  Year is 2018  should be subtracted by 6
994477,1830 + NOTAPP 1833 10/17/18,S,201833 ,1         ---------------------  Year is 2018  should be subtracted by 6
828058,CONTROL 1436-1521,1546-1621,S,201621,OP  ---------------------  Year is 2016  should be subtracted by 6
828059,1447-1522,1547-1602 ,S,201622,UU ---------------------                Year is 2015  should be subtracted by 5

Sorry, for the most part I thought I understood your goal, but I am a bit confused by the value "20152":

Is that simply a typo, or can the week numbers be either one or two digits? Is the value in fact "201502" or "20152x", or could the second week of 2015 be represented by both "201502" and "20152"?

Either way, it is easily done in awk, but with different algorithms, obviously.

bakunin

Here is my attempt at the operation; perhaps a shorter or better solution can be made ...

BEGIN {
OFS=FS=","  # set OFS too, or modified lines are rejoined with spaces
mweek=52  # weeks per year in this representation
}
NR > 1 {  # skip the header line
year=substr($(NF-1),1,4) + 0  # + 0 forces a numeric comparison below
week=substr($(NF-1),5,2) + 0  # third argument of substr is a length
variance=( year > 2015 ) ? 6 : 5

# week 0 or below wraps into the previous year
if ( week <= variance ) { week=mweek - (variance - week) ; year=year - 1 }
else { week=week - variance }
$(NF-1)=sprintf("%d%02d",year,week)  # assign the field directly instead of sub()
} 1

Save as program.awk and run as awk -f program.awk input
The cutoff year is hardcoded, and if the variance needs to change, so does the zero padding of the week variable.

Your input seems to have the year in field $4 or $5, which varies between lines, but is it always in $(NF-1), one field before the last?
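
Why $(NF-1) rather than a fixed field number: the description can itself contain commas, so the number of fields differs per line while the year stays one field before the last. A small sketch using two of the sample lines (year_field is a hypothetical helper name):

```shell
# year_field: print the field count and the field before the last one
year_field() {
    awk -F',' '{ print NF, $(NF-1) }'
}

printf '%s\n' \
    '994475,1832 +TRANS     1835 10/17/18,S,201835,P' \
    '828058,CONTROL 1452-1527,1552-1627,S,201627,OP' | year_field
```

This should print "5 201835" and "6 201627": the field count changes, but $(NF-1) is the year on both lines.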

Hope that helps
Regards
Peasant.

First, my apologies for not posting the actual data.

ORIGINAL
994475;1832 +  S PP1835 10/17/18;S;P201835;115;N;4,4;M;0;xx994475;*;BA7005;10/17/2018 16:48
994477;1836 +  S PP1839 10/17/18;S;P201839;115;N;4,4;M;0;xxh994477;*;BA7005;10/17/2018 16:48
994479;CONTROL 1452-1527,1552-1627;P201527;115;N;4,4;M;0;RDHSYNDCT_12_1515FF_0706;*;B7005;10/17/2018 16:49


EXPECTED
994475;1826+  S PP1829 10/17/18;S;P201829;115;N;4,4;M;0;xx994475;*;BA7005;10/17/2018 16:48  ---> Subtract 5 from columns 2, 4 and 10 if the year is 2015 or earlier, or 6 if it is 2016 or later
994477;1830 +  S PP1833 10/17/18;S;P201833;115;N;4,4;M;0;xxh994477;*;BA7005;10/17/2018 16:48  ---> Subtract 5 from columns 2, 4 and 10 if the year is 2015 or earlier, or 6 if it is 2016 or later
994479;CONTROL 1447-1522,1547-1622;S;P201522;115;N;4,4;M;0;RHS_12_1510FF_0706;*;B7005;10/17/2018 16:49  ---> Subtract 5 from columns 2, 4 and 10 if the year is 2015 or earlier, or 6 if it is 2016 or later

The year I should take will always be in column 4; in the scenario above it appears as P201835. The columns where I need to do the subtraction are 2, 4 and 10.

I tried changing the awk code like this and ran it. It gives me the same output as the original, with no change.

BEGIN {
FS=";"
mweek=52
}
NR > 1{
year=substr($(NF-9),1,4)
week=substr($(NF-9),5,6)
variance=( year > 2015 ) ? 6 : 5

if ( int(week) == variance ) { week=mweek ; year=year - 1 ; sub($(NF-9),year week,$(NF-9)) }
else if ( int(week) < variance ) { week=mweek - (variance - week) ; year=year-1 ; sub($(NF-9),year week,$(NF-9)) }
else { week=sprintf("%02d",week - variance) ; sub($(NF-9),year week,$(NF-9)) }
} 1

I'm not sure I follow.
The input is now also inconsistent: the first two rows have 13 fields and the last has 12.

Now you say you require fields 2, 4 and 10, but in the expected output you changed only field 4 for the first two lines and field 5 for the last line.

I have no idea what to do with fields 2 and 10, but we can work with 4 and 5 using awk match and a regex.
Is this a good guess now, or are we missing some input again?

BEGIN {
FS=";"
mweek=52  # weeks per year
}
# a header line is skipped automatically, since it does not match the pattern
match($0,/P[12][0-9][0-9][0-9][0-5][0-9]/) {  # only lines with a P<year week> value
dw=substr($0,RSTART+1,RLENGTH-1)  # drop the leading P
year=substr(dw,1,4) + 0  # + 0 forces a numeric comparison below
week=substr(dw,5,2) + 0  # third argument of substr is a length
variance=( year > 2015 ) ? 6 : 5

# week 0 or below wraps into the previous year
if ( week <= variance ) { week=mweek - (variance - week) ; year=year - 1 }
else { week=week - variance }
sub("P" dw,"P" sprintf("%d%02d",year,week),$0)
} 1

Make sure no other field in the line matches the P<year week> regex, since we are matching against $0.

Please read about NF, RSTART and RLENGTH here (this is the gawk manual, but they are available in other awks as well):
ftp://ftp.gnu.org/pub/old-gnu/Manuals/gawk-3.0.3/html_chapter/gawk_11.html#SEC110
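
To see what match() leaves in RSTART and RLENGTH, here is a small sketch (find_pdate is a name invented for this example). When nothing matches, RSTART is 0 and RLENGTH is -1, which is why it is safer to test the return value of match() before using substr() on the result.

```shell
# find_pdate: print RSTART, RLENGTH and the matched digits for each input line
find_pdate() {
    awk '{
        if (match($0, /P[12][0-9][0-9][0-9][0-5][0-9]/))
            print RSTART, RLENGTH, substr($0, RSTART + 1, RLENGTH - 1)
        else
            print RSTART, RLENGTH, "no match"
    }'
}

echo '994475;1832 +  S PP1835 10/17/18;S;P201835;115' | find_pdate
echo 'id;description;type;year' | find_pdate
```

Note that PP1835 in the second field is not picked up, because the pattern needs six digits after a single P.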

Regards
Peasant.


I missed a column in the input.

Input with all columns:
994475;1832 +  S PP1835 10/17/18;S;P201835;115;N;4,4;M;0;xx994475;*;BA7005;10/17/2018 16:48
994477;1836 +  S PP1839 10/17/18;S;P201839;115;N;4,4;M;0;xxh994477;*;BA7005;10/17/2018 16:48
994479;CONTROL 1452-1527,1552-1627;S;P201527;115;N;4,4;M;0;RHS_12_1515FF_0706;*;B7005;10/17/2018 16:49

I changed all the columns in the expected output. In column 4 I changed P201835 to P201829. For columns 2 and 10 it is like changing the YYWW: I changed 1832 to 1826, and in column 10 of the last row I changed RHS_12_1515FF_0706 to RHS_12_1510FF_0706.

I ran it and it produced the same output as the original:

(K>)  awk -f change.awk input.csv
994479;CONTROL 1452-1527,1552-1627;S;P201527;115;N;4,4;M;0;RHS_12_1515FF_0706;*;B7005;10/17/2018 16:49
(K>) cat input.csv
994479;CONTROL 1452-1527,1552-1627;S;P201527;115;N;4,4;M;0;RHS_12_1515FF_0706;*;B7005;10/17/2018 16:49
          

All the subtraction done on fields $2 and $10 uses the variance, which depends on the year in field $4.
If $2 contains no string of the form PP<number>, $2 is printed unchanged.
If $10 does not contain the string RHS, $10 is printed unchanged.

Hopefully that's it:

BEGIN {
OFS=FS=";"
mweek=52  # weeks per year
}

# skip the header, and any line whose $4 holds no P<year week> value
NR > 1 && match($4,/P[12][0-9][0-9][0-9][0-5][0-9]/) {
dw=substr($4,RSTART+1,RLENGTH-1)  # drop the leading P
year=substr(dw,1,4) + 0  # + 0 forces a numeric comparison below
week=substr(dw,5,2) + 0  # third argument of substr is a length
variance=( year > 2015 ) ? 6 : 5

# $10 and $2 use plain subtraction: no wrap into the previous year
if ( match($10,/RHS/) ) {
	split($10,g,"_")
	$10=g[1] "_" g[2] "_" (int(g[3]) - variance) "FF_" g[4]
	}

if ( match($2,/PP[0-9]+/) ) {
	a="PP" (substr($2,RSTART+2,RLENGTH-2) - variance)
	sub(substr($2,RSTART,RLENGTH),a,$2)
	}

# $4: week 0 or below wraps into the previous year
if ( week <= variance ) {
	week=mweek - (variance - week) ; year=year - 1
	}
else {
	week=week - variance
	}
sub(dw,sprintf("%d%02d",year,week),$4)
} 1

Regards
Peasant.


It worked partially

(K>)  awk -f change.awk input.csv
id,description,type,year
994475;1832 +  S PP1829 10/17/18;S;P201829;115;N;4,4;M;0;xx994475;*;BA7005;10/17/2018 16:48
994477;1836 +  S PP1833 10/17/18;S;P201833;115;N;4,4;M;0;xxh994477;*;BA7005;10/17/2018 16:48
994479;CONTROL 1452-1527,1552-1627;S;P201522;115;N;4,4;M;0;RHS_12_1510FF_0706;*;B7005;10/17/2018 16:49


In the above output, row 1, column 2 is still 1832 + S PP1829 10/17/18. It updated the part after the + but 1832 should be 1816, and that is not changed.
In row 2, column 2 it didn't update either, same as above: 1836 should be 1830.
In row 3, column 2, CONTROL 1452-1527,1552-1627, nothing is updated. It should be CONTROL 1446-1521,1545-1621

Is the above a mistake? The subtraction you described is 5 or 6, but 1832 to 1816 is a difference of 16.

Well, you really gave me a challenge, for someone who doesn't code like this on a daily basis :)
This is growing into a monster if it has to be done without building very big arrays and processing them in an END block.

The code assumes that field $10 (for example RHS_12_1510FF_0706) always starts with RHS and has 4 values when split on _ .
It also assumes that on the same line field $2 starts with the string CONTROL and has 2 values when split on a space.

The sample line below shows both: the CONTROL string in $2 and the RHS string in $10.
Lines that do not match both conditions are not processed by that part of the program (the if block that tests for ^RHS and ^CONTROL).

994479;CONTROL 1446-1521,1546-1621;S;P201921;115;N;4,4;M;0;RHS_12_1519FF_0706;*;B7005;10/17/2018 16:49
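
A small sketch of the split() and int() steps that the $10 handling relies on, under the same assumptions (four parts separated by _, and a literal FF after the digits). shift_rhs is a hypothetical name, not something from the program.

```shell
# shift_rhs STRING N: subtract N from the leading digits of the third "_" part
shift_rhs() {
    awk -v s="$1" -v n="$2" 'BEGIN {
        parts = split(s, g, "_")            # RHS_12_1515FF_0706 gives 4 parts
        if (parts != 4) { print s; exit }   # leave unexpected shapes alone
        # int("1515FF") is 1515; the parentheses make the precedence explicit
        print g[1] "_" g[2] "_" (int(g[3]) - n) "FF_" g[4]
    }'
}

shift_rhs RHS_12_1515FF_0706 5
```

This should print RHS_12_1510FF_0706. In awk, subtraction binds tighter than string concatenation, so the version without parentheses behaves the same, but the parentheses make the intent easier to read.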

I'm trying to write this so that the program runs with a small memory footprint on large files and keeps the original order of the file untouched.
Hopefully more experienced awkers can tell me whether I succeeded or this is just a mess.
Feel free to chip in, boys :o

BEGIN {
OFS=FS=";"
mweek=52  # weeks per year
}

# skip the header, and any line whose $4 holds no P<year week> value
NR > 1 && match($4,/P[12][0-9][0-9][0-9][0-5][0-9]/) {
dw=substr($4,RSTART+1,RLENGTH-1)  # drop the leading P
year=substr(dw,1,4) + 0  # + 0 forces a numeric comparison below
week=substr(dw,5,2) + 0  # third argument of substr is a length
variance=( year > 2015 ) ? 6 : 5

# $2 and $10 use plain subtraction: no wrap into the previous year
if ( ($10 ~ /^RHS/) && ($2 ~ /^CONTROL/) ) {
	split($10,g,"_")
	$10=g[1] "_" g[2] "_" (int(g[3]) - variance) "FF_" g[4]

	# "CONTROL 1452-1527,1552-1627" becomes e[1]..e[4]
	split($2,c," ") ; gsub(",|-"," ",c[2]) ; split(c[2],e," ")
	fnl=(e[1] - variance) "-" (e[2] - variance) "," (e[3] - variance) "-" (e[4] - variance)
	$2="CONTROL " fnl
	}

if ( match($2,/PP[0-9]+/) ) {
	a="PP" (substr($2,RSTART+2,RLENGTH-2) - variance)
	sub(substr($2,RSTART,RLENGTH),a,$2)
	}

if ( match($2,/^[0-9]+/) ) {
	f=substr($2,RSTART,RLENGTH) - variance
	sub("^[0-9]+",f,$2)
	}

# $4: week 0 or below wraps into the previous year
if ( week <= variance ) {
	week=mweek - (variance - week) ; year=year - 1
	}
else {
	week=week - variance
	}
sub(dw,sprintf("%d%02d",year,week),$4)
} 1
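
One caveat: the subtraction on $2 and $10 above is plain arithmetic, so a value such as 1503 would become 1498 instead of wrapping into week 50 of the previous year. If those values really are YYWW, a week-aware variant could look like the sketch below. It assumes 52 weeks per year and that every run of four digits in the text is a YYWW value; shift_yyww is a made-up name.

```shell
# shift_yyww TEXT N: step every 4-digit YYWW group in TEXT back by N weeks,
# assuming 52 weeks per year and that each 4-digit run is a YYWW value
shift_yyww() {
    awk -v s="$1" -v n="$2" 'BEGIN {
        out = ""
        while (match(s, /[0-9][0-9][0-9][0-9]/)) {
            v = substr(s, RSTART, RLENGTH)
            idx = substr(v, 1, 2) * 52 + substr(v, 3, 2) - 1 - n
            out = out substr(s, 1, RSTART - 1) sprintf("%02d%02d", int(idx / 52), idx % 52 + 1)
            s = substr(s, RSTART + RLENGTH)   # continue after the match
        }
        print out s
    }'
}

shift_yyww '1452-1527,1552-1627' 5
shift_yyww '1503-1510' 5
```

The first call should print 1447-1522,1547-1622 and the second 1450-1505. It would have to be applied only to the part of the field that holds the ranges, not to dates or ids.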

Hope that helps
Regards
Peasant
