Adding Fields to the file

Arun_Mishra · December 17, 2012, 5:27am

Hi All,

I get a file on weekly basis from client. I need to write a script which make sure the file should have 20 columns after the first column. If not then the script should add the remaining columns and default them to space(except for 2nd and 3rd). and at the same time the script should check the second and the third column if null then default those to ND.

Sample Input file:

0000000000||3|MAIL ORDER|MAIL ORDER DATA|N/A|MAIL ORDER|N/A|
0000010001|CASH||CASH|CASH|N/A|CASH|N/A|
0000020002|FFS MED|2|AL MEDICAID|MEDICAID|N/A|AL MEDICAID|N/A|"Tier2,PA|NPF"

Sample Outfile Required:

0000000000|ND|3|MAIL ORDER|MAIL ORDER DATA|N/A|MAIL ORDER|N/A| | | | | | | | | | | |
0000010001|CASH|ND|CASH|CASH|N/A|CASH|N/A| | | | | | | | | | | |
0000020002|FFS MED|2|AL MEDICAID|MEDICAID|N/A|AL MEDICAID|N/A|"Tier2,PA|NPF"| | | | | | | | | | |

Any help appreciated. Thanks in advance

pamu · December 17, 2012, 5:34am

Hi Arun,

Please code tags. don't use icode tags. icode tags are used for displaying in the line itself like this ...

Try

awk -F\| '{for(i=(NF+1);i<=20;i++){$i=" "};$2=$2?$2:"ND";$3=$3?$3:"ND";}1' OFS="|" file

Arun_Mishra · December 17, 2012, 5:55am

Wow Pamu .. U seem to be having answer for every question.. grreatttt..
And thanks very much. It just worked fine for me. I verified from all scenarios and it was just fine cool!!!!!.

Scott · December 17, 2012, 6:45am

There's a couple of other ways you can do this in awk. The shorter one I forget, but there's this:

$ awk 'BEGIN {FS=OFS="|"} NF<19{$19=FS}1' file
0000000000||3|MAIL ORDER|MAIL ORDER DATA|N/A|MAIL ORDER|N/A||||||||||||
0000010001|CASH||CASH|CASH|N/A|CASH|N/A||||||||||||
0000020002|FFS MED|2|AL MEDICAID|MEDICAID|N/A|AL MEDICAID|N/A|"Tier2,PA|NPF"||||||||||

PikK45 · December 17, 2012, 6:46am

Scott,

Would this satisfy the condition "the script should check the second and the third column if null then default those to ND" ??

I guess not.

Scott · December 17, 2012, 6:50am

Um. Err.

Maybe I should have read the whole question!

I like pamu's solution

Thanks!

Don_Cragun · December 18, 2012, 12:30pm

I like concept of pamu's solution also, but it has a few problems:

If there is an empty line or a line with only one field, fields 2 and 3 will be set to a space; not "ND".
If there is a line with three fields in the input file, field 3 will be set to a space no matter what it originally contained.
If a line has four to twenty fields (inclusive), the contents of the last field present will be replaced by a space.

I think the following does what was requested:

awk 'BEGIN{FS = OFS = "|"}
{       if(NF == 0) $1 = " "
        $2 = $2 ? $2 : "ND"
        $3 = $3 ? $3 : "ND"
        for(i = NF+1; i <= 20; i++) $i = " "
        print
}' file

Scrutinizer · December 18, 2012, 3:25pm

What also should be taken into account is the last field on line 3 that contains a quoted field with a pipe-symbol inside that is not a field separator:

awk '{for(i=1;i<=NF;i+=2)gsub(/\|/,"�",$i)}1' FS=\" OFS=\" infile |
awk '$2==""{$2="ND"} $3==""{$3="ND"}{for(i=1; i<=20; i++) if($i=="")$i=" "}1' FS=� OFS=\|

produces:

0000000000|ND|3|MAIL ORDER|MAIL ORDER DATA|N/A|MAIL ORDER|N/A| | | | | | | | | | | | 
0000010001|CASH|ND|CASH|CASH|N/A|CASH|N/A| | | | | | | | | | | | 
0000020002|FFS MED|2|AL MEDICAID|MEDICAID|N/A|AL MEDICAID|N/A|"Tier2,PA|NPF"| | | | | | | | | | |

binlib · December 18, 2012, 8:37pm

So this is a CSV file (except that it's using the pipe character as delimiters instead of commas). Whatever is used to process this file should have the ability to parse it. Therefore you should do it inside that tool. For example, if you load this file to Oracle table using sqlldr, it's very easy to create a load control file to populate the "ND" and " " values using NVLs.
gawk 4.0 has the ability to parse CSV files:

gawk4 '
BEGIN { 
  OFS = "|" 
  FPAT = "([^|]*)|(\"[^\"]+\")"
}
{
  for (i = 1; i < 20; ++i)
    if ($i == "") $i = (i < 9)? "ND" : " "
  if (NF < 20) $20 = ""
  print
}'

pamu · December 19, 2012, 2:14am

don cragun:

I like concept of pamu's solution also, but it has a few problems:

If there is an empty line or a line with only one field, fields 2 and 3 will be set to a space; not "ND".

If there is a line with three fields in the input file, field 3 will be set to a space no matter what it originally contained.

If a line has four to twenty fields (inclusive), the contents of the last field present will be replaced by a space.

I think the following does what was requested:
awk 'BEGIN{FS = OFS = "|"}
{       if(NF == 0) $1 = " "
   $2 = $2 ? $2 : "ND"
   $3 = $3 ? $3 : "ND"
   for(i = NF+1; i <= 20; i++) $i = " "
   print
}' file

Hi Don,

I agree about NF+1 that was my mistake.. My bad..

But for other two i just need to shift my code... and for getting desired output just need to add one more condition..

awk -F\| '{$1=$1?$1:" ";$2=$2?$2:"ND";$3=$3?$3:"ND"; for(i=(NF+1);i<=20;i++){$i=" "}}1' OFS="|" file

Arun_Mishra · December 19, 2012, 2:19am

Thanks @Don Cragun, Scott Pamu and all who have replied to this thread. Just wanted to know from Pamu will the final awk statement provided by you solves the problem what ever was figured out by @Don Cragun?

pamu · December 19, 2012, 2:34am

Hi Arun,

I also Like Don's solution.

Yes all the problems were found out by Don was resolved in my latest version.
But the query raised by Scrutinizer, if you have "|" inside double quotes and you don't want to consider this as FS. then you should go for Scrutinizer's solution from post 8.

pamu

Arun_Mishra · December 19, 2012, 3:07am

Thanks Pamu but my original CSV file was:

plan_dk,imsmodelvar,paymt_type_az,mmars_plan,mmars_payer,mmars_pbm,account,botpbm,formulary_Crestor
0000000000,,3,MAIL ORDER,MAIL ORDER DATA,N/A,MAIL ORDER,N/A,
0000010001,CASH,1,CASH,CASH,N/A,CASH,N/A,
0000020002,FFS MED,2,AL MEDICAID,MEDICAID,N/A,AL MEDICAID,N/A,"Tier2,PA,NPF"

where i had comma inside the qoutes. Then i converted it to pipeline separated file using:

awk -F '"' 'NF>1{ for(i=1;i<=NF;i+=2){gsub(",","|",$i)}} NF==1{gsub(",","|")}1' OFS="" ${Inpt_File} > temp_file.txt

awk -F\| '{$1=$1?$1:" ";$2=$2?$2:"ND";$3=$3?$3:"ND"; for(i=(NF+1);i<=20;i++){$i=" "}}1' OFS="|" file

Outputs:

plan_dk,imsmodelvar,paymt_type_az,mmars_plan,mmars_payer,mmars_pbm,account,botpbm,formulary_Crestor|ND|ND| | | | | | | | | | | | | | | | |
0000000000,,3,MAIL ORDER,MAIL ORDER DATA,N/A,MAIL ORDER,N/A,|ND|ND| | | | | | | | | | | | | | | | |
0000010001,CASH,1,CASH,CASH,N/A,CASH,N/A,|ND|ND| | | | | | | | | | | | | | | | |

ND should be in 2nd or 3rd column instead it is populated smwhere else.

Can you specify how to do it? And Please if you can remove headers along with this.

pamu · December 19, 2012, 4:30am

Hi Arun,

I think you are trying out one sollution for another problem..

Is this what you want..?

$ cat file
plan_dk,imsmodelvar,paymt_type_az,mmars_plan,mmars_payer,mmars_pbm,account,botpbm,formulary_Crestor
0000000000,,3,MAIL ORDER,MAIL ORDER DATA,N/A,MAIL ORDER,N/A,
0000010001,CASH,1,CASH,CASH,N/A,CASH,N/A,
0000020002,FFS MED,2,AL MEDICAID,MEDICAID,N/A,AL MEDICAID,N/A,"Tier2,PA,NPF"

$ awk 'NR%2{gsub(",","|")}1' RS='"' ORS= file | awk -F\| '$1==""{$1=" "} $2==""{$2="ND"} $3==""{$3="ND"} {if(NR>1){p=1} {for(i=(NF+1);i<=20;i++){$i=" "}}}p' OFS="|"

0000000000|ND|3|MAIL ORDER|MAIL ORDER DATA|N/A|MAIL ORDER|N/A|| | | | | | | | | | |
0000010001|CASH|1|CASH|CASH|N/A|CASH|N/A|| | | | | | | | | | |
0000020002|FFS MED|2|AL MEDICAID|MEDICAID|N/A|AL MEDICAID|N/A|Tier2,PA,NPF| | | | | | | | | | |

Don_Cragun · December 19, 2012, 5:19am

pamu:

Hi Don,

I agree about NF+1 that was my mistake.. My bad..

But for other two i just need to shift my code... and for getting desired output just need to add one more condition..
awk -F\| '{$1=$1?$1:" ";$2=$2?$2:"ND";$3=$3?$3:"ND"; for(i=(NF+1);i<=20;i++){$i=" "}}1' OFS="|" file

Note that there is a difference between a field not being present and a field being present but empty. With an empty line as input, both of our scripts will produce:

 |ND|ND| | | | | | | | | | | | | | | | |

But with the input line:

|||

(four empty fields separated by three vertical bars) your script will produce:

 |ND|ND|| | | | | | | | | | | | | | | |

while my script will produce:

|ND|ND|| | | | | | | | | | | | | | | |

with the difference being the leading space produced by your script.

if(NF == 0) $1 = " "

in my script creates $1 as a field containing a space if there was no 1st field while

$1=$1?$1:" "

in your script changes an existing empty $1 to a space. Note that both of our scripts kept the existing empty $4 as an empty field.

Arun_Mishra · December 19, 2012, 5:36am

Thanks for the reply. I was actually looking for one solution s u said for both the issues. But your code is not working as when i use RS and ORS i get error for some columns as specified below:

awk: Input line
0000030075,IPA,4,AE cannot be longer than 3,000 bytes.
 The input line number is 44. The file is DCIA_C150_R01_PLN_SEG_DATA.TXT.
 The source line number is 1.

It seems i am using HP-UNIX that's why i cannot use the above code.
Do you have any alternative way so i could have one solution for both the issues.

---------- Post updated at 04:06 PM ---------- Previous update was at 03:50 PM ----------

There is an addition to the requirement as Don Mentioned. The 4th field or any other existing field(apart from 2nd and 3rd which if null should be defaulted to ND) if null should also be defaulted to space along with the additional fields being added

Help as always appreciated.

pamu · December 19, 2012, 5:47am

Try

$ cat file
plan_dk,imsmodelvar,paymt_type_az,mmars_plan,mmars_payer,mmars_pbm,account,botpbm,formulary_Crestor
0000000000,,3,MAIL ORDER,MAIL ORDER DATA,N/A,MAIL ORDER,N/A,
0000010001,CASH,1,CASH,CASH,N/A,CASH,N/A,
0000020002,FFS MED,2,AL MEDICAID,MEDICAID,N/A,AL MEDICAID,N/A,"Tier2,PA,NPF"

$ awk -F '"' 'NF>1{ for(i=1;i<=NF;i+=2){gsub(",","|",$i)}} NF==1{gsub(",","|")}1' OFS="" file| awk -F\| '$1==""{$1=" "} $2==""{$2="ND"} $3==""{$3="ND"} {if(NR>1){p=1} {for(i=(NF+1);i<=20;i++){$i=" "}}}p' OFS="|"
0000000000|ND|3|MAIL ORDER|MAIL ORDER DATA|N/A|MAIL ORDER|N/A|| | | | | | | | | | |
0000010001|CASH|1|CASH|CASH|N/A|CASH|N/A|| | | | | | | | | | |
0000020002|FFS MED|2|AL MEDICAID|MEDICAID|N/A|AL MEDICAID|N/A|Tier2,PA,NPF| | | | | | | | | | |

Hi Don,

As Per OP's Requirement it should have space.

Arun_Mishra · December 19, 2012, 6:21am

Thanks Pamu , that was awesome but there is a small change in the requirement i do not want any null field to be present. If there is an existing null field the script should default it to space along with the additional fields being added.

pamu · December 19, 2012, 6:47am

Try

awk -F '"' 'NF>1{ for(i=1;i<=NF;i+=2){gsub(",","|",$i)}} NF==1{gsub(",","|")}1' OFS="" file| awk -F\| '$1==""{$1=" "} $2==""{$2="ND"} $3==""{$3="ND"} {if(NR>1){p=1} {for(i=1;i<=20;i++){if($i==""){$i=" "}}}}p' OFS="|"

Don_Cragun · December 19, 2012, 1:13pm

pamu:

Try

awk -F '"' 'NF>1{ for(i=1;i<=NF;i+=2){gsub(",","|",$i)}} NF==1{gsub(",","|")}1' OFS="" file| awk -F\| '$1==""{$1=" "} $2==""{$2="ND"} $3==""{$3="ND"} {if(NR>1){p=1} {for(i=1;i<=20;i++){if($i==""){$i=" "}}}}p' OFS="|"

Hi pamu,
I believe this meets all of Arun's updated requirements.

Could you explain why you thought you needed the NF>1 and the NF==1 tests in the first invocation of awk?

I'm not a big fan of 1-liners when it makes the code hard to read. I much prefer:

awk -F '"' '
NR > 1{	for(i = 1; i <= NF; i += 2) gsub(/,/, "|", $i)
	print
}' OFS="" file | awk 'BEGIN{FS = OFS = "|"}
{	for(i = 1; i <= 20; i++)
		$i = $i != "" ? $i : $i = (i == 2 || i == 3) ? "ND" : " "
	print
}'

which is still slightly shorter than your script and produces the same results.

If I strip it down to a 1-liner, it is about 30% shorter, but it is much harder for me to read and understand:

awk -F '"' 'NR>1{for(i=1;i<=NF;i+=2)gsub(/,/,"|",$i)}NR>1' OFS="" file|awk 'BEGIN{FS=OFS="|"}{for(i=1;i<=20;i++)$i=$i!=""?$i:$i=(i>1&&i<4)?"ND":" "}1''

but it still produces the same output.