Adding Fields to the file

Hi All,

I get a file on weekly basis from client. I need to write a script which make sure the file should have 20 columns after the first column. If not then the script should add the remaining columns and default them to space(except for 2nd and 3rd). and at the same time the script should check the second and the third column if null then default those to ND.

Sample Input file:

0000000000||3|MAIL ORDER|MAIL ORDER DATA|N/A|MAIL ORDER|N/A|
0000010001|CASH||CASH|CASH|N/A|CASH|N/A|
0000020002|FFS MED|2|AL MEDICAID|MEDICAID|N/A|AL MEDICAID|N/A|"Tier2,PA|NPF"

Sample Outfile Required:

0000000000|ND|3|MAIL ORDER|MAIL ORDER DATA|N/A|MAIL ORDER|N/A| | | | | | | | | | | |
0000010001|CASH|ND|CASH|CASH|N/A|CASH|N/A| | | | | | | | | | | |
0000020002|FFS MED|2|AL MEDICAID|MEDICAID|N/A|AL MEDICAID|N/A|"Tier2,PA|NPF"| | | | | | | | | | |

Any help appreciated. Thanks in advance :slight_smile:

Hi Arun,

Please code tags. don't use icode tags. icode tags are used for displaying in the line itself like this ...

Try

awk -F\| '{for(i=(NF+1);i<=20;i++){$i=" "};$2=$2?$2:"ND";$3=$3?$3:"ND";}1' OFS="|" file

Wow Pamu .. U seem to be having answer for every question.. grreatttt..
And thanks very much. It just worked fine for me. I verified from all scenarios and it was just fine cool!!!!!:b:.

There's a couple of other ways you can do this in awk. The shorter one I forget, but there's this:

$ awk 'BEGIN {FS=OFS="|"} NF<19{$19=FS}1' file
0000000000||3|MAIL ORDER|MAIL ORDER DATA|N/A|MAIL ORDER|N/A||||||||||||
0000010001|CASH||CASH|CASH|N/A|CASH|N/A||||||||||||
0000020002|FFS MED|2|AL MEDICAID|MEDICAID|N/A|AL MEDICAID|N/A|"Tier2,PA|NPF"||||||||||

Scott,

Would this satisfy the condition "the script should check the second and the third column if null then default those to ND" ?? :slight_smile:

I guess not.

Um. Err.

Maybe I should have read the whole question!

I like pamu's solution :slight_smile:

Thanks!

I like concept of pamu's solution also, but it has a few problems:

  1. If there is an empty line or a line with only one field, fields 2 and 3 will be set to a space; not "ND".
  2. If there is a line with three fields in the input file, field 3 will be set to a space no matter what it originally contained.
  3. If a line has four to twenty fields (inclusive), the contents of the last field present will be replaced by a space.

I think the following does what was requested:

awk 'BEGIN{FS = OFS = "|"}
{       if(NF == 0) $1 = " "
        $2 = $2 ? $2 : "ND"
        $3 = $3 ? $3 : "ND"
        for(i = NF+1; i <= 20; i++) $i = " "
        print
}' file
2 Likes

What also should be taken into account is the last field on line 3 that contains a quoted field with a pipe-symbol inside that is not a field separator:

awk '{for(i=1;i<=NF;i+=2)gsub(/\|/,"�",$i)}1' FS=\" OFS=\" infile |
awk '$2==""{$2="ND"} $3==""{$3="ND"}{for(i=1; i<=20; i++) if($i=="")$i=" "}1' FS=� OFS=\|

produces:

0000000000|ND|3|MAIL ORDER|MAIL ORDER DATA|N/A|MAIL ORDER|N/A| | | | | | | | | | | | 
0000010001|CASH|ND|CASH|CASH|N/A|CASH|N/A| | | | | | | | | | | | 
0000020002|FFS MED|2|AL MEDICAID|MEDICAID|N/A|AL MEDICAID|N/A|"Tier2,PA|NPF"| | | | | | | | | | | 
2 Likes

So this is a CSV file (except that it's using the pipe character as delimiters instead of commas). Whatever is used to process this file should have the ability to parse it. Therefore you should do it inside that tool. For example, if you load this file to Oracle table using sqlldr, it's very easy to create a load control file to populate the "ND" and " " values using NVLs.
gawk 4.0 has the ability to parse CSV files:

gawk4 '
BEGIN { 
  OFS = "|" 
  FPAT = "([^|]*)|(\"[^\"]+\")"
}
{
  for (i = 1; i < 20; ++i)
    if ($i == "") $i = (i < 9)? "ND" : " "
  if (NF < 20) $20 = ""
  print
}'

Hi Don,

I agree about NF+1 that was my mistake.. My bad..

But for other two i just need to shift my code...:slight_smile: and for getting desired output just need to add one more condition.. :smiley:

awk -F\| '{$1=$1?$1:" ";$2=$2?$2:"ND";$3=$3?$3:"ND"; for(i=(NF+1);i<=20;i++){$i=" "}}1' OFS="|" file

Thanks @Don Cragun, Scott Pamu and all who have replied to this thread. Just wanted to know from Pamu will the final awk statement provided by you solves the problem what ever was figured out by @Don Cragun?

Hi Arun,

I also Like Don's solution.:slight_smile:

Yes all the problems were found out by Don was resolved in my latest version.
But the query raised by Scrutinizer, if you have "|" inside double quotes and you don't want to consider this as FS. then you should go for Scrutinizer's solution from post 8.

pamu

Thanks Pamu but my original CSV file was:

plan_dk,imsmodelvar,paymt_type_az,mmars_plan,mmars_payer,mmars_pbm,account,botpbm,formulary_Crestor
0000000000,,3,MAIL ORDER,MAIL ORDER DATA,N/A,MAIL ORDER,N/A,
0000010001,CASH,1,CASH,CASH,N/A,CASH,N/A,
0000020002,FFS MED,2,AL MEDICAID,MEDICAID,N/A,AL MEDICAID,N/A,"Tier2,PA,NPF"

where i had comma inside the qoutes. Then i converted it to pipeline separated file using:

awk -F '"' 'NF>1{ for(i=1;i<=NF;i+=2){gsub(",","|",$i)}} NF==1{gsub(",","|")}1' OFS="" ${Inpt_File} > temp_file.txt
awk -F\| '{$1=$1?$1:" ";$2=$2?$2:"ND";$3=$3?$3:"ND"; for(i=(NF+1);i<=20;i++){$i=" "}}1' OFS="|" file

Outputs:

plan_dk,imsmodelvar,paymt_type_az,mmars_plan,mmars_payer,mmars_pbm,account,botpbm,formulary_Crestor|ND|ND| | | | | | | | | | | | | | | | |
0000000000,,3,MAIL ORDER,MAIL ORDER DATA,N/A,MAIL ORDER,N/A,|ND|ND| | | | | | | | | | | | | | | | |
0000010001,CASH,1,CASH,CASH,N/A,CASH,N/A,|ND|ND| | | | | | | | | | | | | | | | |

ND should be in 2nd or 3rd column instead it is populated smwhere else.

Can you specify how to do it? And Please if you can remove headers along with this.

Hi Arun,

I think you are trying out one sollution for another problem..

Is this what you want..?

$ cat file
plan_dk,imsmodelvar,paymt_type_az,mmars_plan,mmars_payer,mmars_pbm,account,botpbm,formulary_Crestor
0000000000,,3,MAIL ORDER,MAIL ORDER DATA,N/A,MAIL ORDER,N/A,
0000010001,CASH,1,CASH,CASH,N/A,CASH,N/A,
0000020002,FFS MED,2,AL MEDICAID,MEDICAID,N/A,AL MEDICAID,N/A,"Tier2,PA,NPF"

$ awk 'NR%2{gsub(",","|")}1' RS='"' ORS= file | awk -F\| '$1==""{$1=" "} $2==""{$2="ND"} $3==""{$3="ND"} {if(NR>1){p=1} {for(i=(NF+1);i<=20;i++){$i=" "}}}p' OFS="|"

0000000000|ND|3|MAIL ORDER|MAIL ORDER DATA|N/A|MAIL ORDER|N/A|| | | | | | | | | | |
0000010001|CASH|1|CASH|CASH|N/A|CASH|N/A|| | | | | | | | | | |
0000020002|FFS MED|2|AL MEDICAID|MEDICAID|N/A|AL MEDICAID|N/A|Tier2,PA,NPF| | | | | | | | | | |

Note that there is a difference between a field not being present and a field being present but empty. With an empty line as input, both of our scripts will produce:

 |ND|ND| | | | | | | | | | | | | | | | |

But with the input line:

|||

(four empty fields separated by three vertical bars) your script will produce:

 |ND|ND|| | | | | | | | | | | | | | | |

while my script will produce:

|ND|ND|| | | | | | | | | | | | | | | |

with the difference being the leading space produced by your script.

if(NF == 0) $1 = " "

in my script creates $1 as a field containing a space if there was no 1st field while

$1=$1?$1:" "

in your script changes an existing empty $1 to a space. Note that both of our scripts kept the existing empty $4 as an empty field.

Thanks for the reply. I was actually looking for one solution s u said for both the issues. But your code is not working as when i use RS and ORS i get error for some columns as specified below:

awk: Input line
0000030075,IPA,4,AE cannot be longer than 3,000 bytes.
 The input line number is 44. The file is DCIA_C150_R01_PLN_SEG_DATA.TXT.
 The source line number is 1.

It seems i am using HP-UNIX that's why i cannot use the above code.
Do you have any alternative way so i could have one solution for both the issues.

---------- Post updated at 04:06 PM ---------- Previous update was at 03:50 PM ----------

There is an addition to the requirement as Don Mentioned. The 4th field or any other existing field(apart from 2nd and 3rd which if null should be defaulted to ND) if null should also be defaulted to space along with the additional fields being added

Help as always appreciated. :slight_smile:

Try

$ cat file
plan_dk,imsmodelvar,paymt_type_az,mmars_plan,mmars_payer,mmars_pbm,account,botpbm,formulary_Crestor
0000000000,,3,MAIL ORDER,MAIL ORDER DATA,N/A,MAIL ORDER,N/A,
0000010001,CASH,1,CASH,CASH,N/A,CASH,N/A,
0000020002,FFS MED,2,AL MEDICAID,MEDICAID,N/A,AL MEDICAID,N/A,"Tier2,PA,NPF"
$ awk -F '"' 'NF>1{ for(i=1;i<=NF;i+=2){gsub(",","|",$i)}} NF==1{gsub(",","|")}1' OFS="" file| awk -F\| '$1==""{$1=" "} $2==""{$2="ND"} $3==""{$3="ND"} {if(NR>1){p=1} {for(i=(NF+1);i<=20;i++){$i=" "}}}p' OFS="|"
0000000000|ND|3|MAIL ORDER|MAIL ORDER DATA|N/A|MAIL ORDER|N/A|| | | | | | | | | | |
0000010001|CASH|1|CASH|CASH|N/A|CASH|N/A|| | | | | | | | | | |
0000020002|FFS MED|2|AL MEDICAID|MEDICAID|N/A|AL MEDICAID|N/A|Tier2,PA,NPF| | | | | | | | | | |

Hi Don,

As Per OP's Requirement it should have space.

Thanks Pamu , that was awesome but there is a small change in the requirement i do not want any null field to be present. If there is an existing null field the script should default it to space along with the additional fields being added.

Try

awk -F '"' 'NF>1{ for(i=1;i<=NF;i+=2){gsub(",","|",$i)}} NF==1{gsub(",","|")}1' OFS="" file| awk -F\| '$1==""{$1=" "} $2==""{$2="ND"} $3==""{$3="ND"} {if(NR>1){p=1} {for(i=1;i<=20;i++){if($i==""){$i=" "}}}}p' OFS="|"
1 Like

Hi pamu,
I believe this meets all of Arun's updated requirements.

Could you explain why you thought you needed the NF>1 and the NF==1 tests in the first invocation of awk?

I'm not a big fan of 1-liners when it makes the code hard to read. I much prefer:

awk -F '"' '
NR > 1{	for(i = 1; i <= NF; i += 2) gsub(/,/, "|", $i)
	print
}' OFS="" file | awk 'BEGIN{FS = OFS = "|"}
{	for(i = 1; i <= 20; i++)
		$i = $i != "" ? $i : $i = (i == 2 || i == 3) ? "ND" : " "
	print
}'

which is still slightly shorter than your script and produces the same results.

If I strip it down to a 1-liner, it is about 30% shorter, but it is much harder for me to read and understand:

awk -F '"' 'NR>1{for(i=1;i<=NF;i+=2)gsub(/,/,"|",$i)}NR>1' OFS="" file|awk 'BEGIN{FS=OFS="|"}{for(i=1;i<=20;i++)$i=$i!=""?$i:$i=(i>1&&i<4)?"ND":" "}1''

but it still produces the same output.

1 Like