Complex transpose awk script

Hello to all in forum,

Maybe an awk expert could help me with this complex task for me.

I have the input shown below and I would like to get the output as follow:

  • I would like the output separated by commas.
  • The header is fixed and will be the same always.
  • For the lines containing OCTDP, TCTDP and GPRSTDP I would like to join their values separated by "|", so they joined in the output will represent a single field (I've highlighted for best understanding)
  • Each value below " CAMEL SUBSCRIPTION OPTIONS" would go in different fields in the output.
  • When any value is not found, leave blank.

Output desired:

MSISDN,OCTDP,TCTDP,GPRSTDP,GCSO,MCSO,SSLO,GC2SO,MC2SO,TIF,GPRSSO,OSMSSO,GC3SO,MC3SO,GC4SO,MC4SO,MMSO
20385503286,2|15|2038759581|0|2||0,18|15|2038759581|1|2|Y|,,0,5,0,1,2,0,1,0,1,0,1,1,1
20386671719,2|15|2038759581|0|2||7,18|15|2038759594|1|2|Y|,18|15|20387592403|0|3||,0,5,0,1,2,0,1,0,1,2,1,2,0
20387323200,,18|15|2038759581|1|2|Y|,18|15|20387592403|0|3||,1,3,0,1,2,0,1,1,0,2,2,2,2

The Input is:

<HGCMP:MSISDN=20385503286,OPT;
HLR CAMEL SUBSCRIPTION DATA

MSISDN           CSP
20385503286

TDPTYPE  TDP  SK         GSA              DEH   CCH I  DIALNUM
OCTDP      2          15 2038759581       0     2
TCTDP     18          15 2038759581       1     2   Y

CAMEL SUBSCRIPTION OPTIONS

GCSO  MCSO  SSLO  GC2SO  MC2SO  TIF  GPRSSO  OSMSSO
0     5     0     1      2      0    1       0

GC3SO  MC3SO  GC4SO  MC4SO
1      0      1      1

MMSO
1

END

<HGCMP:MSISDN=20386671719,OPT;
HLR CAMEL SUBSCRIPTION DATA

MSISDN           CSP
20386671719

TDPTYPE  TDP  SK         GSA              DEH   CCH I  DIALNUM
OCTDP      2          15 2038759581       0     2      7
TCTDP     18          15 2038759594       1     2   Y
GPRSTDP   18          15 20387592403      0     3

CAMEL SUBSCRIPTION OPTIONS

GCSO  MCSO  SSLO  GC2SO  MC2SO  TIF  GPRSSO  OSMSSO
0     5     0     1      2      0    1       0

GC3SO  MC3SO  GC4SO  MC4SO
1      2      1      2

MMSO
0

END

<HGCMP:MSISDN=20387323200,OPT;
HLR CAMEL SUBSCRIPTION DATA

MSISDN           CSP
20387323200

TDPTYPE  TDP  SK         GSA              DEH   CCH I  DIALNUM
TCTDP     18          15 2038759581       1     2   Y
GPRSTDP   18          15 20387592403      0     3

CAMEL SUBSCRIPTION OPTIONS

GCSO  MCSO  SSLO  GC2SO  MC2SO  TIF  GPRSSO  OSMSSO
1     3     0     1      2      0    1       1

GC3SO  MC3SO  GC4SO  MC4SO
0      2      2      2

MMSO
2

END

Thanks in advance for any help.

It seems more difficult than it actually is:

awk 'BEGIN {split("OCTDP TCTDP GPRSTDP",TDPAr);
            print "MSISDN,OCTDP,TCTDP,GPRSTDP,GCSO,MCSO,SSLO,GC2SO,MC2SO,TIF,GPRSSO,OSMSSO,GC3SO,MC3SO,GC4SO,MC4SO,MMSO"}
     /^MSISDN/   {getline; rec=$0}
     /^TDPTYPE/  {while ($0 !~ /^$/)
                     {getline; OFS="|"; tmp=$1; $1=""; sub("\|",""); tmpAr[tmp]=$0
                     }
                     for (i=1;i<=3;i++) rec=rec","tmpAr[TDPAr]; delete tmpAr
                 }
     /^(GCSO|GC3SO|MMSO)/ {
                      getline; OFS=","; $1=$1; rec=rec","$0}
     /END/       {print rec; rec=""}
    ' file
MSISDN,OCTDP,TCTDP,GPRSTDP,GCSO,MCSO,SSLO,GC2SO,MC2SO,TIF,GPRSSO,OSMSSO,GC3SO,MC3SO,GC4SO,MC4SO,MMSO
20385503286,2|15|2038759581|0|2,18|15|2038759581|1|2|Y,,0,5,0,1,2,0,1,0,1,0,1,1,1
20386671719,2|15|2038759581|0|2|7,18|15|2038759594|1|2|Y,18|15|20387592403|0|3,0,5,0,1,2,0,1,0,1,2,1,2,0
20387323200,,18|15|2038759581|1|2|Y,18|15|20387592403|0|3,1,3,0,1,2,0,1,1,0,2,2,2,2

What I did not cover is the missing I and DIALNUM fields - they are not identifyable by separators separating empty fields. Well, it's up to you to introduce a counter and fill those gaps ...

Hello RudiC,

Many thanks for your help, it works!.With getline you read the row below the string searched? Only to learn, is there other way to read line below matched string?

Besides this, if a number doesn't have values in the input, I would like to print the number in field 1 and "No_values" in the 2nd field. example:

Format when a number doesn't have values:

<HGCMP:MSISDN=20385503289,OPT;
HLR CAMEL SUBSCRIPTION DATA

MSISDN           CSP
20385503289

TDPTYPE  TDP  SK         GSA              DEH   CCH I  DIALNUM

NONE

END

Output desired for this case:

MSISDN,OCTDP,TCTDP,GPRSTDP,GCSO,MCSO,SSLO,GC2SO,MC2SO,TIF,GPRSSO,OSMSSO,GC3SO,MC3SO,GC4SO,MC4SO,MMSO
20385503286,2|15|2038759581|0|2,18|15|2038759581|1|2|Y,,0,5,0,1,2,0,1,0,1,0,1,1,1
20386671719,2|15|2038759581|0|2|7,18|15|2038759594|1|2|Y,18|15|20387592403|0|3,0,5,0,1,2,0,1,0,1,2,1,2,0
20387323200,,18|15|2038759581|1|2|Y,18|15|20387592403|0|3,1,3,0,1,2,0,1,1,0,2,2,2,2
20385503289,No_values

What code could be added to your script to get this part combined in the output?

Thanks in advance

You should provide a sample on how the output looks like when 20385503289 have no value.
Then it could be made an IF test to print No_values if an field is missing.
Regarding I and DAILNUM , they can be found by looking at their position from the left, if its always the same location.

Try

     /END/       {if (length(rec)<=11) rec=rec"no value"; print rec; rec=""} 

@Jotne: May be difficult to identify I and DIALNUM as sometimes one or the other is present (NF==7, which is which?) , sometimes both (easy! NF==8), and sometimes none (easy! NF==6) of them.

Hello RudiC,

I've tried:

awk 'BEGIN {split("OCTDP TCTDP GPRSTDP",TDPAr);
            print "MSISDN,OCTDP,TCTDP,GPRSTDP,GCSO,MCSO,SSLO,GC2SO,MC2SO,TIF,GPRSSO,OSMSSO,GC3SO,MC3SO,GC4SO,MC4SO,MMSO"}
     /^MSISDN/   {getline; rec=$0}
     /^TDPTYPE/  {while ($0 !~ /^$/)
                     {getline; OFS="|"; tmp=$1; $1=""; sub("|",""); tmpAr[tmp]=$0
                     }
                     for (i=1;i<=3;i++) rec=rec","tmpAr[TDPAr]; delete tmpAr
                 }
     /^(GCSO|GC3SO|MMSO)/ {
                      getline; OFS=","; $1=$1; rec=rec","$0}
    /END/       {if (length(rec)<=11) rec=rec"no value"; print rec; rec=""}' input

but the output is:

MSISDN,OCTDP,TCTDP,GPRSTDP,GCSO,MCSO,SSLO,GC2SO,MC2SO,TIF,GPRSSO,OSMSSO,GC3SO,MC3SO,GC4SO,MC4SO,MMSO
20385503286,|2|15|2038759581|0|2,|18|15|2038759581|1|2|Y,,0,5,0,1,2,0,1,0,1,0,1,1,1
20386671719,|2|15|2038759581|0|2|7,|18|15|2038759594|1|2|Y,|18|15|20387592403|0|3,0,5,0,1,2,0,1,0,1,2,1,2,0
20387323200,,|18|15|2038759581|1|2|Y,|18|15|20387592403|0|3,1,3,0,1,2,0,1,1,0,2,2,2,2
20385503289,,,

So, prints the number, but with commas and "no value" is not printed.

But the "if" to print "no value" could be without using length()? because the length of the numbers could be less or greater than 11 digits.

So, it is possible to do that "if" to print "no value" checking if below TDPTYPE is blank or if the string NONE is found?

Thanks in advance again for your kind help!

I is at position 53 and DIALNUM at 56 if I count correctly, so it should be not a big problem to take it out. NF will not work.

@Jotne: Ooh, got you, you were talking char positions... Yes, that should work.
@Ophiuchus: use old END statement and put in a line before:

     /^NONE/   {getline; rec=rec ",no values"} 

Be aware that all of this depends heavily on the structure of your file, i.e. the sequence of keywords and lines. Should this not be stable/reliable, a far more sophisticated code would be necessary...

Hello RudiC,

Thaank you. I've changed it. Still appear 3 more commas but is not a big deal.

I think I can remove them.

Many thanks for the help.

It would have been good to see the proposal of solution by jotne, to learn other method.

Thanks anyway for the comments.

Regards

The three commas are due to the fact that you wanted all three OCTDP TCTDP GPRSTDP fields listed even when empty. No need to remove them externally, can be done within the awk code. BUT - it would have been nice had the specification been complete from the beginning...
Try:

awk 'BEGIN {split("OCTDP TCTDP GPRSTDP",TDPAr);
            print "MSISDN,OCTDP,TCTDP,GPRSTDP,GCSO,MCSO,SSLO,GC2SO,MC2SO,TIF,GPRSSO,OSMSSO,GC3SO,MC3SO,GC4SO,MC4SO,MMSO"}
     /^MSISDN/   {getline; rec=$0}
     /^TDPTYPE/  {noflds=0
          while (1)
                     {getline; if ($0 ~ /^$/) break
              OFS="|"; tmp=$1; $1=""; sub("\|",""); tmpAr[tmp]=$0; noflds=3 ;
                     }
                     for (i=1; i<=noflds; i++) rec=rec","tmpAr[TDPAr]; delete tmpAr
                 }
     /^(GCSO|GC3SO|MMSO)/ {
                      getline; OFS=","; $1=$1; rec=rec","$0}
     /^NONE/   {getline; rec=rec ",no values"}
     /END/       {print rec; rec=""}
    ' file
MSISDN,OCTDP,TCTDP,GPRSTDP,GCSO,MCSO,SSLO,GC2SO,MC2SO,TIF,GPRSSO,OSMSSO,GC3SO,MC3SO,GC4SO,MC4SO,MMSO
20385503286,2|15|2038759581|0|2,18|15|2038759581|1|2|Y,,0,5,0,1,2,0,1,0,1,0,1,1,1
20386671719,2|15|2038759581|0|2|7,18|15|2038759594|1|2|Y,18|15|20387592403|0|3,0,5,0,1,2,0,1,0,1,2,1,2,0
20387323200,,18|15|2038759581|1|2|Y,18|15|20387592403|0|3,1,3,0,1,2,0,1,1,0,2,2,2,2
20385503289,no values

Hello RudiC,

Thanks for your answer! It works, and I understan, I've especified the condition when a number don't have values in post #3. Thank youu!

Last question:

I have your awk script in a file Script.sh and works, but I have inmediately below your code another awk script that uses the output of your script as input,
but when I execute the script as . Script.sh input.txt only is executed the first script but if I execute both scripts individually work fine.

I have like this within Script.sh:

Input=input.txt
Output=output.txt

awk 'BEGIN {split("OCTDP TCTDP GPRSTDP",TDPAr);
            print "MSISDN,OCTDP,TCTDP,GPRSTDP,GCSO,MCSO,SSLO,GC2SO,MC2SO,TIF,GPRSSO,OSMSSO,GC3SO,MC3SO,GC4SO,MC4SO,MMSO"}
     /^MSISDN/   {getline; rec=$0}
     /^TDPTYPE/  {noflds=0
          while (1)
                     {getline; if ($0 ~ /^$/) break
              OFS="|"; tmp=$1; $1=""; sub("\|",""); tmpAr[tmp]=$0; noflds=3 ;
                     }
                     for (i=1; i<=noflds; i++) rec=rec","tmpAr[TDPAr]; delete tmpAr
                 }
     /^(GCSO|GC3SO|MMSO)/ {
                      getline; OFS=","; $1=$1; rec=rec","$0}
     /^NONE/   {getline; rec=rec ",no values"}
     /END/       {print rec; rec=""}
    ' $Input > $Output
    
awk '...' $Output > Output.csv  # This second script is not executed.    

Maybe you can suggest me or see what could happen.

Thanks in advance.

I can't see a reason why the second script should not be executed. Maybe it is, and its output is not what you expect?
Aside from piping it from the first one, there certainly will be opportunities to fulfill your requirements in one single awk script.
Why don't you post the second awk script, or, the output you expect?

Hello RudiC,

Simply it seems the 2nd script doesnt execute.

The 2nd awk script verifies if the fields are different to the "correct values" that they should have configured, so using the same sample, the output of your script gives:

MSISDN,OCTDP,TCTDP,GPRSTDP,GCSO,MCSO,SSLO,GC2SO,MC2SO,TIF,GPRSSO,OSMSSO,GC3SO,MC3SO,GC4SO,MC4SO,MMSO
20385503286,|2|15|2038759581|0|2,|18|15|2038759581|1|2|Y,,0,5,0,1,2,0,1,0,1,0,1,1,1
20386671719,|2|15|2038759581|0|2|7,|18|15|2038759594|1|2|Y,|18|15|20387592403|0|3,0,5,0,1,2,0,1,0,1,2,1,2,0
20387323200,,|18|15|2038759581|1|2|Y,|18|15|20387592403|0|3,1,3,0,1,2,0,1,1,0,2,2,2,2
20385503289,No Values

well, this should be the input for the next awk script that will compare concatenating all fields from 2 to last field in order to determine which one is different based on the parameters in red.

awk '{  x = substr($0, index($0, ","))
        gsub(/,/, "", x)
        if(x !~ /2\|15\|2038759581\|0\|2\|7,18\|15\|2038759594\|1\|2\|Y,18\|15\|20387592403\|0\|30501201012120/) print $0, "Different"
}' OFS=","    Input1 > output1.csv 

The output I have so far is (runnig one awk script after another):

MSISDN,OCTDP,TCTDP,GPRSTDP,GCSO,MCSO,SSLO,GC2SO,MC2SO,TIF,GPRSSO,OSMSSO,GC3SO,MC3SO,GC4SO,MC4SO,MMSO,STATE
20385503286,2|15|2038759581|0|2,18|15|2038759581|1|2|Y,,0,5,0,1,2,0,1,0,1,0,1,1,1,Different
20386671719,2|15|2038759581|0|2|7,18|15|2038759594|1|2|Y,18|15|20387592403|0|3,0,5,0,1,2,0,1,0,1,2,1,2,0,
20387323200,,18|15|2038759594|1|2|Y,18|15|20387592403|0|3,1,3,0,1,2,0,1,1,0,2,2,2,2,Different
20385503289,No_values,,,,,,,,,,,,,,,,,Different

This 2nd awk code works saying me which number has different "something" in its values, but doesn't especify which paarmeter/parameters is/are different.
If the number has something different in its parameters, I would like the same output shown above, but adding the suffix "_I" for the parameter that is different as below:

MSISDN,OCTDP,TCTDP,GPRSTDP,GCSO,MCSO,SSLO,GC2SO,MC2SO,TIF,GPRSSO,OSMSSO,GC3SO,MC3SO,GC4SO,MC4SO,MMSO,STATE
20385503286,2|15|2038759581|0|2_I,18|15|2038759581|1|2|Y_I,_I,0,5,0,1,2,0,1,0,1,0_I,1,1_I,1_I,Different
20386671719,2|15|2038759581|0|2|7,18|15|2038759594|1|2|Y,18|15|20387592403|0|3,0,5,0,1,2,0,1,0,1,2,1,2,0,
20387323200,_I,18|15|2038759594|1|2|Y,18|15|20387592403|0|3,1_I,3_I,0,1,2,0,1,1_I,0_I,2,2_I,2,2_I,Different
20385503289,No_values,,,,,,,,,,,,,,,,

But, If adding the "_I" to the output becomes too much difficult, please only help me if it possible, to join the 2 awk scripts I have so far, in a single one :).

Thanks in advance for any help.

I'd propose to put the line to be compared to into another file like:

$ cat tobe
20386671719,2|15|2038759581|0|2|7,18|15|2038759594|1|2|Y,18|15|20387592403|0|3,0,5,0,1,2,0,1,0,1,2,1,2,0

and then run this modified code against both files:

awk 'BEGIN {split("OCTDP TCTDP GPRSTDP",TDPAr);
            print "MSISDN,OCTDP,TCTDP,GPRSTDP,GCSO,MCSO,SSLO,GC2SO,MC2SO,TIF,GPRSSO,OSMSSO,GC3SO,MC3SO,GC4SO,MC4SO,MMSO,STAT"
     }

     NR==FNR     {cnt = split ($0, cmpAr, ","); next}

     /^MSISDN/   {getline; rec=$0}
     /^TDPTYPE/  {noflds=0
                  while (1)
                     {getline; if ($0 ~ /^$/) break
                      OFS="|"; tmp=$1; $1=""; sub("\|",""); tmpAr[tmp]=$0; noflds=3 ;
                     }
                  for (i=1; i<=noflds; i++) rec=rec","tmpAr[TDPAr]
                 }
     /^(GCSO|GC3SO|MMSO)/ {
                  getline; OFS=","; $1=$1; rec=rec","$0}
     /^NONE/     {getline; rec=rec ",no values"}
     /END/       {split (rec, tmpAr, ","); rec=tmpAr[1]
                  for (i=2;i<=cnt;i++) {if (tmpAr != cmpAr) {tmpAr=tmpAr"_I"; dif=1} rec=rec","tmpAr}
                  if (dif) rec=rec",different"
                  print rec; rec=""; delete tmpAr; dif=0}
    ' tobe file
MSISDN,OCTDP,TCTDP,GPRSTDP,GCSO,MCSO,SSLO,GC2SO,MC2SO,TIF,GPRSSO,OSMSSO,GC3SO,MC3SO,GC4SO,MC4SO,MMSO,STAT
20385503286,2|15|2038759581|0|2_I,18|15|2038759581|1|2|Y_I,_I,0,5,0,1,2,0,1,0,1,0_I,1,1_I,1_I,different
20386671719,2|15|2038759581|0|2|7,18|15|2038759594|1|2|Y,18|15|20387592403|0|3,0,5,0,1,2,0,1,0,1,2,1,2,0
20387323200,_I,18|15|2038759581|1|2|Y_I,18|15|20387592403|0|3,1_I,3_I,0,1,2,0,1,1_I,0_I,2,2_I,2,2_I,different
20385503289,no values_I,_I,_I,,_I,,_I,_I,,_I,,_I,_I,_I,_I,,different

Hello again RudiC,

Thank you for your effort to help me!.

It seems to work, but only if file "tobe" has only one line to be compared and actually I need to compare 4 lines. The difference between 4 lines are highlighted in red.

x,|2|15|2038759581|0|2|7,|18|15|2038759594|1|2|Y,|18|15|20387592403|0|3,0,5,0,1,2,0,1,0,1,2,1,2,0
x,|2|15|2038759581|0|2|7,|18|15|2038759594|1|2|Y,|18|15|20387592407|0|3,0,5,0,1,2,0,1,0,1,2,1,2,0
x,|2|15|2038759581|0|2|7,|18|15|2038759594|1|2|Y,|18|15|20387592408|0|3,0,5,0,1,2,0,1,0,1,2,1,2,0
x,|2|15|2038759581|0|2|7,|18|15|2038759594|1|2|Y,|18|15|20387592409|0|3,0,5,0,1,2,0,1,0,1,2,1,2,0

In the 2nd script I was doing an OR within the IF like below:

if(x !~ /2\|15\|2038759581\|0\|2\|7,18\|15\|2038759594\|1\|2\|Y,18\|15\|2038759240(3|7|8|9)\|0\|30501201012120/) print $0, "Different"

I'm attaching a little bit greater sample, for which the output should be like below:

MSISDN,OCTDP,TCTDP,GPRSTDP,GCSO,MCSO,SSLO,GC2SO,MC2SO,TIF,GPRSSO,OSMSSO,GC3SO,MC3SO,GC4SO,MC4SO,MMSO,STATUS
20385503286,2|15|2038759581|0|2_I,18|15|2038759581|1|2|Y_I,_I,0,5,0,1,2,0,1,0,1,0_I,1,1_I,1_I,Different
20386671719,2|15|2038759581|0|2|7,18|15|2038759594|1|2|Y,18|15|20387592403|0|3,0,5,0,1,2,0,1,0,1,2,1,2,0
20387323200,_I,18|15|2038759594|1|2|Y,18|15|20387592403|0|3,1,3,0,1,2,0,1,1,0,2,2,2,2,Different
20385503289,No_values
20386671088,|2|15|2038759581|0|2|7,|18|15|2038759594|1|2|Y,|18|15|20387592408|0|3,0,5,0,1,2,0,1,0,1,2,1,2,0
20386671452,|2|15|2038759581|0|2|7,|18|15|2038759594|1|2|Y,|18|15|20387592407|0|3,0,5,0,1,2,0,1,0,1,2,1,2,0
20386671490,|2|15|2038759581|0|2|7,|18|15|2038759594|1|2|Y,|18|15|20387592409|0|3,0,5,0,1,2,0,1,0,1,2,1,2,0

I hope is not to complicated for you to fix this part :slight_smile:

Many thanks in advance.

Regards.

Why don't you give it a trial yourself? Read n lines from tobe into a two dimensional array (split into tmp array, copy into 2dim array), and then, in the comparison part, replace the unequality test by a loop through the 2dim array's elements?

Hello and thanks for your great help RudiC,

I'll try to get what you suggest me to add it in the script. I hope I can do it.

Thanks again for all support!

Regards

If you encounter a problem, post your trial here so we can assist you to help yourself.

Hello RudiC again!

I'm not sure how to load the array with more lines, it should be like this?

NR==FNR     {cnt[NR] = split ($0, Temp[NR], ","); next}

Maybe you can help me with how to load correctly the array and then I think I'll be in better way to insert it in comparison part.

Thanks for any help.

Regards

Well, try it like (untested):

NR==FNR     {cnt[NR] = split ($0, tmp, ","); for (i=1;i<=cnt[NR]; i++) cmpAr[NR,i]=tmp; next}

You now have the lines NR split into cmpAr [NR...] ready for the testing loop.