Help tabulating file putting repeated strings as headers

Ophiuchus · April 11, 2018, 10:44pm

Hi. May somebody help me with this.

I�m trying to tabulate the following input file, but the desired output I�m getting is incorrect.

I have access to GNU/LINUX (Ubuntu) and Cygwin

Input file

STAGE = 1
ID = 0
NAME = JFMSC
TYPE = MLRR
DFRUL = PERMISSION
ADDR = 1001
RRUL = PERMISSION
SPRR = TRUE
ISGALW = FALSE
ISUTWD = FALSE

STAGE = 1
ID = 2
NAME = PLLSJS
TYPE = MLRR
DFRUL = PERMISSION

STAGE = 1
ID = 4
NAME = AAAARR
TYPE = MLRR
DFRUL = RESTRICT
ADDR = 3553
RRUL = PERMISSION
SPRR = FALSE
ISGALW = FALSE
ISUTWD = FALSE
ADDR = 66444
RRUL = PERMISSION
SPRR = FALSE
ISGALW = FALSE
ISUTWD = FALSE
ADDR = 890087
RRUL = PERMISSION
SPRR = FALSE
ISGALW = FALSE
ISUTWD = FALSE

STAGE = 1
ID = 0
NAME = PPROOA
TYPE = RRHN
DFRUL = PERMISSION
ADDR = 7034
RRUL = PERMISSION
SPRR = FALSE
ISGALW = FALSE
ISUTWD = FALSE

This is the code I was able to construct so far

awk 'BEGIN{print "STAGE|ID|NAME|TYPE|DFRUL|ADDR|RRUL|SPRR|ISGALW|ISUTWD"}
/ID/{a=$3}
/NAME/{b=$3}
/TYPE/{c=$3}
/DFRUL/{d=$3}
/ADDR/{f=$3}
/RRUL/{g=$3}
/SPRR/{h=$3}
/ISGALW/{i=$3}
/ISUTWD/{j=$3
  print a"|"b"|"c"|"d"|"f"|"g"|"h"|"i"|"j
}
' file.txt

My current output

STAGE|ID|NAME|TYPE|DFRUL|ADDR|RRUL|SPRR|ISGALW|ISUTWD
0|JFMSC|MLRR|PERMISSION|1001|PERMISSION|TRUE|FALSE|FALSE
4|AAAARR|MLRR|RESTRICT|3553|PERMISSION|FALSE|FALSE|FALSE
4|AAAARR|MLRR|RESTRICT|66444|PERMISSION|FALSE|FALSE|FALSE
4|AAAARR|MLRR|RESTRICT|890087|PERMISSION|FALSE|FALSE|FALSE
0|PPROOA|RRHN|PERMISSION|7034|PERMISSION|FALSE|FALSE|FALSE

Desired output

STAGE|ID|NAME|TYPE|DFRUL|ADDR|RRUL|SPRR|ISGALW|ISUTWD
1|0|JFMSC|MLRR|PERMISSION|1001|PERMISSION|TRUE|FALSE|FALSE
1|2|PLLSJS|MLRR|PERMISSION|||||
1|4|AAAARR|MLRR|RESTRICT|3553|PERMISSION|FALSE|FALSE|FALSE
|||||66444|PERMISSION|FALSE|FALSE|TRUE
|||||890087|PERMISSION|FALSE|TRUE|FALSE
1|0|PPROOA|RRHN|PERMISSION|7034|PERMISSION|FALSE|FALSE|FALSE

Thanks in advance.

RudiC · April 12, 2018, 2:46am

Extending your attempt:

awk '
BEGIN           {print "STAGE|ID|NAME|TYPE|DFRUL|ADDR|RRUL|SPRR|ISGALW|ISUTWD"
                }
/STAGE/         {x=$3}
/ID/            {a=$3}
/NAME/          {b=$3}
/TYPE/          {c=$3}
/DFRUL/         {d=$3}
/ADDR/          {f=$3}
/RRUL/          {g=$3}
/SPRR/          {h=$3}
/ISGALW/        {i=$3}
/ISUTWD/        {j=$3
                 print x"|"a"|"b"|"c"|"d"|"f"|"g"|"h"|"i"|"j
                 a = b = c = d = e = f = g = h = i = j = x = ""
                }
' file
STAGE|ID|NAME|TYPE|DFRUL|ADDR|RRUL|SPRR|ISGALW|ISUTWD
1|0|JFMSC|MLRR|PERMISSION|1001|PERMISSION|TRUE|FALSE|FALSE
1|4|AAAARR|MLRR|RESTRICT|3553|PERMISSION|FALSE|FALSE|FALSE
|||||66444|PERMISSION|FALSE|FALSE|FALSE
|||||890087|PERMISSION|FALSE|FALSE|FALSE
1|0|PPROOA|RRHN|PERMISSION|7034|PERMISSION|FALSE|FALSE|FALSE

Ophiuchus · April 12, 2018, 11:42am

Thanks so much RudiC.

Clear the variables was the key!

In this case there are 11 headers, but if there are N headers for which I want to get the value in $3, how would be a way to make a shortest script and avoid write this up to N?

headerNth = /StringNth/ {Nth=$3}

A kind of loop?

RudiC · April 12, 2018, 2:10pm

This

awk -F"[ =]+"  '
BEGIN           {HD = "STAGE|ID|NAME|TYPE|DFRUL|ADDR|RRUL|SPRR|ISGALW|ISUTWD"
                 for (MX=n=split (HD, HDArr, "|"); n>0; n--) SRCH[HDArr[n]]
                 print HD
                }

!NF             {for (i=1; i<=MX; i++) printf "%s%s", RES[HDArr], (i == MX)?ORS:OFS
                 delete RES
                }

$1 in SRCH      {RES[$1] = $2
                }

END             {for (i=1; i<=MX; i++) printf "%s%s", RES[HDArr], (i == MX)?ORS:OFS
                }

' OFS="|" file
STAGE|ID|NAME|TYPE|DFRUL|ADDR|RRUL|SPRR|ISGALW|ISUTWD
1|0|JFMSC|MLRR|PERMISSION|1001|PERMISSION|TRUE|FALSE|FALSE
1|2|PLLSJS|MLRR|PERMISSION|||||
1|4|AAAARR|MLRR|RESTRICT|890087|PERMISSION|FALSE|FALSE|FALSE
1|0|PPROOA|RRHN|PERMISSION|7034|PERMISSION|FALSE|FALSE|FALSE

will extract data based on the header fields and is thus very flexible. Unfortunately you didn't define the record terminator which would define the point at which to print a line. And, the records are non-uniformly structured: you have "incomplete" records (rec 2) and records with duplicate field entries (rec 3). You need to define very clearly and accurately what to extract and what to print, and adapt the script accordingly.

rdrtx1 · April 12, 2018, 2:11pm

awk '
BEGIN {lines=0; column_count=0}
$2 !~ /=/ || NF != 3 {next}
! column[$1]++ {columns[column_count++]=$1}
$1 ~ /^STAGE*$/ {lines++}
{column_data[$1, lines]=$3}
END {
   for (i=0; i<column_count; i++) if (columns) printf columns ((i<column_count-1) ? "|" : "\n")
   for (i=1; i <= lines; i++) {
      for (j=0; j < column_count; j++) {
         if (columns[j]) printf column_data[columns[j], i] ((j<column_count-1) ? "|" : "\n")
      }
   }
}
' infile

Ophiuchus · April 12, 2018, 2:52pm

I'm surprised when an awk script is written in that way, I almost don't understan the logic to customize it.

The record initiator will be always the line STAGE = 1
The Record terminator will be the empty line before the next Record Initiator. The issue is that in the original file the records with duplicate fields (rec 3) has empty lines before the occurence of next Record Initiator. And between Record Terminator and next Record Initiator (in green) could be more than one empty line and some garbaje lines (in red).

At the end of file appears and END string.

This would be a better representation of input file:

some garbaje
some garbaje
some garbaje

STAGE = 1
ID = 0
NAME = JFMSC
TYPE = MLRR
DFRUL = PERMISSION
ADDR = 1001
RRUL = PERMISSION
SPRR = TRUE
ISGALW = FALSE
ISUTWD = FALSE

some garbaje
some garbaje

STAGE = 1
ID = 2
NAME = PLLSJS
TYPE = MLRR
DFRUL = PERMISSION

some garbaje
some garbaje

STAGE = 1
ID = 4
NAME = AAAARR
TYPE = MLRR
DFRUL = RESTRICT
ADDR = 3553
RRUL = PERMISSION
SPRR = FALSE
ISGALW = FALSE
ISUTWD = FALSE

ADDR = 66444
RRUL = PERMISSION
SPRR = FALSE
ISGALW = FALSE
ISUTWD = FALSE

ADDR = 890087
RRUL = PERMISSION
SPRR = FALSE
ISGALW = FALSE
ISUTWD = FALSE

some garbaje
some garbaje

STAGE = 1
ID = 0
NAME = PPROOA
TYPE = RRHN
DFRUL = PERMISSION
ADDR = 7034
RRUL = PERMISSION
SPRR = FALSE
ISGALW = FALSE
ISUTWD = FALSE

---    END

Ophiuchus · April 14, 2018, 10:57pm

The duplicated fields within a record should be printed in different lines in the output. In the output, the records that don't have fields repeated only have one line.

See below output (lines in green are the values for record 3) and input of my post #6.

STAGE|ID|NAME|TYPE|DFRUL|ADDR|RRUL|SPRR|ISGALW|ISUTWD
1|0|JFMSC|MLRR|PERMISSION|1001|PERMISSION|TRUE|FALSE|FALSE
1|4|AAAARR|MLRR|RESTRICT|3553|PERMISSION|FALSE|FALSE|FALSE
|||||66444|PERMISSION|FALSE|FALSE|FALSE
|||||890087|PERMISSION|FALSE|FALSE|FALSE
1|0|PPROOA|RRHN|PERMISSION|7034|PERMISSION|FALSE|FALSE|FALSE

Chubler_XL · April 15, 2018, 10:17pm

Update to rdrtx1's solution (in red) to cover the duplicate field(s) qualification.

awk '
BEGIN {lines=0; column_count=0}
$2 !~ /=/ || NF != 3 {next}
! column[$1]++ {columns[column_count++]=$1}
$1 ~ /^STAGE*$/ || $1 SUBSEP lines in column_data{lines++}
{column_data[$1, lines]=$3}
END {
   for (i=0; i<column_count; i++) if (columns) printf columns ((i<column_count-1) ? "|" : "\n")
   for (i=1; i <= lines; i++) {
      for (j=0; j < column_count; j++) {
         if (columns[j]) printf column_data[columns[j], i] ((j<column_count-1) ? "|" : "\n")
      }
   }
}
' infile

Ophiuchus · April 16, 2018, 1:28am

Excellent rdrtx1 for yout solution and Chubler_XL for the modification. With both solution it seems to work pretty nice even when in real file the strings have leading spaces.

Chubler_XL,

I see this solution has several arrays. Is there a tool to test and see what has stored each part of an awk program, the content of the arrays, etc?

Something similar to IRB(interactive Ruby) which we can test small parts of the ruby program. This in order to understand better the logic and how it works.

Thanks for help.

Chubler_XL · April 16, 2018, 7:42pm

I'm not aware of any tool like that. You could simply print out the contents of the arrays in the END block like this:

awk '
BEGIN {lines=0; column_count=0}
$2 !~ /=/ || NF != 3 {next}
! column[$1]++ {columns[column_count++]=$1}
$1 ~ /^STAGE*$/ || $1 SUBSEP lines in column_data{lines++}
{column_data[$1, lines]=$3}
END {
   for (i in column) print "column[" i "]="column
   for (i in columns) print "columns[" i "]="columns
   for (i in column_data) {
     k=i
     gsub(SUBSEP,",",k)
     print "column_data[" k "]="column_data
   }
}
' infile

result:

column[ADDR]=5
column[ISGALW]=5
column[RRUL]=5
column[TYPE]=4
...
columns[0]=STAGE
columns[1]=ID
columns[2]=NAME
columns[3]=TYPE
columns[4]=DFRUL
columns[5]=ADDR
...
column_data[NAME,6]=PPROOA
column_data[SPRR,1]=TRUE
column_data[ID,1]=0
column_data[SPRR,3]=FALSE
column_data[ID,2]=2
column_data[SPRR,4]=FALSE
...

You can see that:
column[] is index by the column name and contains a count of the number of rows the column appears in (less 1)
columns[] is indexed by the column order and contains the column name.
column_data[] is a 2D array with index of <column name>,<row> and contains value of the data for that cell.

column[] is only used to detect the first occurrence of a column and add it to columns[]

Ophiuchus · April 17, 2018, 1:15am

Excellent ChublerXL. Thanks for the lesson, specially that comes from someone that knows a lot. I'll play a little bit with this example of arrays.

Thanks to all.