Data Normalization

meetsriharsha · February 17, 2013, 3:29am

Hi, there
Need help on rearranging the data.
I have data in the following format.

LAC

040
DN

24001001
EQN

920-
2-
0-
1
CAT

MS
OPTRCL

3
TRARSTR

ACTTRACL
TRACLACT
TRACLMOD
LNATT

PB
COS

CLIP

Immediate row before the symbol "=" is like Column heading. The rows after "=" are like corresponding values.

I need to rearrange the data as follows.

LAC|DN|EQN|CAT|OPTRCL|TRARSTR|LNATT|COS
040|24001001|920- 2- 0- 1|MS|3|ACTTRACL TRACLACT TRACLMOD|PB|CLIP

Please help me out.
Thanks

Don_Cragun · February 17, 2013, 4:39am

Since you didn't use code tags, I'm not sure that I have correctly interpreted what your input looks like, but this might work:

awk '
/=/ {   if(hc) f[hc] = s
        h[++hc] = last
        last = s = ""
        lc = fc = 0
        next
}
{       if(lc) {
                if(fc)  s = s " " last
                else {  fc = 1
                        s = last
                }
        } 
        lc = 1
        last = $0
}
END {   for(i = 1; i <= hc; i++) printf("%s%s", h, i == hc ? "\n" : "|")
        if(lc && sc) f[hc] = s " " last
        else    f[hc] = last
        for(i = 1; i <= hc; i++) printf("%s%s", f, i == hc ? "\n" : "|")
}' data

As always, if you are using a Solaris/SunOS system, use /usr/xpg4/bin/awk or nawk instead of awk .

RudiC · February 17, 2013, 5:21am

PLEASE use code tags for code and data as required by forum rules!

This has become uglier than what I went for, but at least for your special problem, it might work:

awk     '$1=="=" {header = header M1 "|"        # if "=" encountered, use "dragging" copy, i.e. last line's $1
                  row = row "|"                 # insert "|" separator into row
                  next                          # no further action on this line
                 }
                 {row = row " " $1; M1 = $1}    # collect everything into row; keep a "dragging" copy of $1 in M1
         END     {gsub (/ [^ ]*\| /, "|", row)  # remove header fields from row
                  sub  (/^\|/, "", row)         # remove leading "|" from row
                  sub  (/\|$/,"", header)       # remove trailing "|" from header
                  print header; print row}      # print it
        ' file
LAC|DN|EQN|CAT|OPTRCL|TRARSTR|LNATT|COS
040|24001001|920- 2- 0- 1|MS|3|ACTTRACL TRACLACT TRACLMOD|PB|CLIP

---------- Post updated at 11:21 ---------- Previous update was at 10:57 ----------

Well, this might be a bit more straightforward and easier to understand...

awk     '       {tot = tot " " $1}                      # collect everything into one long line
         END    {gsub (/ [^ ]* =/, "|&", tot)           # separate entries by "|"
                 n = split (tot, Ar1, "\| ")            # split line by "|", remember field count
                 for (i=2; i<=n; i++)                   # every field but first (artificial, empty field)
                        {split (Ar1, Ar2, " = ")     # split into header/row part
                         head = head (i>2?"|":_) Ar2[1] # and create header 
                         row  = row  (i>2?"|":_) Ar2[2] # and row (both avoiding leading "|")
                        }
                 print head; print row                  # print both
                }
        ' file

meetsriharsha · February 17, 2013, 6:07am

The second one is not giving proper output, but the first one does. Can you please explain the first code.?

RudiC · February 17, 2013, 6:12am

This is the output of the second proposal - what is wrong, what are you missing?

LAC|DN|EQN|CAT|OPTRCL|TRARSTR|LNATT|COS
040|24001001|920- 2- 0- 1|MS|3|ACTTRACL TRACLACT TRACLMOD|PB|CLIP

Added comments to above post...

meetsriharsha · February 17, 2013, 6:38am

For the second one, the output is as follows.

LAC|=|040||DN|=|24001001||EQN|=|920-|2-|0-|1||CAT|=|MS||OPTRCL|=|3||SUBTRCL|=|3||NUMCAL|=|1||ORIG1|=|1||ORIG2|=|1||TRARSTR|=|ACTTRACL|TRACLACT|TRACLMOD||LNATT|=|PB||COS|=|CLIP||ADDINF|=|KEYWORD||LAC|=|040||DN|=|24001003||EQN|=|1010-|1-|4-13||CAT|=|MS||OPTRCL|=|12||NUMCAL|=|1||ORIG1|=|7||ORIG2|=|7||BLK|=|ACCSPORI||TRARSTR|=|RSTSCI1|RSTSCI4||LNATT|=|PB||COS|=|CLIP||LAC|=|040||DN|=|24001004||EQN|=|920-|2-|5-|6||CAT|=|MS||OPTRCL|=

Anyways, thanks for your concern. I wrote a big script to rearrange this data.

rm output.txt 2> /dev/null;
heading="";
valu="";
preline="#";
outhead="";
lnum=0;
lines=$(cat har4.txt | wc -l);
#echo "No. of lines: "$lines;
limit=`expr $lines - 2`;
#echo "Limit is: "$limit;
while read line
do
lnum=`expr $lnum + 1`;
if [ $line == "LAC" ]; then
outhead=$outhead"|"$heading;
echo $outhead >> output.txt;
echo $valu >> output.txt;
outhead="";
#echo "In first if";
heading="LAC";
#echo "Heading: "$heading;
valu="";
else
if [ $line == "=" ]; then
#echo "in second if";
if [ $preline != "LAC" ]; then
heading=$heading"|"$preline;
fi
#echo "Heading: "$heading;
a=$(echo $valu | awk -F "|" '{print length($NF)+1;}');
#echo "a value: "$a;
valu=$(echo $valu | awk -v ln=$a '{print substr($0, 1, length($0)-ln)}');
#echo "value: "$valu;
fi
fi
if [ $line != "LAC" ] && [ $line != "=" ]; then
if [ $lnum -le $limit ]; then
#echo "Lnum is: "$lnum;
nexfstlineno=`expr $lnum + 1`;
nexscndlineno=`expr $lnum + 2`;
#echo "Next Fst No: "$nexfstlineno;
#echo "Next Scnd No: "$nexscndlineno;
nexfstline=$(sed -n "${nexfstlineno}{p;q;}" har4.txt);
nexscndline=$(sed -n "${nexscndlineno}{p;q;}" har4.txt);
#echo "Next first line: "$nexfstline;
#echo "Next second line: "$nexscndline;
if [ $nexscndline == "=" ] && [ $nexfstline != "=" ] && [ $preline != "=" ]; then
valu=$valu"#"$line;
else
if [ $nexscndline != "=" ] && [ $nexfstline != "=" ] && [ $preline != "=" ]; then
valu=$valu"#"$line;
#echo "Value is: "$valu;
else
#echo "in third if";
valu=$valu"|"$line;
#echo "in third if, value is: "$valu
fi
fi
else
valu=$valu"|"$line;
#echo "in third if, value is: "$valu;
fi
fi
preline=$line;
done < har4.txt
awk '{gsub("-#","-");gsub("#"," ");print}' output.txt > final_output.txt;

here har4.txt is the input file.

output is:

|LAC|DN|EQN|CAT|OPTRCL|NUMBCH|NUMCAL|ORIG1|ORIG2|LNATT|COS|ADDINF|SERV|SERV|SERV|SERV|SERV|SERV|SERV|SERV|SERV|SERV
|040|24001279|900-1-1-5|IBA|1|2|2|95|1|MSN|INHTROFF STIPROT FCTPROT|MULTIEQN|CMSPEECH|CMUNRST|CMAUDIO|CMAUDI7K|TEL3K|TEL7K|FAX4|TTX64K|VIDEOTEX|VIDEOTEL
|LAC|DN|EQN|CAT|OPTRCL|NUMBCH|NUMCAL|ORIG1|ORIG2|LNATT|COS|ADDINF|SERV|SERV|SERV|SERV|SERV|SERV|SERV|SERV|SERV|SERV
|040|24001280|900-1-1-5|IBA|1|2|2|95|1|MSN|INHTROFF STIPROT FCTPROT|MULTIEQN|CMSPEECH|CMUNRST|CMAUDIO|CMAUDI7K|TEL3K|TEL7K|FAX4|TTX64K|VIDEOTEX|VIDEOTEL
|LAC|DN|EQN|CAT|OPTRCL|NUMBCH|NUMCAL|ORIG1|ORIG2|LNATT|COS|ADDINF|SERV|SERV|SERV|SERV|SERV|SERV|SERV|SERV|SERV|SERV
|040|24001281|900-1-2-4|IBA|3|2|2|1|1|MSN|INHTROFF MSNDEFDN STIPROT FCTPROT|MULTIEQN|CMSPEECH|CMUNRST|CMAUDIO|CMAUDI7K|TEL3K|TEL7K|FAX4|TTX64K|VIDEOTEX|VIDEOTEL
|LAC|DN|EQN|CAT|OPTRCL|NUMBCH|NUMCAL|ORIG1|ORIG2|LNATT|COS|ADDINF|SERV|SERV|SERV|SERV|SERV|SERV|SERV|SERV|SERV|SERV
|040|24001282|900-1-2-4|IBA|1|2|2|95|1|MSN|INHTROFF STIPROT FCTPROT|MULTIEQN|CMSPEECH|CMUNRST|CMAUDIO|CMAUDI7K|TEL3K|TEL7K|FAX4|TTX64K|VIDEOTEX|VIDEOTEL

busyboy · February 18, 2013, 4:03am

rudic:

PLEASE use code tags for code and data as required by forum rules!

This has become uglier than what I went for, but at least for your special problem, it might work:

awk     '$1=="=" {header = header M1 "|"        # if "=" encountered, use "dragging" copy, i.e. last line's $1
   row = row "|"                 # insert "|" separator into row
   next                          # no further action on this line
   }
   {row = row " " $1; M1 = $1}    # collect everything into row; keep a "dragging" copy of $1 in M1
   END     {gsub (/ [^ ]*\| /, "|", row)  # remove header fields from row
   sub  (/^\|/, "", row)         # remove leading "|" from row
   sub  (/\|$/,"", header)       # remove trailing "|" from header
   print header; print row}      # print it
   ' file
LAC|DN|EQN|CAT|OPTRCL|TRARSTR|LNATT|COS
040|24001001|920- 2- 0- 1|MS|3|ACTTRACL TRACLACT TRACLMOD|PB|CLIP

---------- Post updated at 11:21 ---------- Previous update was at 10:57 ----------

Well, this might be a bit more straightforward and easier to understand...

awk     '       {tot = tot " " $1}                      # collect everything into one long line
   END    {gsub (/ [^ ]* =/, "|&", tot)           # separate entries by "|"
   n = split (tot, Ar1, "\| ")            # split line by "|", remember field count
   for (i=2; i<=n; i++)                   # every field but first (artificial, empty field)
   {split (Ar1, Ar2, " = ")     # split into header/row part
   head = head (i>2?"|":_) Ar2[1] # and create header 
   row  = row  (i>2?"|":_) Ar2[2] # and row (both avoiding leading "|")
   }
   print head; print row                  # print both
   }
   ' file

sub(/^ \|/,"", row)

is needed as well for this code.

summer_cherry · February 21, 2013, 4:10am

import re
arr=[]
col=[]
val=[]
with open("a.txt","r") as f:
 for line in f:
  line=re.sub("\n","",line)
  if line.find("=")<0:
   arr.append(line)
  else:
   if len(arr)==1:
     col.append(arr[0])
   else:
    value=" ".join(arr[:-1])
    val.append(value)
    col.append(arr[-1])
   arr=[]

val.append(arr[0])
print(" | ".join(col))
print(" | ".join(val))

Data Normalization

LAC

040 DN

24001001 EQN

920- 2- 0- 1 CAT

MS OPTRCL

3 TRARSTR

ACTTRACL TRACLACT TRACLMOD LNATT

PB COS