Transpose data from columns to lines for each event

Hi everyone,

Maybe somebody could help me with this.

I have a text file with two columns recording the services used by customers at a commercial place.

The record for the use of any particular service begins with "EVENT" in column 1.
I would like to transpose the info for each block onto one line: each distinct
label in column 1 should appear only once, as a header, and the data from column 2
should appear below its respective header, one line per block.

Source file (not all blocks have the same labels in column 1; some have more than others):

EVENT                                
INTERNET CONNECTION                  
Date                       11/01/2009
Initial hour               07:30     
Number of users            27        
Average of use             32 min    
Final hour                 19:00     
 
EVENT                                
LOCAL CALL                           
Date                       11/01/2009
Initial hour               07:42     
Number of users            15        
Average of use             7 min     
Final hour                 16:11     
 
EVENT                                
INTERNATIONAL CALL                   
Date                       11/01/2009
Initial hour               09:14     
Number of users            21        
Average of use             5 min     
Final hour                 16:17     
 
EVENT                                
PRINTER USE                          
Date                       12/01/2009
Initial hour               07:30     
Number of users            23        
Average of pages printed   17        
Final hour                 19:00     

I would like to tabulate it as follows:

 
EVENT                   Date    Initial hour Number of users Average of use Average of pages printed  Final hour
INTERNET CONNECTION  11/01/2009    07:30             27          32 min                                   19:00   
LOCAL CALL           11/01/2009    07:42             15           7 min                                   16:11   
INTERNATIONAL CALL   11/01/2009    09:14             21           5 min                                   16:17   
PRINTER USE          12/01/2009    07:30             23                                  17               19:00   

So far I know that if I use:

 
awk '/INTERNET CONNECTION/ { getline; print $2}' Input_1.txt
awk '/LOCAL CALL/ { getline; print $2}' Input_1.txt
awk '/INTERNATIONAL CALL/ { getline; print $2}' Input_1.txt
awk '/PRINTER USE/ { getline; print $2}' Input_1.txt

the result is the date (column 2) for each record:
11/01/2009
11/01/2009
11/01/2009
12/01/2009

But how can I continue to get what I want (the other values below their respective headers)?

Thanks in advance for any help.

Best regards.
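For reference, the whole transformation the question describes can be sketched in one awk pass. This is a hypothetical sketch, not a posted solution; it assumes labels and values are separated by three or more spaces and that every block starts with a line beginning with EVENT:

```shell
# Build a small sample with the same shape as the file in the question
cat > sample.txt <<'EOF'
EVENT
INTERNET CONNECTION
Date                       11/01/2009
Initial hour               07:30
EVENT
PRINTER USE
Date                       12/01/2009
Average of pages printed   17
EOF

# One pass: remember column order on first sight, one row per EVENT block
awk '
/^EVENT/ { nrow++; getline; sub(/ +$/, ""); row[nrow, "EVENT"] = $0; next }
match($0, /   +/) {
    name = substr($0, 1, RSTART - 1)                 # column-1 label
    val  = substr($0, RSTART + RLENGTH)              # column-2 value
    sub(/ +$/, "", val)
    if (!(name in seen)) { seen[name] = 1; order[++ncol] = name }
    row[nrow, name] = val
}
END {
    printf "EVENT"
    for (c = 1; c <= ncol; c++) printf "\t%s", order[c]
    print ""
    for (r = 1; r <= nrow; r++) {
        printf "%s", row[r, "EVENT"]
        for (c = 1; c <= ncol; c++) printf "\t%s", row[r, order[c]]
        print ""
    }
}
' sample.txt > table.txt
```

Missing values (like "Average of pages printed" for the first block) simply come out as empty tab-separated cells.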

Hi cgkmal,

Do you expect the file to contain a dynamic list of headers? Or will the headers be fixed, i.e. always the same every time?

If they are fixed, we could just hard-code the header part as well as the other fixed parts of the table (i.e. the values under the EVENT column).

It would help if you could identify the fixed parts of the table (if there are any).

angheloko,

Thanks for your answer. Yes, the words in column 1 that become headers are always the same. The only thing is that some EVENT blocks have fewer entries than others; I mean, not all blocks will have a value
below every header.

Many thanks for the help you can give me.

Hi,

Sorry for the delay. I was pretty busy today...doing documentation (oh the pain! every programmer's nightmare). Anyway, here's a very crude implementation:

# Extract data only and separate into files
sed 's/^EVENT//g;s/^[A-Z][A-Z][A-Z]*/EVENT  &/g;/^ *$/d;s/ *$//g;s/   */|/g' input.txt > input2.txt
grep -n EVENT input2.txt | cut -d: -f1 | while read X; do
        START=$X
        ((END=X+5))
        sed -n "${START},${END}p" input2.txt > input2.txt.$X
done

# Compose the headers
sed 's/   */|/g;/^ *$/d' input.txt | awk -F"|" ' { print $1 } ' | sort | uniq -ud > headers.txt
grep -v '[A-Z][A-Z][A-Z]*' headers.txt > colheaders.txt
grep '[A-Z][A-Z][A-Z]*' headers.txt | sed '/^EVENT/d' > rowheaders.txt

# Create the unformatted output
LINE="EVENT     "`cat colheaders.txt | tr "\n" "\t"`
echo "$LINE" > output.txt
cat rowheaders.txt | while read X; do
        echo "---"
        echo "Row: $X"
        FILE=`grep "$X" input2.txt.* | cut -d: -f1`
        echo "File: $FILE"
        LINE="$X"
        cat colheaders.txt | while read Y; do
                echo "$Y"
                LINE="$LINE     "`awk ' BEGIN { FS="|" } $1==key { print $field } ' key="$Y" field=2 $FILE`
        done
        echo ">> $LINE"
        echo "$LINE" >> output.txt
done

# Make it pretty
awk '
BEGIN { FS="\t" }

{ printf ("%-22s%-11s%-13s%-16s%-15s%-26s%-10s\n", $1, $4, $6, $7, $3, $2, $5) }

' output.txt > output2.txt

Basically,
the input file in input.txt
the output file is output2.txt

and some temporary files - input2.txt*, headers.txt, colheaders.txt, rowheaders.txt (just rm them in the end of the script)

Anyway, I had to rush it, so I know there could be simpler solutions, but here you go... Try it yourself...

My output:

EVENT                 Date       Initial hour Number of users Average of use Average of pages printed  Final hour
INTERNATIONAL CALL    11/01/2009 09:14        21              5 min                                    16:17
INTERNET CONNECTION   11/01/2009 07:30        27              32 min                                   19:00
LOCAL CALL            11/01/2009 07:42        15              7 min                                    16:11
PRINTER USE           12/01/2009 07:30        23                             17                        19:00

Hi, this is a little difficult; hope this can help you.

1> Convert your file into a strict two-column file

cvt.sh
sed -n '/EVENT/{
h
N
h
x
s/\n//
p
}
/EVENT/ !{
p
}' yourfile
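To see what cvt.sh actually does, here is a minimal run on a hypothetical two-block sample. Note that the trailing spaces on the EVENT line are what become the separator once the embedded newline is deleted:

```shell
# "EVENT" with trailing spaces, then the event name, then a data line
printf 'EVENT  \nLOCAL CALL\nDate   11/01/2009\n' > demo.txt

# Same sed logic as cvt.sh: join each EVENT line with the line after it
sed -n '/EVENT/{
h
N
h
x
s/\n//
p
}
/EVENT/ !{
p
}' demo.txt > demo.out
```

The first output line is the joined "EVENT  LOCAL CALL"; the data line passes through unchanged.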

2> Use the perl script below to process it

sub _exist{
	my($ref,$value)=(@_);
	my @arr=@{$ref};
	for(my $i=0;$i<=$#arr;$i++){
		return 1 if $arr[$i] eq $value;
	}
	return 0;
}
$/="\n\n";
open FH,"sh cvt.sh|";
my (%res,$n,@seq);
while(<FH>){
	my @arr=split("\n",$_);
	foreach(@arr){
		my @tmp=split(/  +/,$_);
		$res{$.}->{$tmp[0]}=$tmp[1];
		push @seq,$tmp[0] if (_exist(\@seq,$tmp[0])==0);
	}
	$n++;
}
close FH;
print ((join "        ",@seq),"\n");
for($i=1;$i<=$n;$i++){
	my %hash=%{$res{$i}};
	map {printf("%20s",$hash{$_})} @seq;
	print "\n";
}

output:

EVENT        Date        Initial hour        Number of users        Average of use        Final hour        Average of pages printed
 INTERNET CONNECTION          11/01/2009               07:30                  27              32 min               19:00     
          LOCAL CALL          11/01/2009               07:42                  15               7 min               16:11     
  INTERNATIONAL CALL          11/01/2009               09:14                  21               5 min               16:17     
         PRINTER USE          12/01/2009               07:30                  23                                   19:00                  17

Hello angheloko and summer_cherry,

Many thanks for taking some of your time to help me. I got some errors testing your codes; explanations below.

angheloko,

Could you tell me what I'm doing wrong, or how you ran the scripts? I can see it works for you, but for me it doesn't show the complete answer.

If I leave only input.txt and your script in a folder and run it step by step, the behaviour is as follows:

(1)
If I run the first part (# Extract data only and separate into files), it looks good so far and generates:

input2.txt (adding some pipes)
input2.txt.10
input2.txt.18
input2.txt.2
input2.txt.26

(2)
If I run the 2nd part (# Compose the headers), it looks good so far and generates:

colheaders.txt
headers.txt
rowheaders.txt

(3)
If I run the 3rd part (# Create the unformatted output), it looks good so far and generates:

output.txt --> (it looks like it transposes columns into lines, but some odd squares appear in the format)

example:

	input.txt:INTERNET CONNECTION

	input.txt:Date

(4)
If I run the 4th part (# Make it pretty), it generates:

output2.txt

and only shows

EVENT     input.txt:EVENT                                
input.txt:Initial hour               07:30     
input.txt:Average of use             32 min    
input.txt:Final hour                 19:00     
input.txt:Date                       11/01/2009
input.txt:INTERNET CONNECTION                  
input.txt:Number of users            27

(5)
If I run the complete script in a folder with other files in it, I get an error.
(It's not too relevant; I just isolated the files and ran it again. For information only.)

sed: "input.txt", line 30: warning: newline appended
sed: input.txt: cannot open [No such file or directory]
---
[the rest of the screen output was binary garbage]

summer_cherry,

Thanks for your help, really. I tried to test it: the first part seems to work for me without errors,
but when I run the second script I get

[root@trm72 cc]# ./script.pl
./script.pl: line 2: sub: command not found
./script.pl: line 3: syntax error near unexpected token `$ref,$value'
'/script.pl: line 3: `  my($ref,$value)=(@_);
[root@trm72 cc]#

Could you tell me what I'm doing wrong, or how you ran the scripts? I can see it works for you.

Thanks for your help again.

Hi cgkmal,

Could you post the flat file (source file) and the outputs of the script (input2.txt, etc...) so we can isolate which part caused the error.

In my case, when I run the script:

input.txt (source file):

EVENT
INTERNET CONNECTION
Date                       11/01/2009
Initial hour               07:30
Number of users            27
Average of use             32 min
Final hour                 19:00

EVENT
LOCAL CALL
Date                       11/01/2009
Initial hour               07:42
Number of users            15
Average of use             7 min
Final hour                 16:11

EVENT
INTERNATIONAL CALL
Date                       11/01/2009
Initial hour               09:14
Number of users            21
Average of use             5 min
Final hour                 16:17

EVENT
PRINTER USE
Date                       12/01/2009
Initial hour               07:30
Number of users            23
Average of pages printed   17
Final hour                 19:00

input2.txt (processed input.txt):

EVENT|INTERNET CONNECTION
Date|11/01/2009
Initial hour|07:30
Number of users|27
Average of use|32 min
Final hour|19:00
EVENT|LOCAL CALL
Date|11/01/2009
Initial hour|07:42
Number of users|15
Average of use|7 min
Final hour|16:11
EVENT|INTERNATIONAL CALL
Date|11/01/2009
Initial hour|09:14
Number of users|21
Average of use|5 min
Final hour|16:17
EVENT|PRINTER USE
Date|12/01/2009
Initial hour|07:30
Number of users|23
Average of pages printed|17
Final hour|19:00

input2.txt.(n) (separated records):

EVENT|INTERNET CONNECTION
Date|11/01/2009
Initial hour|07:30
Number of users|27
Average of use|32 min
Final hour|19:00

headers.txt (row and column headers):

Average of pages printed
Average of use
Date
EVENT
Final hour
INTERNATIONAL CALL
INTERNET CONNECTION
Initial hour
LOCAL CALL
Number of users
PRINTER USE

colheaders.txt (column headers):

Average of pages printed
Average of use
Date
Final hour
Initial hour
Number of users

rowheaders.txt (row headers):

INTERNATIONAL CALL
INTERNET CONNECTION
LOCAL CALL
PRINTER USE

output.txt (awk friendly table - tab-delimited):

EVENT   Average of pages printed        Average of use  Date    Final hour      Initial hour    Number of users
INTERNATIONAL CALL              5 min   11/01/2009      16:17   09:14   21
INTERNET CONNECTION             32 min  11/01/2009      19:00   07:30   27
LOCAL CALL              7 min   11/01/2009      16:11   07:42   15
PRINTER USE     17              12/01/2009      19:00   07:30   23

output2.txt (formatted output):

EVENT                 Date       Initial hour Number of users Average of use Average of pages printed  Final hour
INTERNATIONAL CALL    11/01/2009 09:14        21              5 min                                    16:17
INTERNET CONNECTION   11/01/2009 07:30        27              32 min                                   19:00
LOCAL CALL            11/01/2009 07:42        15              7 min                                    16:11
PRINTER USE           12/01/2009 07:30        23                             17                        19:00

Screen looks like this while running the script:

---
Row: INTERNATIONAL CALL
File: input2.txt.13
Average of pages printed
Average of use
Date
Final hour
Initial hour
Number of users
>> INTERNATIONAL CALL           5 min   11/01/2009      16:17   09:14   21
---
Row: INTERNET CONNECTION
File: input2.txt.1
Average of pages printed
Average of use
Date
Final hour
Initial hour
Number of users
>> INTERNET CONNECTION          32 min  11/01/2009      19:00   07:30   27
---
Row: LOCAL CALL
File: input2.txt.7
Average of pages printed
Average of use
Date
Final hour
Initial hour
Number of users
>> LOCAL CALL           7 min   11/01/2009      16:11   07:42   15
---
Row: PRINTER USE
File: input2.txt.19
Average of pages printed
Average of use
Date
Final hour
Initial hour
Number of users
>> PRINTER USE  17              12/01/2009      19:00   07:30   23

Hi again angheloko,

I notice the code splits input2.txt into 4 subfiles, because this input file has 4 EVENT blocks, but a real input file could have thousands of EVENT blocks. How can I avoid generating
those input2.txt.$X subfiles for a large input file?

Thanks for your help again.

And well, now this is what I get.

input.txt

 
EVENT
INTERNET CONNECTION
Date                       11/01/2009
Initial hour               07:30
Number of users            27
Average of use             32 min
Final hour                 19:00
EVENT
LOCAL CALL
Date                       11/01/2009
Initial hour               07:42
Number of users            15
Average of use             7 min
Final hour                 16:11
EVENT
INTERNATIONAL CALL
Date                       11/01/2009
Initial hour               09:14
Number of users            21
Average of use             5 min
Final hour                 16:17
EVENT
PRINTER USE
Date                       12/01/2009
Initial hour               07:30
Number of users            23
Average of pages printed   17
Final hour                 19:00

input2.txt

 
 
EVENT|INTERNET CONNECTION
Date|11/01/2009
Initial hour|07:30
Number of users|27
Average of use|32 min
Final hour|19:00
 
EVENT|LOCAL CALL
Date|11/01/2009
Initial hour|07:42
Number of users|15
Average of use|7 min
Final hour|16:11
 
EVENT|INTERNATIONAL CALL
Date|11/01/2009
Initial hour|09:14
Number of users|21
Average of use|5 min
Final hour|16:17
 
EVENT|PRINTER USE
Date|12/01/2009
Initial hour|07:30
Number of users|23
Average of pages printed|17
Final hour|19:00
 

headers.txt and rowheaders.txt with 0 KB (totally blank)

colheaders.txt

input.txt:EVENT
input.txt:INTERNET CONNECTION
input.txt:Date                       11/01/2009
input.txt:Initial hour               07:30
input.txt:Number of users            27
input.txt:Average of use             32 min
input.txt:Final hour                 19:00
input.txt:
input.txt:EVENT
input.txt:LOCAL CALL
input.txt:Date                       11/01/2009
input.txt:Initial hour               07:42
input.txt:Number of users            15
input.txt:Average of use             7 min
input.txt:Final hour                 16:11
input.txt:
input.txt:EVENT
input.txt:INTERNATIONAL CALL
input.txt:Date                       11/01/2009
input.txt:Initial hour               09:14
input.txt:Number of users            21
input.txt:Average of use             5 min
input.txt:Final hour                 16:17
input.txt:
input.txt:EVENT
input.txt:PRINTER USE
input.txt:Date                       12/01/2009
input.txt:Initial hour               07:30
input.txt:Number of users            23
input.txt:Average of pages printed   17
input.txt:Final hour                 19:00
input2.txt:
input2.txt:EVENT|INTERNET CONNECTION
input2.txt:Date|11/01/2009
input2.txt:Initial hour|07:30
input2.txt:Number of users|27
input2.txt:Average of use|32 min
input2.txt:Final hour|19:00
input2.txt:
input2.txt:
input2.txt:EVENT|LOCAL CALL
input2.txt:Date|11/01/2009
input2.txt:Initial hour|07:42
input2.txt:Number of users|15
input2.txt:Average of use|7 min
input2.txt:Final hour|16:11
input2.txt:
input2.txt:
input2.txt:EVENT|INTERNATIONAL CALL
input2.txt:Date|11/01/2009
input2.txt:Initial hour|09:14
input2.txt:Number of users|21
input2.txt:Average of use|5 min
input2.txt:Final hour|16:17
input2.txt:
input2.txt:
input2.txt:EVENT|PRINTER USE
input2.txt:Date|12/01/2009
input2.txt:Initial hour|07:30
input2.txt:Number of users|23
input2.txt:Average of pages printed|17
input2.txt:Final hour|19:00
input2.txt.10:EVENT|LOCAL CALL
input2.txt.10:Date|11/01/2009
input2.txt.10:Initial hour|07:42
input2.txt.10:Number of users|15
input2.txt.10:Average of use|7 min
input2.txt.10:Final hour|16:11
input2.txt.18:EVENT|INTERNATIONAL CALL
input2.txt.18:Date|11/01/2009
input2.txt.18:Initial hour|09:14
input2.txt.18:Number of users|21
input2.txt.18:Average of use|5 min
input2.txt.18:Final hour|16:17
input2.txt.2:EVENT|INTERNET CONNECTION
input2.txt.2:Date|11/01/2009
input2.txt.2:Initial hour|07:30
input2.txt.2:Number of users|27
input2.txt.2:Average of use|32 min
input2.txt.2:Final hour|19:00
input2.txt.26:EVENT|PRINTER USE
input2.txt.26:Date|12/01/2009
input2.txt.26:Initial hour|07:30
input2.txt.26:Number of users|23
input2.txt.26:Average of pages printed|17
input2.txt.26:Final hour|19:00

output.txt

EVENT     input.txt:EVENT
 input.txt:INTERNET CONNECTION
 input.txt:Date                       11/01/2009
 input.txt:Initial hour               07:30
 input.txt:Number of users            27
 input.txt:Average of use             32 min
 input.txt:Final hour                 19:00
 input.txt:
 input.txt:EVENT
 input.txt:LOCAL CALL
 input.txt:Date                       11/01/2009
 input.txt:Initial hour               07:42
 input.txt:Number of users            15
 input.txt:Average of use             7 min
 input.txt:Final hour                 16:11
 input.txt:
 input.txt:EVENT
 input.txt:INTERNATIONAL CALL
 input.txt:Date                       11/01/2009
 input.txt:Initial hour               09:14
 input.txt:Number of users            21
 input.txt:Average of use             5 min
 input.txt:Final hour                 16:17
 input.txt:
 input.txt:EVENT
 input.txt:PRINTER USE
 input.txt:Date                       12/01/2009
 input.txt:Initial hour               07:30
 input.txt:Number of users            23
 input.txt:Average of pages printed   17
 input.txt:Final hour                 19:00 input2.txt:
 input2.txt:EVENT|INTERNET CONNECTION
 input2.txt:Date|11/01/2009
 input2.txt:Initial hour|07:30
 input2.txt:Number of users|27
 input2.txt:Average of use|32 min
 input2.txt:Final hour|19:00
 input2.txt:
 input2.txt:
 input2.txt:EVENT|LOCAL CALL
 input2.txt:Date|11/01/2009
 input2.txt:Initial hour|07:42
 input2.txt:Number of users|15
 input2.txt:Average of use|7 min
 input2.txt:Final hour|16:11
 input2.txt:
 input2.txt:
 input2.txt:EVENT|INTERNATIONAL CALL
 input2.txt:Date|11/01/2009
 input2.txt:Initial hour|09:14
 input2.txt:Number of users|21
 input2.txt:Average of use|5 min
 input2.txt:Final hour|16:17
 input2.txt:
 input2.txt:
 input2.txt:EVENT|PRINTER USE
 input2.txt:Date|12/01/2009
 input2.txt:Initial hour|07:30
 input2.txt:Number of users|23
 input2.txt:Average of pages printed|17
 input2.txt:Final hour|19:00 input2.txt.10:EVENT|LOCAL CALL
 input2.txt.10:Date|11/01/2009
 input2.txt.10:Initial hour|07:42
 input2.txt.10:Number of users|15
 input2.txt.10:Average of use|7 min
 input2.txt.10:Final hour|16:11
 input2.txt.18:EVENT|INTERNATIONAL CALL
 input2.txt.18:Date|11/01/2009
 input2.txt.18:Initial hour|09:14
 input2.txt.18:Number of users|21
 input2.txt.18:Average of use|5 min
 input2.txt.18:Final hour|16:17
 input2.txt.2:EVENT|INTERNET CONNECTION
 input2.txt.2:Date|11/01/2009
 input2.txt.2:Initial hour|07:30
 input2.txt.2:Number of users|27
 input2.txt.2:Average of use|32 min
 input2.txt.2:Final hour|19:00
 input2.txt.26:EVENT|PRINTER USE
 input2.txt.26:Date|12/01/2009
 input2.txt.26:Initial hour|07:30
 input2.txt.26:Number of users|23
 input2.txt.26:Average of pages printed|17
 input2.txt.26:Final hour|19:00 

output2.txt

EVENT     input.txt:EVENT
input.txt:Initial hour               07:30
input.txt:Average of use             32 min
input.txt:Final hour                 19:00
input.txt:Date                       11/01/2009
input.txt:INTERNET CONNECTION
input.txt:Number of users            27

Hi cgkmal,

Like I said earlier, the solution was rushed, and I realize it won't be the perfect one. Anyway, the fault is in the creation of the headers, which is why the succeeding steps failed. Let me get back to you later with a better solution.

In the meantime, how about cherry's solution?

You do have perl in your machine, right?

Hello angheloko,

Well, it's OK. I'll wait, no problem; meanwhile, I'm the most interested party and would like to contribute ideas.

I was thinking of an algorithm, but I can't translate it into a shell script or awk;
I'm very new to awk and unix programming.

Something like.

1-) Put in column 2, on the same line, the word that is below "EVENT", for
example "LOCAL CALL", "PRINTER USE", etc.

2-) Get the unique values from column 1 and transpose them into column
headers, with the headers beginning at column 2 in the transposed
arrangement.

I've been taking my first steps at getting unique values, with some
tips from web examples of course.

 
#Extract first column
awk '{print $1}' input.txt > input1.txt
 
#Extract unique values in column counting frequency
sort input1.txt | uniq -c | sort -n > output.txt
 

3-) Loop over every block that begins with "EVENT" and
transpose the values in column 2, putting them below the
respective headers.

*The info for every block, written vertically before, would end up laid out horizontally.
*The existing relation between the column-1 and column-2 values on the same line
would become a relation between line 1 (the header line) and line X,
in the same column.

I hope the idea makes some sense.

As for your question about cherry's solution: I tried it and it fails for me, and I'm not sure why. I'm using UWIN,
a unix emulator for Windows. I've tried some basic perl examples (like "Hello world") and perl seems to be working and able to receive commands.

 
/c
$ print "Hello World.\n";
Hello World.
$

With summer_cherry's script I get the error:

$ cherry.pl
cherry.pl[1]: sub: not found [No such file or directory]
cherry.pl: syntax error at line 2: `(' unexpected
$ sub ?
-ksh: sub: not found [No such file or directory]
$ sub help
-ksh: sub: not found [No such file or directory]
$ man sub
man page for sub not found
$

Well, we'll see what happens. I'll keep trying over here; many thanks for your kind assistance so far.

Best regards.

Hi cg,

What you suggested is very close to my first algo, so we can go with that.

I just don't know why headers.txt wasn't formed as expected.

Anyway, please see the codes below, test them, and post the o/p (we may be getting different o/p):

This will get all the required headers

sed 's/   */|/g;/^ *$/d' input.txt | awk -F"|" ' { print $1 } ' | sort | uniq -ud

This should return the row headers (first column):

sed 's/   */|/g;/^ *$/d' input.txt | awk -F"|" ' { print $1 } ' | sort | uniq -ud | grep '[A-Z][A-Z][A-Z]*'

And finally, this should return the column headers:

sed 's/   */|/g;/^ *$/d' input.txt | awk -F"|" ' { print $1 } ' | sort | uniq -ud | grep -v '[A-Z][A-Z][A-Z]*'

These are my results:

input.txt:

EVENT
INTERNET CONNECTION
Date                       11/01/2009
Initial hour               07:30
Number of users            27
Average of use             32 min
Final hour                 19:00
EVENT
LOCAL CALL
Date                       11/01/2009
Initial hour               07:42
Number of users            15
Average of use             7 min
Final hour                 16:11
EVENT
INTERNATIONAL CALL
Date                       11/01/2009
Initial hour               09:14
Number of users            21
Average of use             5 min
Final hour                 16:17
EVENT
PRINTER USE
Date                       12/01/2009
Initial hour               07:30
Number of users            23
Average of pages printed   17
Final hour                 19:00

1st code o/p (get required headers):

Average of pages printed
Average of use
Date
EVENT
Final hour
INTERNATIONAL CALL
INTERNET CONNECTION
Initial hour
LOCAL CALL
Number of users
PRINTER USE

2nd code o/p (get row headers):

EVENT
INTERNATIONAL CALL
INTERNET CONNECTION
LOCAL CALL
PRINTER USE

3rd code o/p (get column headers):

Average of pages printed
Average of use
Date
Final hour
Initial hour
Number of users

Go try it and post your results. Then we can go from there.

Hi angheloko,

I couldn't reply earlier.

Well, with your new codes I get the same input.txt back at the end, whether I run them code by code or put all 3 together in a shell script.
The first 2 codes don't seem to output anything when I run them. The third one shows the same input.txt as output.

Below is what I get, step by step.

 
$ pwd
/c/Temp Folder
 
$ ls
anghel.sh   input.txt
 
$ sed 's/   */|/g;/^ *$/d' input.txt | awk -F"|" ' { print $1 } ' | sort | uniq -ud
 
sed: "input.txt", line 30: warning: newline appended
 
$ sed 's/   */|/g;/^ *$/d' input.txt | awk -F"|" ' { print $1 } ' | sort | uniq -ud | grep [A-Z][A-Z][A-Z]*
 
sed: "input.txt", line 30: warning: newline appended
 
$ sed 's/   */|/g;/^ *$/d' input.txt | awk -F"|" ' { print $1 } ' | sort | uniq -ud | grep -v [A-Z][A-Z][A-Z]*
 
sed: "input.txt", line 30: warning: newline appended
 
EVENT
INTERNET CONNECTION
Date                       11/01/2009
Initial hour               07:30
Number of users            27
Average of use             32 min
Final hour                 19:00
EVENT
LOCAL CALL
Date                       11/01/2009
Initial hour               07:42
Number of users            15
Average of use             7 min
Final hour                 16:11
EVENT
INTERNATIONAL CALL
Date                       11/01/2009
Initial hour               09:14
Number of users            21
Average of use             5 min
Final hour                 16:17
EVENT
PRINTER USE
Date                       12/01/2009
Initial hour               07:30
Number of users            23
Average of pages printed   17
Final hour                 19:00
$

I don't think it's behaving like on your machine; what could it be?

I'll keep watching what happens.

Hi cg,

Try the following instead:

awk ' BEGIN { FS="  " } { print $1 } ' foo | sort | sed '$!N; /^\(.*\)\n\1$/!P; D'

awk ' BEGIN { FS="  " } { print $1 } ' foo | sort | sed '$!N; /^\(.*\)\n\1$/!P; D' | grep ^[A-Z][A-Z]

awk ' BEGIN { FS="  " } { print $1 } ' foo | sort | sed '$!N; /^\(.*\)\n\1$/!P; D' | grep -v ^[A-Z][A-Z]

The reason we're doing this is to establish the required column and row headers and make it into an awk friendly table.
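The split itself is easy to check on a toy list: row headers are the ALL-CAPS event names (two capitals in a row), everything else is a column header. A small demo, with the patterns quoted so the shell cannot glob-expand them against filenames in the directory (an unquoted `[A-Z][A-Z][A-Z]*` that happens to match a filename would make grep read that file instead of its stdin, which may be what produced the `input.txt:`-prefixed lines earlier):

```shell
# Toy header list: labels and event names mixed together
printf '%s\n' 'EVENT' 'LOCAL CALL' 'Date' 'PRINTER USE' 'Initial hour' > hdr.txt

grep '^[A-Z][A-Z]' hdr.txt > row.txt      # event names: EVENT, LOCAL CALL, PRINTER USE
grep -v '^[A-Z][A-Z]' hdr.txt > col.txt   # column headers: Date, Initial hour
```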

Hi again angheloko,

I've tried your last 3 lines, and it looks like they're working on my PC now.

See my screen log when I ran the 3 codes one by one:

 
$ awk ' BEGIN { FS="  " } { print $1 } ' input.txt | sort | sed '$!N; /^\(.*\)\n\1$/!P; D'
 
Average of pages printed
Average of use
Date
EVENT
Final hour
INTERNATIONAL CALL
INTERNET CONNECTION
Initial hour
LOCAL CALL
Number of users
PRINTER USE
 
$ awk ' BEGIN { FS="  " } { print $1 } ' input.txt | sort | sed '$!N; /^\(.*\)\n\1$/!P; D' | grep ^[A-Z][A-Z]
 
EVENT
INTERNATIONAL CALL
INTERNET CONNECTION
LOCAL CALL
PRINTER USE
 
$ awk ' BEGIN { FS="  " } { print $1 } ' input.txt | sort | sed '$!N; /^\(.*\)\n\1$/!P; D' | grep -v ^[A-Z][A-Z]
 
Average of pages printed
Average of use
Date
Final hour
Initial hour
Number of users
$

As you can see, it works so far.

Many thanks for your help.:b:

Good! Now try this one and see if it works:

awk '
BEGIN { RS="EVENT"; FS="\n"; cols[0]="EVENT"; totalcols=0; rowno=0 } 
$2 != "" {
    vals[0] = $2
    for (i = 3; i <= NF; i++) {
    
        # Extract column name
        col = substr($i, 1, index($i, "  "))
        sub("^ *", "", col); sub(" *$", "", col)
    
        # See if column already existing
        found = 0
        for (colno = 0; colno <= totalcols; colno++) 
            if ( cols[colno] == col ) found = 1
        
        # If not, set position
        if ( found == 0 ) {
            totalcols++
            colno = totalcols
            cols[colno] = col
        }
        
        # Extract the value only
        val = substr($i, length(col) + 1)
        sub("^ *", "", val); sub(" *$", "", val)
        vals[colno] = val
    }
    
    for (i = 0; i <= totalcols; i++) {
        line[rowno] = line[rowno]","vals[i]
    }
    rowno++;
}
END { 
    for (i = 0; i <= totalcols; i++)
        header=header","cols[i]
    print header
    for (i = 0; i < rowno; i++)
        print line[i]
}
' input.txt | sed 's/^,//g'

angheloko,

Running your last code over the original input.txt, the result is:

EVENT,Initial hour,Number of users,Average of use,Final hour,,Date,Average of pages printed
INT
Date                       11/01/2009,07:30,27,32 min,19:00,
LOCAL CALL,07:30,27,32 min,19:00,,11/01/2009
INT,07:30,27,32 min,19:00,,11/01/2009
Date                       11/01/2009,07:30,27,32 min,19:00,,11/01/2009
PRINT,07:30,27,32 min,19:00,,11/01/2009
Date                       12/01/2009,07:30,27,32 min,19:00,,11/01/2009,17

I wasn't expecting that. Anyway, I changed it a little. Try it again.

awk '
BEGIN { RS="EVENT"; FS="\n"; cols[0]="EVENT"; totalcols=0; rowno=0 } 
$2 != "" {
	vals[0] = $2

	for (i = 3; i <= NF; i++) {
	
		# Extract column name
		col = substr($i, 1, index($i, "  "))
		sub("^ *", "", col); sub(" *$", "", col)
	
		# See if column already existing
		found = 0
		for (colno = 0; colno <= totalcols; colno++) 
			if ( cols[colno] == col ) found = 1
		
		# If not, set position
		if ( found == 0 && col != "" ) {
			totalcols++
			colno = totalcols
			cols[colno] = col
			print colno": "col
		}
		
		# Extract the value only
		val = substr($i, length(col) + 1)
		sub("^ *", "", val); sub(" *$", "", val)
		vals[colno] = val
	}
	
	for (i = 0; i <= totalcols; i++) {
		line[rowno] = line[rowno]","vals[i]
	}
	rowno++;
}
END { 
	for (i = 0; i <= totalcols; i++)
		header=header","cols[i]
	print header
	for (i = 0; i < rowno; i++)
		print line[i]
}
' input.txt | sed 's/^,//g'

input.txt is:

EVENT
INTERNET CONNECTION
Date                       11/01/2009
Initial hour               07:30
Number of users            27
Average of use             32 min
Final hour                 19:00
EVENT
LOCAL CALL
Date                       11/01/2009
Initial hour               07:42
Number of users            15
Average of use             7 min
Final hour                 16:11
EVENT
INTERNATIONAL CALL
Date                       11/01/2009
Initial hour               09:14
Number of users            21
Average of use             5 min
Final hour                 16:17
EVENT
PRINTER USE
Date                       12/01/2009
Initial hour               07:30
Number of users            23
Average of pages printed   17
Final hour                 19:00

O/P is:

EVENT,Date,Initial hour,Number of users,Average of use,Final hour,Average of pages printed
INTERNET CONNECTION,11/01/2009,07:30,27,32 min,19:00
LOCAL CALL,11/01/2009,07:30,27,32 min,19:00
INTERNATIONAL CALL,11/01/2009,07:30,27,32 min,19:00
PRINTER USE,11/01/2009,07:30,27,32 min,19:00,17
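The repeated `11/01/2009,07:30,27,32 min,19:00` values in the rows above point at the column-lookup loop: it sets `found`, but never records *which* index matched, so after the loop `colno` is always one past the last column and every later value lands in the same unused slot while the first block's values persist. A sketch of the problem and the fix (capture the index and break):

```shell
demo_out=$(awk '
BEGIN {
    cols[0] = "EVENT"; cols[1] = "Date"; cols[2] = "Initial hour"
    totalcols = 2; col = "Date"
    # buggy lookup: found is set, but colno runs past the last column
    for (colno = 0; colno <= totalcols; colno++)
        if (cols[colno] == col) found = 1
    print "buggy=" colno
    # fixed lookup: capture the matching index and stop scanning
    for (j = 0; j <= totalcols; j++)
        if (cols[j] == col) { colno = j; break }
    print "fixed=" colno
}')
echo "$demo_out"
```

With the fix, `vals[colno]` would be updated at the right position for every block.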

Well,

Now I receive this:

 
1: Initial hour
2: Number of users
3: Average of use
4: Final hour
5: Date
6: Average of pages printed
EVENT,Initial hour,Number of users,Average of use,Final hour,Date,Average of pages printed
07:30,27,32 min,19:00
11/01/2009
 
17

I'm confused why we get different results with the same code. :frowning:

Could it be something in the format of the two columns?

Well, thanks for your help:b:

It's the machine. If we had the same OS this would have been solved earlier :slight_smile:

Anyway, I made some changes again and got the desired results with this one. Give it a try and post the results.

awk '
BEGIN { RS="EVENT"; FS="\n"; cols[0]="EVENT"; totalcols=0; rowno=0 }

$2 != "" {
    line=line"EVENT  "$2
    for (i=3; i<NF; i++) {
        line=line","$i
        
        # Extract column name
        col = substr($i, 1, index($i, "  "))
        sub("^ *", "", col); sub(" *$", "", col)
    
        # See if the column already exists
        found = 0
        for (j = 0; j <= totalcols; j++) 
            if ( cols[j] == col ) found = 1
        
        # If not, set position
        if ( found == 0 && col != "" ) {
            totalcols++
            j = totalcols
            cols[j] = col
        }
    }
    line=line"|"
}

END {
    for (i = 0; i <= totalcols; i++)
        header=header","cols[i]
    print header
    
    # Split into records
    top=split(line, records, "|")
    for (i=1; i<top; i++) {
    
        # Split into fields
        top2=split(records[i], fields, ",")
        for (j=0; j<=totalcols; j++) {
            
            found=0
            for (k=1; k<=top2; k++) {
                
                # Extract column name
                col = substr(fields[k], 1, index(fields[k], "  "))
                sub("^ *", "", col); sub(" *$", "", col)
            
                #print ">"cols[j]": "fields[k]": "col
                if (cols[j] == col) {
                    # Extract the value only
                    val = substr(fields[k], length(col) + 1)
                    sub("^ *", "", val); sub(" *$", "", val)
                    row=row","val
                    found=1
                    #print "found: "val
                }
            }
            if (found==0) row=row","
        }
        print row
        row=""
    }
}
' input.txt | sed 's/^,//g' > input.txt.tmp

# Make it pretty - This is where you'll make adjustments for the output
awk '
BEGIN { FS="," }

{ printf ("%-22s%-11s%-13s%-16s%-15s%-11s%s\n", $1, $2, $3, $4, $5, $6, $7) }

' input.txt.tmp
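If the hard-coded widths in that printf ever stop lining up (say, a column outgrows its field), here's a sketch that measures each column from the data instead, by reading the temp file twice. /tmp/rows.csv is just a made-up sample standing in for input.txt.tmp:

```shell
# Made-up sample in the shape of input.txt.tmp
cat > /tmp/rows.csv <<'EOF'
EVENT,Date
INTERNET CONNECTION,11/01/2009
LOCAL CALL,16:11
EOF

awk '
BEGIN { FS="," }
NR == FNR {                          # pass 1: find the widest cell per column
    for (i = 1; i <= NF; i++)
        if (length($i) > w[i]) w[i] = length($i)
    next
}
{                                    # pass 2: print each cell left-aligned
    for (i = 1; i <= NF; i++)
        printf "%-" (w[i] + 2) "s", $i
    print ""
}' /tmp/rows.csv /tmp/rows.csv > /tmp/rows.txt
cat /tmp/rows.txt
```

Pass 1 records the widest cell in each column; pass 2 reprints every cell padded to that width plus two spaces, so the columns stay aligned no matter how many there are or how long the values get.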

input.txt: (same as in my previous post)

Hi angheloko,

I've tried it, and it looks better each time. This time I can see 3 things:

1- It's putting "Date" as a row header, and it has to be a column header.

2- The "words" of column 1 that will become row headers are being split; for example, "INTERNET CONNECTION"
in the output.txt only appears as "INT", and the same for the others.

3- The column headers are being joined with the next column header in some cases,
for example: "Initial hourNumber of usersAverage of use"

This is my new output this time:

 
EVENT                 Initial hourNumber of usersAverage of use  Final hour     Date       Average of pages printed
INT                                                                                     
Date                       11/01/200907:30      27           32 min          19:00                     
LOCAL CALL            07:42      15           7 min           16:11          11/01/2009 
INT                                                                                     
Date                       11/01/200909:14      21           5 min           16:17                     
PRINT                                                                                   
Date                       12/01/200907:30      23                           19:00                     17

A question:

In the code it looks variable, but if the lines within a block are more than 4 or 5, will the code process them, or is it taking
a fixed number of column headers and row headers?

I appreciate your help very much, angheloko, and I'm learning a little bit more each time.

Best regards.