Help with datafile parsing and creating spreadsheet

I have a datafile containing data in the following format

name1,employee_number1,cell1,home1,fax1
name2,employee_number2,cell2,home2,fax2
name3,employee_number3,cell3,home3,fax3
name4,employee_number4,cell4,home4,fax4
name5,employee_number5,cell5,home5,fax5
...
...
....

I would like to convert this file into a columnar format, so data for each employee shows up as a single column in excel. Something like the following :

name1,name2,name3,name4,name5 .........
<blank_line1>
<blank_line2>
<blank_line3>
<blank_line4>
<blank_line5>
employee_number1, employee_number2, employee_number3, employee_number4, employee_number5.....
cell1,cell2,cell3,cell4,cell5....
home1,home2,home3,home4,home5.......
fax1,fax2,fax3,fax4,fax5.....

thanks as always!

This is the transposition of matrix. Check the post of acp, an awesome script there.
www.unix.com/shell-programming-scripting/18897-row-column-transpose.html

1 Like

That was really helpful, however my requirement is slightly more complex :

Input Data file :

TEMP, STUDENT1 COURSE1,COURSE1_DETAILS,COURSE1_SCORE
TEMP, STUDENT1 COURSE1,COURSE1_MORE_DETAILS,COURSE1_COMMENTS
TEMP, STUDENT1 COURSE1,COURSE1_MORE_DETAILS1,COURSE1_COMMENTS1
TEMP, STUDENT1 COURSE1,COURSE1_MORE_DETAILS2,COURSE1_COMMENTS2
TEMP, STUDENT1, COURSE2,COURSE2_DETAILS,COURSE2_SCORE
TEMP, STUDENT1 COURSE2,COURSE2_MORE_DETAILS,COURSE2_COMMENTS
TEMP, STUDENT1 COURSE2,COURSE2_MORE_DETAILS1,COURSE2_COMMENTS1
TEMP, STUDENT1 COURSE2,COURSE2_MORE_DETAILS2,COURSE_COMMENTS2
..
..

TEMP, STUDENT2, COURSE1,COURSE1_DETAILS,COURSE1_SCORE
TEMP, STUDENT2, COURSE1,COURSE1_MORE_DETAILS,COURSE1_COMMENTS
..
TEMP, STUDENT2, COURSE2,COURSE2_DETAILS,COURSE2_SCORE
..
TEMP, STUDENT2, COURSE3,COURSE3_DETAILS,COURSE3_SCORE
..
..
PERM, STUDENT1 COURSE1,COURSE1_DETAILS,COURSE1_SCORE
PERM, STUDENT1 COURSE1,COURSE1_MORE_DETAILS,COURSE1_COMMENTS
PERM, STUDENT1 COURSE1,COURSE1_MORE_DETAILS1,COURSE1_COMMENTS1
PERM, STUDENT1 COURSE1,COURSE1_MORE_DETAILS2,COURSE1_COMMENTS2
PERM, STUDENT1, COURSE2,COURSE2_DETAILS,COURSE2_SCORE
PERM, STUDENT1 COURSE2,COURSE2_MORE_DETAILS,COURSE2_COMMENTS
PERM, STUDENT1 COURSE2,COURSE2_MORE_DETAILS1,COURSE2_COMMENTS1
PERM, STUDENT1 COURSE2,COURSE2_MORE_DETAILS2,COURSE_COMMENTS2
..
..

PERM, STUDENT2, COURSE1,COURSE1_DETAILS,COURSE1_SCORE
PERM, STUDENT2, COURSE1,COURSE1_MORE_DETAILS,COURSE1_COMMENTS
..
PERM, STUDENT2, COURSE2,COURSE2_DETAILS,COURSE2_SCORE
..
PERM, STUDENT2, COURSE3,COURSE3_DETAILS,COURSE3_SCORE
..
..

So there are 2 types of students, TEMP and PERM.
Each Student will have a variable number of courses. For each course, there are anywhere between 2-4 rows in the input file

I would like to create two output files one for TEMP students and another one for PERM students, with data for each student and all his/her various course details in a columnar format

  STUDENT1, STUDENT2, STUDENT3 ......STUDENTn
  COURSE1:COURSE1_DETAILS, COURSE1:COURSE1_DETAILS....
  COURSE1:COURSE1_MORE_DETAILS, COURSE1:COURSE1_MORE_DETAILS ...
  ..
  COURSE2:COURSE2_DETAILS, COURSE2:COURSE2_DETAILS....
  COURSE2:COURSE2_MORE_DETAILS, COURSE2:COURSE2_MORE_DETAILS ...
  ..
  COURSEn:COURSEn_DETAILS, COURSEn:COURSEn_DETAILS....
  COURSEn:COURSEn_MORE_DETAILS, COURSEn:COURSEn_MORE_DETAILS ...

So the goal is to have one column in Excel for each student, with details about each of his courses listed on various rows for that particular student's column in the spreadsheet

Then this is not a story to transpose the matrix and more complicated as the course number for the students may be different. And your output wants to combine every two original cols tagged with the course name.

TEMP, STUDENT1 A,something,Morething
TEMP, STUDENT1 B,anything,Morestuff
TEMP, STUDENT1 C,goodthing,Morebla
TEMP, STUDENT1 D,goodthing,Morebla2
TEMP, STUDENT2 A,something,Morething
TEMP, STUDENT2 B,anything,Morestuff
TEMP, STUDENT2 D,thing2,Morebla2

It seems to me you transfer the normalised table to an irregular one. Right?

      STUDENT1                              STUDENT2
A:something   A:Morething              A:something   A:Morething
B:anything    B:Morestuff              B:anything    B:Morestuff
C:goodthing   C:Morebla                D:thing2      D:Morebla2
D:goodthing2  D:Morebla2 

Not a good way for your next processing, if you have. Probably you need think over what you really want.

First of all, appreciate your time in helping me out.

Apologies for not making it clear. Let me start all over again and to make it easy, lets make prune some additional details and lay down some hard restrictions :

  1. Let's assume there is no more the student_type (TEMP or PERM) data point
  2. Let's assume each student takes precisely 8 courses (which have same course names) and for each course there are there are 6 variable pieces of information separated by commas on a single row. So this means for each student, there will be precisely 8 rows in the input file, containing student_name, course_name followed by 6 pieces of variable information for each student.
STUDENT1, COURSE1, COURSE1_VAR1, COURSE1_VAR2,..COURSE1_VAR6
STUDENT1, COURSE2, COURSE2_VAR1, COURSE2_VAR2,..COURSE2_VAR6
..
..
STUDENT1, COURSE8, COURSE8_VAR1, COURSE8_VAR2,..COURSE8_VAR6
..
..
STUDENTn, COURSE1,COURSE1_VAR1, COURSE1_VAR2,..COURSE1_VAR6
STUDENTn, COURSE2,COURSE2_VAR1, COURSE2_VAR2,..COURSE2_VAR6
..
..
STUDENTn, COURSE8,COURSE8_VAR1, COURSE8_VAR2,..COURSE8_VAR6

In the output, ALL data for each student should be represented in a single column in Excel.

STUDENT1, STUDENT2, STUDENT3 ......STUDENTn
COURSE1_VAR1, COURSE1_VAR1,.. COURSE1_VAR1..
COURSE1_VAR2, COURSE1_VAR2,.. COURSE1_VAR2...
..
COURSE1_VAR6, COURSE1_VAR6,.. COURSE1_VAR6
COURSE2_VAR1, COURSE2_VAR1,.. COURSE2_VAR1..
COURSE2_VAR2, COURSE2_VAR2,.. COURSE2_VAR2...
..
COURSE2_VAR6, COURSE2_VAR6,.. COURSE2_VAR6
..
COURSE8_VAR1, COURSE8_VAR1,.. COURSE8_VAR1..
COURSE8_VAR2, COURSE8_VAR2,.. COURSE8_VAR2...
..
COURSE8_VAR6, COURSE8_VAR6,.. COURSE8_VAR6

So as you can see the COURSE1 thru COURSE8 have now become the row labels and STUDENT names have become the column labels. So it is kind of transpose of matrix, in some ways.

Once again I appreciate your time!

Good! It's clearer than last time, and the situation is simpler than what I thought .For simplicity I changed the description to string. e.g. CS111 = Student 1 + Course1 + variable1 and so on.

ST1,CS1,CS111,CS112,CS113,CS114,CS115,CS116
ST1,CS2,CS121,CS122,CS123,CS124,CS125,CS126
...
ST4,CS7,CS471,CS472,CS473,CS474,CS475,CS476
ST4,CS8,CS481,CS482,CS483,CS484,CS485,CS486

By copying radoulov's code

awk -F"," 'END { for (A in _) print A"," _[A] }
{ _[$1] = $1 in _ ? _[$1] FS $0 : $0 }' input.data > tmp.data

Then by acp's code as transpose.awk:

awk -f transpose.awk tmp.data > output.data
    ST1    ST2    ST3    ST4    
    ST1    ST2    ST3    ST4    
    CS1    CS1    CS1    CS1    
    CS111    CS211    CS311    CS411    
    CS112    CS212    CS312    CS412    
    ......   
    CS116    CS216    CS316    CS416    
    ST1    ST2    ST3    ST4    
    CS2    CS2    CS2    CS2    
    CS121    CS221    CS321    CS421  
......  

Left the headers for each course and student for clarity, which can be removed easily in Excel.

1 Like

simply awesome!! you-da-man! works like a charm.

Many thanks for your time and help.