Transpose from row to column using timestamp in first column

jiam912 · August 7, 2018, 4:09pm

Gents,

Transpose from row to column, taking in consideration the first column, which contends the date.

Input file

72918,111000009,111000009,111000009,111000009,111000009,111000009,111000009,111000009,111000009
72918,2356,2357,2358,2359,2360,2361,2362,2363,2364
72918,0,0,0,0,0,0,0,0,0
72918,0,0,0,0,0,0,1,0,0
72918,1496,1502,1752,1752,1752,1752,1751,974,972
73018,111000004,111000005,111000003,111000002,111000009,111000009,111000009,111000009,111000009
73018,2349,2350,2351,2352,2353,2354,2355,2356,2357
73018,0,0,0,0,0,0,0,0,0
73018,0,0,0,0,0,0,0,0,0
73018,1524,1526,1752,1752,1752,1752,1752,256,250

Output desired

72918,111000009,2356,0,0,1496
72918,111000009,2357,0,0,1502
72918,111000009,2358,0,0,1752
72918,111000009,2359,0,0,1752
72918,111000009,2360,0,0,1752
72918,111000009,2361,0,0,1752
72918,111000009,2362,0,1,1751
72918,111000009,2363,0,0,974
72918,111000009,2364,0,0,972
73018,111000004,2349,0,0,1524
73018,111000005,2350,0,0,1526
73018,111000003,2351,0,0,1752
73018,111000002,2352,0,0,1752
73018,111000009,2353,0,0,1752
73018,111000009,2354,0,0,1752
73018,111000009,2355,0,0,1752
73018,111000009,2356,0,0,256
73018,111000009,2357,0,0,250

With this code i get the output desired, but the number of fields in the input need to the same within each block

awk -F, '
 func init_block() {ts=$1;delete a;cnt=0;nf0=NF}
 func dump_block() {for(f=2;f<=nf0;f+=1){printf("%s",ts);for(r=1;r<=cnt;r+=1){printf(",%s",a[r,f])};print ""}}
 BEGIN{ts=-1}
 ts<0{init_block()}
 ts!=$1{dump_block();init_block()}
 {cnt+=1;for(f=1; f<=NF; f++) a[cnt,f]=$f}
 END{dump_block()}' <input.txt >output.txt

Is there other option code to get the results even if the fields are not the same?

Thanks in advance

RavinderSingh13 · August 7, 2018, 10:02pm

Hello jiam912,

Could you please try following and let me know if this helps.

awk -F, '
$0~/^[0-9]+,111000009,111000009,111000009,111000009,111000009,111000009,111000009,111000009,111000009/ && array[1]{
   for(i=2;i<=NF;i++){
     print array[1],array
     delete array
   }
   delete array[1]
}
{
   if($0 ~ /^[0-9]+,111000009,111000009,111000009,111000009,111000009,111000009,111000009,111000009,111000009/){
     array[1]=$1
   }
   for(i=2;i<=NF;i++){
     array=array?array OFS $i:$i
   }
}
END{
   if(array[1]){
     for(i=2;i<=NF;i++){
        print array[1],array
        delete array
     }
   }
}' OFS=,   Input_file

Thanks,
R. Singh

Don_Cragun · August 8, 2018, 2:38am

Hi jiam912,
The problem with giving us sample input with no specification of the format of that input is that we have to make assumptions about what the input format is. Ravinder's code assumes that the start of a block of input is delimited by a line with a field that "contends" a date followed by nine fields that contain the string 111000009 . And, from your sample data, that seems to fit your requirements perfectly.

If I was looking at your data, I might make the assumption that your input format contains groups of five lines. That also seems to fit your requirements perfectly given your sample data.

You say that you want code that works if the number of input fields is not a constant. But you don't specify what that means! Does the number of input fields vary in different sets of five lines while being constant within a set of five lines? Does the number of fields vary on each input line? (And, if so, what values are to be printed for fields that are not specified in the input?)

Did Ravinder guess correctly at how the first line of a group is identified? Or, are there a fixed number of lines in a group? If Ravinder didn't guess correctly but there are varying numbers of lines in a group, how are we supposed to determine which line starts a group?

jiam912 · August 8, 2018, 5:18pm

Hi Don.
Sorry for the confusion, I have changed a little the examples.
The fields after the date will be not always 9, in the example are 9 fields but can be more.
The code of Ravinders works perfectly with the example before, but not if the values change in the first line when the date changes. As i explain the value 111000009 is not constant.
Appreciate your help.
Please check the new input file
Many thanks.

Don_Cragun · August 8, 2018, 11:54pm

In post #3, I asked you six questions. You have now answered one of them:

you answered "No" to that question.

Please give us answers to the other questions so we have a chance of helping you figure out what needs to be done to reach your goal.

jiam912 · August 9, 2018, 4:32pm

Hi Don

Appreciate your answer.

I have attached 2 files with more data to be more clear.
File tmp1.txt ( input file )
File tmp2.txt ( desired output) -

7 fields separated by ,

As you can see in the input file the date block change every each 6 lines (fixed number of lines in a group) and the files after column1 (date) are not the same allways. So the pursose is to traspose the colums after date example in date 061218 after date, are 14 fields , then should have 14 lines like in the otput file attached.

But can be cases where the date does not change after 6 lines, can be each 12 like the example on date 061318 in that case the values of the second part lines 7-12 need to be added to the END ond the lines 1-6 . To get correctly in the output file.

My script works fine if the input has only fixed number of lines ( 6 for each block ), but if there is case as the date 061318 it does not work.

Hope you can help me to figer out this case.

Appreciate your help.

Don_Cragun · August 10, 2018, 11:04pm

Hi jiam912,
In your original sample input, there were groups of 5 lines (10 lines total in the sample). In your latest sample (attached to post #6), you say that there are groups of 6 lines (30 lines total in the sample).

I asked if there are always a constant number of fields in a group of lines. You have not answered that question.

I asked how we are supposed to determine which line is a first line of a group if there can be a different number of lines in a group. You have not answered that question. And, as pointed out in the first paragraph above, the number of lines in a group in your samples is not a constant. Without a clear answer to how you determine that a line is the first line of a group, I have no idea how we can help you solve your problem!

I asked whether or not the number of fields within a group of lines is a constant. You have not answered that question. The number is a constant in your samples, but without a clear specification of your input file format, I don't want to make any wild assumptions that will invalidate any suggestions we might provide. If the number of fields within a group is lines is not a constant, I asked what default values should be provided for fields that are not supplied in a group. You have not answered that question.

If you won't give us a clear specification of your input file format, don't expect us to waste any more time guessing at code that will likely be invalidated by your next post with a new sample input file format.

jiam912 · August 15, 2018, 5:27am

Hi Don

I changed to 6 lines the group, because my input sample changed a little. But always it will fixed to 6 lines the block.

The number of fields a group of lines is a constant. NO. it will be different in some cases.

As i say before the changes can be 6 or maximun 12 when the date does not change after 6 lines.

Sorry if dont aswer exactly your questions.

I figert out the issue using the 2 codes below..

gawk '
BEGIN{
   FS=OFS=","
}
prev!=$1 && prev{
   for(i=1;i<=count;i++){
     print prev,a[prev,i]
   }
   prev=count=""
}
{
   prev=$1
   sub(/[^,]*,/,"")
   if(count==6){
     count=1
   }
   else{
     count++
   }
   a[prev,count]=a[prev,count]?a[prev,count] OFS $0:$0
}
END{
   if(prev){
     for(i=1;i<=count;i++){
        print prev,a[prev,i]
     }
   }
}'  tmp1.txt > tmp1

gawk -F, '
 func init_block() {ts=$1;delete a;cnt=0;nf0=NF}
 func dump_block() {for(f=2;f<=nf0;f+=1){printf("%s",ts);for(r=1;r<=cnt;r+=1){printf(",%s",a[r,f])};print ""}}
 BEGIN{ts=-1}
 ts<0{init_block()}
 ts!=$1{dump_block();init_block()}
 {cnt+=1;for(f=1; f<=NF; f++) a[cnt,f]=$f}
 END{dump_block()}' < tmp1 > tmp.txt

Using this codes i got exactly the desired output.

I believe you can do the same output using a shot way code.

Thanks for your help.

Don_Cragun · August 16, 2018, 8:27am

Since you still haven't answered several of my questions, I hate to make wild guesses at what the format of your input files really is (and the following code absolutely will not work with the sample data you provided in post #1 in this thread), but as long as each set of lines contains 6 lines, every line in each 6 line set contains the same number of fields, and there are no empty input fields or missing lines in a 6 line set (as in the file tmp1.txt that you attached to post #6); the following produces the same output as was in the file tmp2.txt that you attached to that same post and seems to me to be easier to understand:

awk '
BEGIN {	FS = OFS = ","
}
{	for(i = 2; i <= NF; i++)
		d[NR % 6, i] = $i
}
!(NR % 6) {
	for(i = 2; i <= NF; i++)
		print $1, d[1, i], d[2, i], d[3, i], d[4, i], d[5, i], d[0, i]
}' tmp1.txt

This was tested with awk on macOS High Sierra version 10.13.6, but should also work with a current gawk on a Linux system. It will also work with /usr/xpg4/bin/awk or nawk on Solaris/SunOS systems.

jiam912 · August 17, 2018, 1:36am

Hi Don,

Your code works perfectly, kindly can you explain a little how it works.

Many thanks

RavinderSingh13 · August 17, 2018, 2:03am

Hello jiam912,

Could you please go through following and let us know if this helps you?

awk '                    ##Starting awk code from here.
BEGIN {	FS = OFS = ","   ##Starting BEGIN section of awk where setting FS and OFS value to comma for the Input_file.
}
{	for(i = 2; i <= NF; i++)  ##Starting a for loop which starts from i=2 value to till value of NF.
		d[NR % 6, i] = $i ##Creating an array d whose index is NR%6 means will be 0 on every line which is dividable with 6 and value of variable i as second index.
}
!(NR % 6) {                       ##Checking condition here if current line value if fully dividable with 6.
	for(i = 2; i <= NF; i++)  ##Starting a for loop again here from value of i from 2 to till value of NF in line.
		print $1, d[1, i], d[2, i], d[3, i], d[4, i], d[5, i], d[0, i]  ##Printing values of array d with indexes from 1 to 6 and i value as per loop.
}' tmp1.txt                       ##mentioning Input_file name here.

Thanks,
R. Singh

Don_Cragun · August 17, 2018, 3:08am

Hi jiam912,
I think Ravinder has accurately described how each line in my script works. The slightly higher level overview is that the first loop gathers data from each line in your file as it is read and saves that data into the d[] array.

The second loop runs every time six lines of data have been accumulated into the d[] array and prints the results you want from those six lines while transposing the data it found in those six lines.

Without saving or checking the data in the 1st field on each input line, it makes the wild assumption that the 1st field on each line will have the same value. Therefore, it doesn't need to save that value and can just use the value in the 1st field of the sixth line of each set as the value to be printed in the 1st field of each output line for that set.

jiam912 · August 17, 2018, 8:13am

Hi Don and R. Singh

Many thanks for the explanation. Highly Appreciated.