Data manipulation, Please help..

liuzhencc · March 24, 2016, 6:13am

Hello,

I have a huge set of data that needs to be reformatted.
Here is a simple example to explain the process.
I have number n=5 and a input with many numbers separated with comma:

0.49876577,-0.03160753,0.60992502,0.00361167,0.01401017,0.54274066,-0.04881392,-0.01454749,0.02569629,0.04952051,-0.01458124,-0.14468019,0.13155655,0.01912278,0.15125587

after the process, the output looks like:

   1          0.49876577
   2         -0.03160753          0.60992502
   3          0.00361167          0.01401017          0.54274066
   4         -0.04881392         -0.01454749          0.02569629          0.04952051
   5         -0.01458124         -0.14468019          0.13155655          0.01912278
   5          0.15125587

That starts from 1 and end with 5 (max n) with maximum 4 numbers in one line. So, if the n >4, that means we need more lines to print all numbers relating to n. Here, 5 is 4+1; if n goes up to 9, that will be three lines with 4+ 4+1. all these three lines begin with number n and followed by the numbers from the input.

If n is small, we could do it simply by hand. However, if n goes up to 100, it's a nightmare to manipulate.

Thanks so much for your kind help!

Zhen

Here is a link for an input with n=45 and the corresponding output

http://s000.tinyupload.com/?file_id=83062136554540326051
http://s000.tinyupload.com/?file_id=26820665341591171842

RudiC · March 24, 2016, 6:34am

Any attempts/ideas/thoughts from your side?

liuzhencc · March 24, 2016, 6:38am

Hi RudiC, thanks for reply.
I think it could be done by a loop combining with awk.
However, I'm not a skillful user with awk.
I have tried for several hours without any progress. The difficulty comes from the limit of 4 numbers when n> 4. I cannot figure out any clue on this. Probably, FORTRAN or other advanced language is a better choice to do this.
Zhen

RavinderSingh13 · March 24, 2016, 6:49am

Hello liuzhencc,

Could you please try following and let me know if this helps you. Let's say we have following Input_file where I have added few lines to test it better.

 cat Input_file
 .49876577,-0.03160753,0.60992502,0.00361167,0.01401017,0.54274066,-0.04881392,-0.01454749,0.02569629,0.04952051,-0.01458124,-0.14468019,0.13155655,0.01912278,0.15125587,.49876577,-0.03160753,0.60992502,0.00361167

Then following is the code for same.

awk -vn=5 -F"," '{num=split($0, A,",");for(i=1;i<=num;i++){for(k=1;k<=i;k++){q++;Q=Q?Q OFS A[q]:A[q]};if(Q){i=i>n?n:i;print i OFS Q;Q=""}}}'   Input_file

Output will be as follows.

1 .49876577
2 -0.03160753 0.60992502
3 0.00361167 0.01401017 0.54274066
4 -0.04881392 -0.01454749 0.02569629 0.04952051
5 -0.01458124 -0.14468019 0.13155655 0.01912278 0.15125587
5 .49876577 -0.03160753 0.60992502 0.00361167

If above doesn't meet your requirements completely then please post more meaningful Input_file with more specific requirement details on same too.
EDIT: Adding a non-one liner form of solution as follows now.

awk -vn=5 -F"," '{
                        num=split($0, A,",");
                                                for(i=1;i<=num;i++){
                                                                        for(k=1;k<=i;k++){
                                                                                                q++;
                                                                                                Q=Q?Q OFS A[q]:A[q]
                                                                                         };
                                                                        if(Q)            {
                                                                                                i=i>n?n:i;
                                                                                                print i OFS Q;
                                                                                                Q=""
                                                                                         }
                                                                   }
                 }
                '   Input_file

Thanks,
R. Singh

liuzhencc · March 24, 2016, 6:57am

ravindersingh13:

Hello liuzhencc,

Could you please try following and let me know if this helps you.
awk -vn=5 -F"," '{num=split($0, A,",");for(i=1;i<=num;i++){for(k=1;k<=i;k++){q++;Q=Q?Q OFS A[q]:A[q]};if(Q){i=i>n?n:i;print i OFS Q;Q=""}}}'   Input_file
Output will be as follows.
1 .49876577
2 -0.03160753 0.60992502
3 0.00361167 0.01401017 0.54274066
4 -0.04881392 -0.01454749 0.02569629 0.04952051
5 -0.01458124 -0.14468019 0.13155655 0.01912278 0.15125587.49876577
5 -0.03160753 0.60992502 0.00361167
 
If above doesn't meet your requirements completely then please post more meaningful Input_file with more specific requirement details on same too.

Thanks,
R. Singh

Hi Singh, thanks for the reply.
It does the job partially because the maximum numbers in one line must be limited to 4. That means the n=1 with 1 number; n=2 with 2; n=3 with 3; n=4 with 4; when n > 4, there will be multiple line printed with 4 numbers in one line. For example, if n=6, that prints "6 x x x x";next line "6 x x".
I posted a large input and output here:
TinyUpload.com - best file hosting solution, with no limits, totaly free TinyUpload.com - best file hosting solution, with no limits, totaly free it's too many numbers to paste here.
Thanks again for your help!
ZHen

RudiC · March 24, 2016, 7:10am

I don't get the correlation between the n number, the field count per line, the number printed in front of every line, and the count of input elements. Please supply the logics.

liuzhencc · March 24, 2016, 7:44am

Sorry for my English.
The logic is not complex. if we have a number n, the we have 1+2+3+...+n numbers in the input file separated by comma. the output is formatted as follows if n =10,

1 num1
2 num2 num3
3 num4 num5 num6
4 num7 num8 num9 num10
5 num11 num12 num13 num14
5 num15
6 num16 num17 num18 num19
6 num20 num21
7 num22 num23 num24 num25
7 num26 num27 num28
8 num29 num30 num31 num32
8 num33 num34 num35 num36
9 num37 num38 num39 num40
9 num41 num42 num43 num44
9 num45
10 num46 num47 num48 num49
10 num50 num51 num52 num53
10 num54 num55

because we have 1+2+3+4+5+6+7+8+9+10 == 55 numbers in the input.

input if n=10,

0.49876577,-0.03160753,0.60992502,0.00361167,0.01401017,0.54274066,-0.04881392,-0.01454749,0.02569629,0.04952051,-0.01458124,-0.14468019,0.13155655,0.01912278,0.15125587,0.02470641,0.13032155,-0.21400060,-0.02522136,-0.14202082,0.22603974,-0.10693609,0.10741174,0.02171361,0.00169569,-0.00450758,-0.00200352,0.11393663,0.11590056,-0.26243613,-0.04396195,0.00711393,-0.01215909,-0.00447005,-0.12119063,0.27560924,0.02263298,-0.04249325,-0.05785220,-0.01299558,0.02103291,0.00718894,-0.02543994,0.04643664,0.05459977,-0.09308653,-0.04780408,-0.08850503,0.00170852,0.00263105,0.00352394,0.00753284,0.00506342,0.01332940,0.09615459

output if n =10

   1          0.49876577
   2         -0.03160753          0.60992502
   3          0.00361167          0.01401017          0.54274066
   4         -0.04881392         -0.01454749          0.02569629          0.04952051
   5         -0.01458124         -0.14468019          0.13155655          0.01912278
   5          0.15125587
   6          0.02470641          0.13032155         -0.21400060         -0.02522136
   6         -0.14202082          0.22603974
   7         -0.10693609          0.10741174          0.02171361          0.00169569
   7         -0.00450758         -0.00200352          0.11393663
   8          0.11590056         -0.26243613         -0.04396195          0.00711393
   8         -0.01215909         -0.00447005         -0.12119063          0.27560924
   9          0.02263298         -0.04249325         -0.05785220         -0.01299558
   9          0.02103291          0.00718894         -0.02543994          0.04643664
   9          0.05459977
  10         -0.09308653         -0.04780408         -0.08850503          0.00170852
  10          0.00263105          0.00352394          0.00753284          0.00506342
  10          0.01332940          0.09615459

RudiC · March 24, 2016, 8:22am

OK, so that means create 1 - n lines numbered 1 - n, of 1 - n elements, but break each line at 4 elements, and repeat the line number in front of each broken sub line?
What if the count of input elements is less or more than 55, in above case?

liuzhencc · March 24, 2016, 8:26am

It should be exact with 1+2+...+n, that says 55 for n=10 because it is generated by a quantum chemical calculation. If number of numbers are large than 55, then just neglect the numbers in the position beyond 55. Normally, again it should be exactly match. The hard problem for me is the break of each line at 4 when n > 4.
Thanks.

RudiC · March 24, 2016, 8:54am

Try

awk -F, -vn=10 '
        {for (i=1; i<=n; i++)   {printf "%s", i
                                 for (j=0; j<i; j++) printf "%s%+f", (j%4?"":j?ORS i:"") OFS, $(++c)
                                 printf RS
                                }
        }
' OFS="\t" file
1    +0.498766
2    -0.031608    +0.609925
3    +0.003612    +0.014010    +0.542741
4    -0.048814    -0.014547    +0.025696    +0.049521
5    -0.014581    -0.144680    +0.131557    +0.019123
5    +0.151256
6    +0.000000    +0.000000    +0.000000    +0.000000
6    +0.000000    +0.000000
7    +0.000000    +0.000000    +0.000000    +0.000000
7    +0.000000    +0.000000    +0.000000
8    +0.000000    +0.000000    +0.000000    +0.000000
8    +0.000000    +0.000000    +0.000000    +0.000000
9    +0.000000    +0.000000    +0.000000    +0.000000
9    +0.000000    +0.000000    +0.000000    +0.000000
9    +0.000000
10    +0.000000    +0.000000    +0.000000    +0.000000
10    +0.000000    +0.000000    +0.000000    +0.000000
10    +0.000000    +0.000000

liuzhencc · March 24, 2016, 9:00am

rudic:

Try

awk -F, -vn=10 '
   {for (i=1; i<=n; i++)   {printf "%s", i
   for (j=0; j<i; j++) printf "%s%+f", (j%4?"":j?ORS i:"") OFS, $(++c)
   printf RS
   }
   }
' OFS="\t" file
1    +0.498766
2    -0.031608    +0.609925
3    +0.003612    +0.014010    +0.542741
4    -0.048814    -0.014547    +0.025696    +0.049521
5    -0.014581    -0.144680    +0.131557    +0.019123
5    +0.151256

MAGIC!!!

(j%4?"":j?ORS i:"") OFS, $(++c)

This is the most difficult Part I never understand.. But it does the trick! Thanks so much for your help!
ZHen

RudiC · March 24, 2016, 9:26am

Try also this simplification:

awk -F, -vn=10 '
        {for (i=1; i<=n; i++)
           for (j=0; j<i; j++) printf "%s%+f%s", (j%4?"":(j?ORS:"") i) OFS, $(++c), (i-j-1)?"":ORS
        }
' OFS="\t" file

It's making intensive use of the so called "conditional operator" expr?expr:expr that, if the first expr evaluates to true, supplies the second's, else the third's result.