**HELP** need to split this line faster than cut-command

Hi,

A datafile containing lines such as below needs to be split:

500000000000932491683600000000000000000000000000016800000GS0000000000932491683600*HOME

I need to extract characters 2-5, 11-20, and 35-40 from each line, which I can do with the cut command:

cut -c 2-5 file > temp1.txt
cut -c 11-20 file > temp2.txt
cut -c 35-40 file > temp3.txt
paste -d"," temp1.txt temp2.txt temp3.txt > result.txt

The problem is, with the huge amount of data, the process is too slow. Can this be done faster via awk or sed command? Hope you can teach me.

Thanks!!!

Do you have GNU awk on your box?

Hi,
Try this:

cat temp | awk '{print substr($1,2,5) ,"," substr($1,11,20),"," substr($1,35,40)}'
awk '{print substr($1,2,5) ,"," substr($1,11,20),"," substr($1,35,40)}' temp

The following code works in bash too and is pretty efficient.

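#!/bin/ksh
# ${var:offset:length} needs ksh93 or bash; offsets start at 0.
# IFS= and -r keep leading blanks and backslashes intact.
while IFS= read -r line; do
  echo "${line:1:4},${line:10:10},${line:34:6}"
done < file > result.txt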

If you need even more performance then perhaps an awk script would be even faster.

If you have GNU awk, it has a very efficient builtin for splitting fixed-length records. Try this:

BEGIN{
  # Field widths cover columns 1, 2-5, 6-10, 11-20, 21-34, 35-40
  FIELDWIDTHS="1 4 5 10 14 6"
  OFS=","
}
{
  print $2, $4, $6
}

With this test file:

$ cat f
000000000111111111122222222223333333333444444444455555555556
123456789012345678901234567890123456789012345678901234567890
500000000000932491683600000000000000000000000000016800000GS0000000000932491683600*HOME

The awk code returns:

$ awk -f split.awk f
0000,1111111112,333334
2345,1234567890,567890
0000,0093249168,000000

If you don't have gawk, use substr() as in the post above, but it will be less efficient than the FIELDWIDTHS builtin.

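The substr() calls above aren't right: the third argument of substr() is a length, not an end position, and the commas between print arguments add extra spaces to the output. IMO they should be like this: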

awk '{print substr($1,2,4)","substr($1,11,10)","substr($1,35,6)}' file > result.txt
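As a quick sanity check, here is a small sketch that runs that substr() form on the sample record from the first post. It also shows a one-pass cut for comparison, assuming GNU cut is installed (--output-delimiter is a GNU extension, so this part may not work on other systems). The variable names are just for illustration.

```shell
# Sample record from the first post.
line='500000000000932491683600000000000000000000000000016800000GS0000000000932491683600*HOME'

# substr(s, start, length): characters 2-5, 11-20 and 35-40.
awk_out=$(printf '%s\n' "$line" |
  awk '{print substr($0,2,4) "," substr($0,11,10) "," substr($0,35,6)}')

# GNU cut can pick all three ranges in a single pass, with no temp files.
cut_out=$(printf '%s\n' "$line" |
  cut -c 2-5,11-20,35-40 --output-delimiter=,)

echo "awk: $awk_out"
echo "cut: $cut_out"
```

Both commands should print the same comma-separated line. If the one-pass cut is available, it may already be fast enough, since the original script read the file three times and then ran paste on top.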

To illustrate the power of FIELDWIDTHS, I ran a benchmark on a test file with 1,192,310 lines.

gawk with substr():         0m5.071s
original awk with substr(): 0m4.738s
gawk with FIELDWIDTHS:      0m1.393s
mawk with substr():         0m0.871s
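If you want to try a comparison like this on your own box, a rough sketch follows. It is not the exact setup used for the numbers above: the test file is just the sample record repeated, the line count of 100000 is arbitrary, and it assumes bash or ksh, where time is a shell keyword. Interpreters that are not installed are skipped.

```shell
# Build a test file by repeating the sample record (line count is arbitrary).
line='500000000000932491683600000000000000000000000000016800000GS0000000000932491683600*HOME'
testfile=$(mktemp)
awk -v rec="$line" 'BEGIN { for (i = 0; i < 100000; i++) print rec }' > "$testfile"
lines=$(wc -l < "$testfile")

# Same substr() program for every interpreter, so only the interpreter varies.
prog='{print substr($0,2,4) "," substr($0,11,10) "," substr($0,35,6)}'

for interp in awk gawk mawk; do
  if command -v "$interp" >/dev/null 2>&1; then
    echo "== $interp =="
    time "$interp" "$prog" "$testfile" > /dev/null
  fi
done

rm -f "$testfile"
```

Run it a couple of times before trusting the numbers, since the first pass also pays for reading the file from disk.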

Boy that mawk thing is fast!

Wow, thank you so much for your quick replies! I was able to shorten my shell script (removing the cut commands) by using the following suggestion, and it worked!

awk '{print substr($1,12,11)","substr($1,42,11)","substr($2,2,14)}' file

Thank you very much!

Btw, what's that mawk command? I can't find it on the man pages of my server (I guess it's not installed).

mawk is an optimized interpreter for the awk language.
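Since it may not be installed everywhere, a script can check for it and fall back to whatever awk is there. A minimal sketch, assuming a POSIX shell; the AWK variable name and the order of preference are only a suggestion based on the timings in this thread:

```shell
# Prefer mawk, then gawk, and fall back to the system awk.
AWK=awk
for candidate in mawk gawk; do
  if command -v "$candidate" >/dev/null 2>&1; then
    AWK=$candidate
    break
  fi
done
echo "using: $AWK"

# The substr() form runs unchanged on any of them.
line='500000000000932491683600000000000000000000000000016800000GS0000000000932491683600*HOME'
out=$(printf '%s\n' "$line" |
  "$AWK" '{print substr($0,2,4) "," substr($0,11,10) "," substr($0,35,6)}')
echo "$out"
```

Note that FIELDWIDTHS is a gawk feature, so a script meant to run under any awk should stick to substr().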

mawk - pattern scanning and text processing language
Don't MAWK AWK - the fastest and most elegant big data munging language! - Brendan O'Connor's Blog