HELP need to split this line faster than cut-command

daytripper1021 · October 29, 2009, 2:24am

Hi,

A datafile containing lines such as below needs to be split:

500000000000932491683600000000000000000000000000016800000GS0000000000932491683600*HOME

I need to extract characters 2-5, 11-20, and 35-40 of each line, which I can do with the cut command:

cut -c 2-5 file > temp1.txt
cut -c 11-20 file > temp2.txt
cut -c 35-40 file > temp3.txt
paste -d"," temp1.txt temp2.txt temp3.txt > result.txt

The problem is, with the huge amount of data, the process is too slow. Can this be done faster via awk or sed command? Hope you can teach me.

Thanks!!!

ripat · October 29, 2009, 2:40am

Do you have GNU awk on your box?

pravin27 · October 29, 2009, 2:42am

hi ,
try this...

cat temp | awk '{print substr($1,2,5) ,"," substr($1,11,20),"," substr($1,35,40)}'

ghostdog74 · October 29, 2009, 2:45am

awk '{print substr($1,2,5) ,"," substr($1,11,20),"," substr($1,35,40)}' temp

Scrutinizer · October 29, 2009, 2:58am

The following code works in bash too and is pretty efficient.

#!/bin/ksh
# IFS= and -r keep leading whitespace and backslashes in each record intact
while IFS= read -r line; do
  echo "${line:1:4},${line:10:10},${line:34:6}"
done < file > result.txt

If you need even more performance then perhaps an awk script would be even faster.

ripat · October 29, 2009, 3:01am

If you have GNU awk, it has a very efficient builtin for splitting fixed-length records. Try this:

BEGIN{
  FIELDWIDTHS="1 5 4 10 14 6 1"
}
{
  print $2, $4, $6
}

With this test file:

$ cat f
000000000111111111122222222223333333333444444444455555555556
123456789012345678901234567890123456789012345678901234567890
500000000000932491683600000000000000000000000000016800000GS0000000000932491683600*HOME

The awk code returns:

$ awk -f split.awk f
00000 1111111112 333334
23456 1234567890 567890
00000 0093249168 000000

If you don't have gawk, use substr() as in the posts above, though it will be less efficient than the FIELDWIDTHS builtin.
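For reference, here is a minimal sketch of a FIELDWIDTHS variant that lines up with the ranges in the original question (2-5, 11-20, 35-40) and with the comma-separated output of the paste command. The widths "1 4 5 10 14 6" and the OFS setting are my own adjustments, not the ones from the script above, and the variable names are only illustrative. It assumes gawk is installed and skips the run otherwise.

```shell
# Ruler line: the character at position N is the last digit of N.
ruler='123456789012345678901234567890123456789012345678901234567890'

if command -v gawk >/dev/null 2>&1; then
  # Widths: 1 (pos 1), 4 (2-5), 5 (6-10), 10 (11-20), 14 (21-34), 6 (35-40).
  # OFS="," mimics paste -d"," from the original script (an assumption about
  # the desired output format).
  fw_result=$(printf '%s\n' "$ruler" |
    gawk 'BEGIN { FIELDWIDTHS = "1 4 5 10 14 6"; OFS = "," } { print $2, $4, $6 }')
  echo "$fw_result"
else
  echo "gawk not found; FIELDWIDTHS is a gawk extension" >&2
fi
```

On the ruler line this should print 2345,1234567890,567890, which makes it easy to check that each field starts and ends where intended.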

Scrutinizer · October 29, 2009, 3:04am

The awk commands above aren't working right: substr() takes a start position and a length, not a start and an end position. IMO they should be like this:

awk '{print substr($1,2,4)","substr($1,11,10)","substr($1,35,6)}' file > result.txt
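A quick way to check the offsets is to run both tools on a ruler line where every character shows its own position. This is only a sketch; the variable names are made up for the illustration.

```shell
# Ruler line: the character at position N is the last digit of N.
ruler='123456789012345678901234567890123456789012345678901234567890'

# cut -c takes inclusive start-end ranges.
by_cut=$(printf '%s\n' "$ruler" | cut -c 2-5)

# substr(s, start, length): length = end - start + 1, so 2-5 becomes (2, 4).
by_substr=$(printf '%s\n' "$ruler" | awk '{ print substr($0, 2, 4) }')

# Passing the end position as the length returns one character too many here.
too_long=$(printf '%s\n' "$ruler" | awk '{ print substr($0, 2, 5) }')

# All three fields from the question, comma-separated.
all_fields=$(printf '%s\n' "$ruler" |
  awk '{ print substr($0, 2, 4) "," substr($0, 11, 10) "," substr($0, 35, 6) }')

printf '%s %s %s\n%s\n' "$by_cut" "$by_substr" "$too_long" "$all_fields"
```

The first line of output should read 2345 2345 23456, and the second 2345,1234567890,567890.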

ripat · October 29, 2009, 3:40am

Just to illustrate the power of FIELDWIDTHS, I ran a benchmark on a test file with 1,192,310 lines.

gawk with substr():         0m5.071s
original awk with substr(): 0m4.738s
gawk with FIELDWIDTHS:      0m1.393s
mawk with substr():         0m0.871s

Boy that mawk thing is fast!

daytripper1021 · October 29, 2009, 3:46am

Wow, thank you so much for your quick replies! I was able to shorten my shell script (removing the cut commands) by using the following suggestion, and it worked!

awk '{print substr($1,12,11)","substr($1,42,11)","substr($2,2,14)}' file

Thank you very much!

Btw, what's that mawk command? I can't find it on the man pages of my server (I guess it's not installed).

ripat · October 29, 2009, 3:52am

mawk is an optimized interpreter for the awk language.

mawk - pattern scanning and text processing language
Don't MAWK AWK - the fastest and most elegant big data munging language! - Brendan O'Connor's Blog
