I have searched the forum and tried different options. One of the options works, but it is very slow. The file has millions and millions of records.
It is a tab-delimited file which contains two types of records, metadata and detail records:
M PARTNER 8 LAST_BOOKED_DATE D YYYYMMDD
M PARTNER 8 TRIPS_YTD A 11 TRIPS_TOTAL
D NAME FIRST LAST 209 N SANBORN AVE
D NAME FIRST LAST 6997 COUNTY ROAD D
I need to split the file into two files based on the first character: all records that start with 'M' go into one file, and all records that start with 'D' go into another file.
The following code works, but it is too slow... Is there a faster way of accomplishing this?
#!/usr/bin/ksh
while read line
do
    char=`echo "$line" | cut -c1`
    if [ "$char" = "M" ]; then
        echo "$line" >> M.txt
    else
        echo "$line" >> D.txt
    fi
done < head10000.out
exit 0
Any help would be appreciated.
Thank You,
Madhu
'Awk' will be the quick way to do this. Unfortunately I'm not an expert in the syntax, but from experience it's much faster than a standard shell script.
Cheers
Helen
Thank you, Helen...
I believe something like this should work:
awk -v logfile=${1:-"stdin"} '{ print > logfile"-"$1 }' "$1"
But it is throwing an error:
awk: syntax error near line 1
awk: bailing out near line 1
Are there any awk experts out there who can resolve this?
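That syntax error is most likely the awk binary rather than the program: on Solaris, the default /usr/bin/awk predates POSIX and rejects -v, and many awks also need the redirection target parenthesized. A sketch of the same idea (use nawk on Solaris; the prefix and the tiny stand-in input here are just illustrations, not your real file):

```shell
# Sketch: POSIX awk (nawk on Solaris) accepts -v, and parentheses
# around the redirection target avoid another common parse error.
printf 'M one\nD two\nM three\n' > sample.txt   # tiny stand-in input
awk -v prefix=out '{ print > (prefix "-" $1) }' sample.txt
```

This writes each record to a file named after the prefix plus the first field, e.g. out-M and out-D.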
How fast does it need to be? Is a simple grep faster than looping within a script?
grep "^M" head10000.out > M.txt
grep "^D" head10000.out > D.txt
awk '{
    if ($0 ~ /^M/) print > "M.txt"
    else print > "D.txt"
}' head10000.out
Thank you very much....
There are 13,019,984 records in the file:
5 M records, and the rest of them are D records.
It took 10-11 minutes to run the awk program... Is that a reasonable speed?
Thanks again for all the help!
21,700 records per second seems good to me.
You may be able to split the work into two tasks and improve performance.
awk '/^M/' head10000.out > M.txt &
awk '/^D/' head10000.out > D.txt &
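One caveat if those two passes go inside a script: wait for both background jobs before anything reads the output files. A minimal sketch with a stand-in input file:

```shell
# Sketch: each pass reads the input independently, so they can run in
# parallel; wait blocks until both have finished writing.
printf 'M rec1\nD rec2\nD rec3\n' > sample.txt   # stand-in input
awk '/^M/' sample.txt > M.txt &
awk '/^D/' sample.txt > D.txt &
wait   # without this, a script could exit before the files are complete
```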
I believe grep would do it faster than awk. Try johnywilkins' suggestion and compare the time taken.
awk seems to be faster on my system.
Test with 500,000 records:
#! /usr/bin/ksh
print "Single task awk"
time {
    > M.txt
    > D.txt
    nawk '{
        if ($0 ~ /^M/) print $0 > "M.txt"
        else print $0 > "D.txt"
    }' test.dat
}
ls -altr M.txt D.txt

print "Two task awk"
time {
    > M.txt
    > D.txt
    nawk '/^M/' test.dat >> M.txt &
    nawk '/^D/' test.dat >> D.txt &
    wait
}
ls -altr M.txt D.txt

print "4-way awk"
time {
    > M.txt
    > D.txt
    nawk 'NR <  250000 && /^M/' test.dat >> M.txt &
    nawk 'NR >= 250000 && /^M/' test.dat >> M.txt &
    nawk 'NR <  250000 && /^D/' test.dat >> D.txt &
    nawk 'NR >= 250000 && /^D/' test.dat >> D.txt &
    wait
}
ls -altr M.txt D.txt

print "Grep"
time {
    > M.txt
    > D.txt
    grep "^M" test.dat > M.txt &
    grep "^D" test.dat > D.txt &
    wait
}
ls -altr M.txt D.txt
results:
Single task awk
real 3m12.40s
user 0m4.69s
sys 0m9.63s
-rw-r--r-- 1 ... 34770850 ... D.txt
-rw-r--r-- 1 ... 46222065 ... M.txt
Two task awk
real 0m14.12s
user 0m5.93s
sys 0m1.55s
-rw-r--r-- 1 ... 34770850 ... D.txt
-rw-r--r-- 1 ... 46222065 ... M.txt
4-way awk
real 0m16.14s
user 0m10.52s
sys 0m2.48s
-rw-r--r-- 1 ... 34770850 ... D.txt
-rw-r--r-- 1 ... 46222065 ... M.txt
Grep
real 0m22.70s
user 0m1.50s
sys 0m3.24s
-rw-r--r-- 1 ... 34770850 ... D.txt
-rw-r--r-- 1 ... 46222065 ... M.txt
tmarikle:
awk seems to be faster on my system. [benchmark script and timings quoted in full in the previous post]
How do you find the time taken to execute the script?
I have results like this with the same file. This is a Sun Solaris machine:
Single task awk
real 2m21.59s
user 1m11.76s
sys 0m58.18s
-rw-r--r-- 1 develop 722 May 18 13:48 M.txt
-rw-r--r-- 1 develop 3055795862 May 18 13:50 D.txt
Two task awk
real 1m4.71s
user 1m25.91s
sys 0m33.18s
-rw-r--r-- 1 develop 3055795623 May 18 13:51 D.txt
-rw-r--r-- 1 develop 722 May 18 13:51 M.txt
4-way awk
real 1m10.57s
user 2m30.03s
sys 0m45.67s
-rw-r--r-- 1 develop 722 May 18 13:52 M.txt
-rw-r--r-- 1 develop 3055795623 May 18 13:52 D.txt
Grep
real 0m29.55s
user 0m9.62s
sys 0m30.49s
-rw-r--r-- 1 develop 722 May 18 13:53 M.txt
-rw-r--r-- 1 develop 3055795623 May 18 13:53 D.txt
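On the timing question: those figures come from the shell's time keyword, which prints real (wall-clock), user, and sys CPU time for whatever command or brace-grouped block follows it. For example (stand-in file name, not your real input):

```shell
# "time" prefixes any command; its timing report goes to stderr, so
# the command's own redirections are unaffected.
printf 'M a\nD b\n' > sample.txt   # stand-in input
time grep "^M" sample.txt > M.txt
```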
Your grep is faster than my Sun Solaris grep by a long shot (comparatively speaking, of course).
This ksh script should be faster than the original:
#! /usr/bin/ksh
exec < inputfile            # read the input on stdin
IFS=""                      # preserve leading/trailing whitespace
exec 3>M.out 4>D.out        # open each output file exactly once
while read -r line ; do
    if [[ $line = M* ]] ; then
        print -r -u3 "$line"
    else
        print -r -u4 "$line"
    fi
done
exit 0
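The speedup here comes from opening each output file exactly once and never forking per line (the original loop paid for an `echo | cut` pipeline plus two file opens on every record). The same idea works in plain POSIX sh, for shells without print -u; file names below are just illustrations:

```shell
# Sketch: one open per output file, a case pattern instead of a cut fork.
printf 'M a\nD b\nM c\n' > sample.txt   # stand-in input
exec 3> M.out 4> D.out                  # open each output exactly once
while IFS= read -r line; do
  case $line in
    M*) printf '%s\n' "$line" >&3 ;;
    *)  printf '%s\n' "$line" >&4 ;;
  esac
done < sample.txt
exec 3>&- 4>&-                          # close the descriptors
```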
Thank you very much guys...I did learn a lot today!!