I have searched the forum and tried different options. One of the options works, but it is very slow. The file has millions and millions of records.
It is a tab-delimited file which contains two types of records, metadata and detail records:
M PARTNER 8 LAST_BOOKED_DATE D YYYYMMDD
M PARTNER 8 TRIPS_YTD A 11 TRIPS_TOTAL
D NAME FIRST LAST 209 N SANBORN AVE
D NAME FIRST LAST 6997 COUNTY ROAD D
I need to split the file into two files based on the first character: all records that start with 'M' go into one file, and all records that start with 'D' go into another file.
The following code works, but it is too slow... Is there a faster way of accomplishing this?
#!/usr/bin/ksh
while read line
do
    char=`echo "$line" | cut -c1`
    if [ "$char" = "M" ]; then
        echo "$line" >> M.txt
    else
        echo "$line" >> D.txt
    fi
done < head10000.out
exit 0
Any help would be appreciated.
Thank You,
Madhu
'Awk' will be the quick way to do this. Unfortunately I'm not an expert in the syntax, but from experience it's much faster than a standard shell script.
Cheers
Helen
Thank you, Helen...
I believe something like this should work:
awk -v logfile=${1:-"stdin"} '{ print > logfile"-"$1 }' "$1"
But it is throwing an error:
awk: syntax error near line 1
awk: bailing out near line 1
Are there any awk experts out there who can resolve this?
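That syntax error is most likely the awk binary rather than the program: on Solaris, the default /usr/bin/awk predates POSIX and rejects -v, and many awks also need the redirection target parenthesized. A sketch of the same idea (use nawk on Solaris; the prefix and the tiny stand-in input here are just illustrations, not your real file):

```shell
# Sketch: POSIX awk (nawk on Solaris) accepts -v, and parentheses
# around the redirection target avoid another common parse error.
printf 'M one\nD two\nM three\n' > sample.txt   # tiny stand-in input
awk -v prefix=out '{ print > (prefix "-" $1) }' sample.txt
```

This writes each record to a file named after the prefix plus the first field, e.g. out-M and out-D.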
How fast does it need to be? Is a simple grep faster than looping within a script?
grep "^M" head10000.out > M.txt
grep "^D" head10000.out > D.txt
awk '{
    if ($0 ~ /^M/) print > "M.txt"
    else print > "D.txt"
}' head10000.out
Thank you very much....
There are 13,019,984 records in the file:
5 M records, and the rest of them are D records.
It took 10-11 minutes to run the awk program... Is that a reasonable speed?
Thanks again for all the help!
21,700 records per second seems good to me.
You may be able to split the work into two tasks and improve performance.
awk '/^M/' head10000.out > M.txt &
awk '/^D/' head10000.out > D.txt &
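One caveat if those two passes go inside a script: wait for both background jobs before anything reads the output files. A minimal sketch with a stand-in input file:

```shell
# Sketch: each pass reads the input independently, so they can run in
# parallel; wait blocks until both have finished writing.
printf 'M rec1\nD rec2\nD rec3\n' > sample.txt   # stand-in input
awk '/^M/' sample.txt > M.txt &
awk '/^D/' sample.txt > D.txt &
wait   # without this, a script could exit before the files are complete
```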
I believe grep would do it faster than awk. Try johnywilkins' suggestion and compare the time taken.
awk seems to be faster on my system.
Test with 500,000 records:
#! /usr/bin/ksh
print "Single task awk"
time {
    > M.txt
    > D.txt
    nawk '{
        if ($0 ~ /^M/) print $0 > "M.txt"
        else print $0 > "D.txt"
    }' test.dat
}
ls -altr M.txt D.txt

print "Two task awk"
time {
    > M.txt
    > D.txt
    nawk '/^M/' test.dat >> M.txt &
    nawk '/^D/' test.dat >> D.txt &
    wait
}
ls -altr M.txt D.txt

print "4-way awk"
time {
    > M.txt
    > D.txt
    nawk 'NR <  250000 && /^M/' test.dat >> M.txt &
    nawk 'NR >= 250000 && /^M/' test.dat >> M.txt &
    nawk 'NR <  250000 && /^D/' test.dat >> D.txt &
    nawk 'NR >= 250000 && /^D/' test.dat >> D.txt &
    wait
}
ls -altr M.txt D.txt

print "Grep"
time {
    > M.txt
    > D.txt
    grep "^M" test.dat > M.txt &
    grep "^D" test.dat > D.txt &
    wait
}
ls -altr M.txt D.txt
results:
Single task awk
real 3m12.40s
user 0m4.69s
sys 0m9.63s
-rw-r--r-- 1 ... 34770850 ... D.txt
-rw-r--r-- 1 ... 46222065 ... M.txt
Two task awk
real 0m14.12s
user 0m5.93s
sys 0m1.55s
-rw-r--r-- 1 ... 34770850 ... D.txt
-rw-r--r-- 1 ... 46222065 ... M.txt
4-way awk
real 0m16.14s
user 0m10.52s
sys 0m2.48s
-rw-r--r-- 1 ... 34770850 ... D.txt
-rw-r--r-- 1 ... 46222065 ... M.txt
Grep
real 0m22.70s
user 0m1.50s
sys 0m3.24s
-rw-r--r-- 1 ... 34770850 ... D.txt
-rw-r--r-- 1 ... 46222065 ... M.txt
tmarikle:
awk seems to be faster on my system. [benchmark script and timings quoted in full in the previous post]
How do you find the time taken to execute the script?
I have results like this with the same file. This is a Sun Solaris machine:
Single task awk
real 2m21.59s
user 1m11.76s
sys 0m58.18s
-rw-r--r-- 1 develop 722 May 18 13:48 M.txt
-rw-r--r-- 1 develop 3055795862 May 18 13:50 D.txt
Two task awk
real 1m4.71s
user 1m25.91s
sys 0m33.18s
-rw-r--r-- 1 develop 3055795623 May 18 13:51 D.txt
-rw-r--r-- 1 develop 722 May 18 13:51 M.txt
4-way awk
real 1m10.57s
user 2m30.03s
sys 0m45.67s
-rw-r--r-- 1 develop 722 May 18 13:52 M.txt
-rw-r--r-- 1 develop 3055795623 May 18 13:52 D.txt
Grep
real 0m29.55s
user 0m9.62s
sys 0m30.49s
-rw-r--r-- 1 develop 722 May 18 13:53 M.txt
-rw-r--r-- 1 develop 3055795623 May 18 13:53 D.txt
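On the timing question: those figures come from the shell's time keyword, which prints real (wall-clock), user, and sys CPU time for whatever command or brace-grouped block follows it. For example (stand-in file name, not your real input):

```shell
# "time" prefixes any command; its timing report goes to stderr, so
# the command's own redirections are unaffected.
printf 'M a\nD b\n' > sample.txt   # stand-in input
time grep "^M" sample.txt > M.txt
```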
Your grep is faster than my Sun Solaris grep by a long shot (comparatively speaking, of course).
This ksh script should be faster than the original:
#! /usr/bin/ksh
exec < inputfile            # read the input on stdin
IFS=""                      # preserve leading/trailing whitespace
exec 3>M.out 4>D.out        # open each output file exactly once
while read -r line ; do
    if [[ $line = M* ]] ; then
        print -r -u3 "$line"
    else
        print -r -u4 "$line"
    fi
done
exit 0
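The speedup here comes from opening each output file exactly once and never forking per line (the original loop paid for an `echo | cut` pipeline plus two file opens on every record). The same idea works in plain POSIX sh, for shells without print -u; file names below are just illustrations:

```shell
# Sketch: one open per output file, a case pattern instead of a cut fork.
printf 'M a\nD b\nM c\n' > sample.txt   # stand-in input
exec 3> M.out 4> D.out                  # open each output exactly once
while IFS= read -r line; do
  case $line in
    M*) printf '%s\n' "$line" >&3 ;;
    *)  printf '%s\n' "$line" >&4 ;;
  esac
done < sample.txt
exec 3>&- 4>&-                          # close the descriptors
```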
Thank you very much guys...I did learn a lot today!!