I want to remove the first 6 characters of every second line, that is, of every line that neither starts with "@HWUSI" nor consists only of "+".
I have written the following shell script (trim.sh), but it runs very slowly (more than 12 hours for a file with 8,000,000 lines; I have not had the patience to let it run longer than that).
###
sh trim.sh [filename]
###
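#!/bin/bash
if [ "$1" == "" ]; then
    echo 'Specify file'
    exit 1
else
    FILE="$1"
fi
while read line
do
    # Header lines and "+" separator lines are printed unchanged
    if [ "$line" == '+' ] || [ "${line:0:6}" == '@HWUSI' ];
    then
        echo "$line"
    else
        # Every other line loses its first 6 characters
        VAR=`echo "$line" | cut -c1-6 --complement`
        echo "$VAR"
    fi
done < "$FILE"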
####
Can anyone suggest a way to make it run faster, or have I written it in such a way that it will run forever?
Hoping for some help. Thanks.
And, if you have mawk available on your platform, give it a try. It's lightning fast! Just did a test on a 1 million line test file:
awk: 10 sec.
mawk: 0.3 sec.
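The awk command behind those timings is not shown here, so the following is only a sketch of what an awk (or mawk) version of the trim could look like. The function name `trim_awk` and the four sample lines are made up for illustration. It assumes the file strictly alternates between lines to keep and lines to trim, as in the data described above.

```shell
# Sketch only: trim_awk and the sample lines below are hypothetical.
trim_awk() {
    # On even-numbered lines keep everything from column 7 onward;
    # the trailing "1" is an always-true pattern that prints every line.
    awk 'NR % 2 == 0 { $0 = substr($0, 7) } 1'
}

# Header, sequence, "+", quality: only lines 2 and 4 are trimmed.
printf '%s\n' '@HWUSI-1' 'AAAAAACGTACGT' '+' 'IIIIIIHHHHHHH' | trim_awk
```

Substituting `mawk` for `awk` inside the function should work unchanged, since only `NR`, `substr` and the default print action are used.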
Yeah, it's much faster, heheheh. I concatenated 150,000 instances of that `head -n8` sample data to create a 52 MB file containing over a million lines:
$ du -sh data
52M data
$ wc data
1074672 1074672 54942606 data
Test runs show sed completes in a second and a half, while your shell script, killed after a minute of real time, had processed only a small fraction of the file (16,825 of 1,074,672 lines). Roughly, we're talking about 1.5 seconds versus about an hour, a 2400x improvement, with this data on my hardware:
$ time sed 'n;s/......//' data > dump
real 0m1.545s
user 0m1.123s
sys 0m0.407s
$ wc -l dump
1074672 dump
$ time ./trim.sh data > dump
^C
real 1m1.610s
user 0m12.378s
sys 0m45.385s
$ wc -l dump
16825 dump
Alister
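The large sys time in the trim.sh run suggests that most of the cost is starting an `echo | cut` pipeline for every trimmed line. If the loop has to stay in shell for some reason, a sketch along these lines avoids the extra processes by using the shell's own prefix removal. The function name `trim_builtin` and the sample lines are hypothetical, and no timings were taken for it; sed is still likely to be much faster.

```shell
# Sketch only: trim_builtin is a hypothetical name, not part of trim.sh.
trim_builtin() {
    while IFS= read -r line; do
        case $line in
            # Header and separator lines pass through untouched.
            '+'|'@HWUSI'*) printf '%s\n' "$line" ;;
            # ${line#??????} strips the first six characters using the shell
            # itself; lines shorter than six characters are left unchanged.
            *) printf '%s\n' "${line#??????}" ;;
        esac
    done
}

printf '%s\n' '@HWUSI-1' 'AAAAAACGTACGT' '+' 'IIIIIIHHHHHHH' | trim_builtin
```

It reads standard input, so a run on a real file would look like `trim_builtin < data > dump`.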
Alister,
Can you explain how "n;" means every 2nd line? I understand the obvious expression.
sed '2~1 s/......//'
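A small sketch of what happens, using made-up sample lines and hypothetical function names. Without `-n`, the `n` command prints the current line and then loads the next one into the pattern space, so the `s` command that follows only ever sees lines 2, 4, 6 and so on. With GNU sed's `first~step` address form, the even lines would be `2~2`; `2~1` selects every line from line 2 onward.

```shell
# Sketch only: sample, trim_n and trim_step are hypothetical helper names.
sample() { printf '%s\n' '111111one' '222222two' '333333three' '444444four'; }

# n prints the current line and reads the next one,
# so the substitution is applied to every second line only.
trim_n() { sed 'n;s/......//'; }

# GNU sed only: first~step address, here every 2nd line starting at line 2.
trim_step() { sed '2~2 s/......//'; }

sample | trim_n
sample | trim_step
```

Both commands print `111111one`, `two`, `333333three`, `four`.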