Deletion of characters in every second line

Hi,

I have a file with four types of lines:

@HWUSI-EAS656_0022:8:1:175:376#0/1
CTCGACACGCTGGTGGAGATCCTCGGCCTGACCGAGGACGACCGGGCCATCTTCGAGCAGCGC
+
BBBBBBBA?AABB;<=B9<===AA1AA==>99===.9?:9A4A956%%%%%%%%%%%%%%%%%

I want to remove the first 6 characters of every second line, i.e. of the lines that do not start with "@HWUSI" and do not consist solely of "+".

I have made the following shell script (trim.sh), but it runs very, very slowly (more than 12 hours for a file with 8,000,000 lines; I have not been patient enough to let it run longer than that).

###
sh trim.sh [filename]
###
#!/bin/bash
if [ "$1" = "" ]; then
    echo 'Specify file'
    exit 1               # without this, the script falls through with FILE unset
fi
FILE="$1"
while IFS= read -r line
do
    if [ "$line" = '+' ] || [ "${line:0:6}" = '@HWUSI' ]; then
        echo "$line"
    else
        VAR=`echo "$line" | cut -c1-6 --complement`
        echo "$VAR"
    fi
done < "$FILE"
####

Can anyone come up with a suggestion to make it run faster, or have I written it in such a way that it will run for eternity?
Hope for some help. Thanks.

Hi,

Try this:

awk '!/^[@+]/{print substr($0,7)}/^[@+]/{print}' file
awk '!/^[@+]/{if(f){f=0;$0=substr($0,7)}else{f=1}}1' file

And, if you have mawk available on your platform, give it a try. It's lightning fast! Just did a test on a 1 million line test file:
awk:  10 sec.
mawk: 0.3 sec.
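
If it helps to see what's happening, here is the first one-liner spread out with comments (the same program, just reformatted):

awk '
    /^[@+]/  { print }                # "@..." headers and "+" separators: print unchanged
    !/^[@+]/ { print substr($0, 7) }  # all other lines: print from the 7th character on
' file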

Hi, thanks for the responses. I tried both; however, I got an error message about the "!":

interaction[rlmn]:/home/projects/rlmn/sommerdata/run2> awk '!/^[@+]/{print substr($0,7)}/^[@+]/{print}' MS_Pa-plex_1_tag1.sanfastq 
Bad ! arg selector.

interaction[rlmn]:/home/projects/rlmn/sommerdata/run2> awk '!/^[@+]/{if(f){f=0;$0=substr($0,7)}else{f=1}}1' MS_Pa-plex_1_tag1.sanfastq 
Bad ! arg selector.

This is the head of my input file:

interaction[rlmn]:/home/projects/rlmn/sommerdata/run2> head -n8 MS_Pa-plex_1_tag1.sanfastq
@HWUSI-EAS656_0034:7:12:1:291#0/1
AGCNGTGAGTATGGGATCGCCGACCTGCGCGGCACGCATGACGCGGAAGTGATCGCNGCGCTGCGGCGCATCGCCGNNNNNNN
+
@B<%+;B3??3AA<6=?A>BAA7?;3??>;@A><'B'6=;7,+2:######################################
@HWUSI-EAS656_0034:7:12:2:1814#0/1
AGCNGTCCTCGATGCGGGTCCGTGCATAGTGTTCGCCGTCCTGGCTCTGCACATACTCGCCAGGGCAGTCGTAGACNNNNNNN
+
7?;%<ABBAA;?@7@AA??>>@;@A<ABB2;===66>=1?1/5;==55=/9277#############################
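
The "Bad ! arg selector." message suggests those one-liners were run from csh or tcsh, which perform history expansion on "!" even inside single quotes. One workaround (a sketch doing the same thing as the first suggestion, with the negation rewritten so no "!" appears) is:

awk '/^[@+]/{print;next}{print substr($0,7)}' MS_Pa-plex_1_tag1.sanfastq

But you can sidestep the shell issue entirely with sed:
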
sed 'n;s/......//'

Test run with the data you provided in your latest post:

$ sed 'n;s/......//' MS_Pa-plex_1_tag1.sanfastq 
@HWUSI-EAS656_0034:7:12:1:291#0/1
GAGTATGGGATCGCCGACCTGCGCGGCACGCATGACGCGGAAGTGATCGCNGCGCTGCGGCGCATCGCCGNNNNNNN
+
B3??3AA<6=?A>BAA7?;3??>;@A><'B'6=;7,+2:######################################
@HWUSI-EAS656_0034:7:12:2:1814#0/1
CCTCGATGCGGGTCCGTGCATAGTGTTCGCCGTCCTGGCTCTGCACATACTCGCCAGGGCAGTCGTAGACNNNNNNN
+
BBAA;?@7@AA??>>@;@A<ABB2;===66>=1?1/5;==55=/9277#############################

Regards,
Alister


Heps,
it works. And much, much faster than 12 hours. I guess I need to pay more attention to sed and awk...
Thanks

You're quite welcome.

Yeah, it's much faster, heheheh. I concatenated 150,000 instances of that `head -n8` sample data to create a 52 MB file containing over a million lines:

$ du -sh data
 52M    data
$ wc data   
 1074672 1074672 54942606 data

Test runs show sed completes in a second and a half, while your shell script, killed after a minute of real time had elapsed, had only processed a small fraction of the file. Roughly, we're talking about 1.5 seconds versus an hour, a 2400x improvement, with this data on my hardware:

$ time sed 'n;s/......//' data > dump

real    0m1.545s
user    0m1.123s
sys     0m0.407s
$ wc -l dump
 1074672 dump

$ time ./trim.sh data > dump
^C

real    1m1.610s
user    0m12.378s
sys     0m45.385s
$ wc -l dump
   16825 dump
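
For what it's worth, most of that shell-loop time is process creation: each input line forks an echo piped into cut. A rough sketch of the same loop using only bash builtins (substring expansion in place of cut; same logic as trim.sh) closes much of that gap, although sed still wins easily:

while IFS= read -r line
do
    if [ "$line" = '+' ] || [ "${line:0:6}" = '@HWUSI' ]; then
        printf '%s\n' "$line"
    else
        printf '%s\n' "${line:6}"    # bash substring expansion: drop the first 6 characters
    fi
done < data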

Regards,
Alister

What about the OP's request?

Hi, I am not sure what "OP" means?

[quote=alister;302427468]
You're quite welcome.

$ time sed 'n;s/......//' data > dump

Alister
[/quote]

Alister,

Can you explain how "n;" means every 2nd line? I understand the obvious expression:

sed '2~2 s/......//'
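
For reference, "n" prints the current pattern space and then replaces it with the next input line, so by the time the substitution runs, sed is always holding an even-numbered line. The same command unrolled, with comment lines:

sed '
    # "n" prints the current line, then loads the next input line;
    # the substitution below therefore only sees lines 2, 4, 6, ...
    n
    s/......//
' file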