Substitute newline with tab at designated field separator

yifangt · April 2, 2013, 5:37pm

Hello, I need to replace newline with tab at certain lines of the file (every four lines is a record).

infile.fq:

@GAIIX-300
ATAGTCAAAT
+
_SZS^\\\cd
@GAIIX-300
CATACGACAT
+
hhghfdffhh
@GAIIX-300
GACGACGTAT
+
gggfc[hh]f

outfile:

@GAIIX-300    ATAGTCAAAT    +    _SZS^\\\cd
@GAIIX-300    CATACGACAT    +    hhghfdffhh
@GAIIX-300    GACGACGTAT    +    gggfc[hh]f

I used

 sed '/^@/N;/^@/N;/^@/N;s/\n/\t/g' infile.fq

without full understanding. I figured out this oneliner when I tried to understand newline substitution in sed.
1) Can anyone explain the three consecutive /^@/N; or N; which works too, for me (mean next line right)?
2) How can I use another command tr to do the job like:

!\n@ | tr  '\n' '\t' < infile.fq > outfile.tab

by adding the condition !\n@ to filter the record separator, which is @ here? I know awk can do the job much easier with RS="@", OFS="\t",

awk 'BEGIN{RS="@"; OFS="\t"} {print $1, $2, $3, $4}' infile.fq

but I want to understand how sed and tr work, if they can, in this case.
Thanks a lot!
YF

rdrtx1 · April 2, 2013, 5:57pm

try also:

paste - - - - < infile.fq > outfile

yifangt · April 2, 2013, 6:07pm

Thanks I forgot to mention that. Did you try the perl version, but not work out.

perl -pe 'BEGIN{$/="@\n"}s/\n/\t/g;$_.=$/'  infile.fq

What did I miss? Same thing as sed and tr for me to understand what is behind the scene. Thanks again!

hanson44 · April 2, 2013, 9:55pm

For sed, each N appends the next line to the pattern space. At the end of the script, sed prints out the four lines glommed together, with tab subsituted for newline:

$ sed "N; N; N; s/\n/\t/g" infile.fq
@GAIIX-300      ATAGTCAAAT      +       _SZS^\\\cd
@GAIIX-300      CATACGACAT      +       hhghfdffhh
@GAIIX-300      GACGACGTAT      +       gggfc[hh]f

Using the diagnostic l command, and turning off auto-print, makes it more clear what is going on:

$ sed -n "N; N; N; l; s/\n/\t/g" infile.fq
@GAIIX-300\nATAGTCAAAT\n+\n_SZS^\\\\\\cd$
@GAIIX-300\nCATACGACAT\n+\nhhghfdffhh$
@GAIIX-300\nGACGACGTAT\n+\ngggfc[hh]f$

I totally don't understand your tr example, would forget about using tr for this.

yifangt · April 2, 2013, 11:07pm

Thanks Hanson!
What's in my mind with tr is: replace "\n" with "\t" if the rows not connected by "\n" and "@" (That's why I wrote, !\n@, which is not a correct syntax, obviously! ). It seems I have to forget this strategy!
Now the "N" is clear in the command. Could you explain what the "1" does? Similar script I saw with awk Awk Command - Luke Jackson.

# if a line ends with a backslash, append the next line to it 
# (fails if there are multiple lines ending with backslash...)
 awk '/pattern/ {sub(/\n/,"\t"); getline t; print $0 t; next}; 1' infile

Thanks again!

hanson44 · April 2, 2013, 11:21pm

Be happy to. l (ell, not one) is a basic sed command that helps diagnose what is going on. l (ell) prints the pattern space in a special format, for debugging. l (ell) is never (or extremely rarely) used for production scripts.

So in the example, l (ell) is showing what the pattern space looks like immediately before running the s (substitute) command. It shows the embedded \n characters that N introduced at each step. It also shows a $ at the end of the line. There is not really a $ there. It is just part of the special display format of the l command, to mark the end of the pattern space.

I think the mnemonic for l is "line", or maybe "list". Not sure.

BTW, if pattern space if long, l (ell) will wrap at 70 characters, which is usually not desirable. You could use "l 0" to run the l command without word wrapping.

Scrutinizer · April 3, 2013, 12:28am

Note: '\t' is GNU sed only. Other seds need a hard TAB characters ( CTRL-V TAB )..

--
awk version:

awk 'ORS=NR%4?"\t":RS' file