awk if statement not printing entire field

pbluescript · June 29, 2012, 11:40am

I have an input that looks like this:

chr1    mm9_knownGene   utr3    3204563 3206102 0       -       .       gene_id "Xkr4"; transcript_id "uc007aeu.1";
chr1    mm9_knownGene   utr3    4280927 4283061 0       -       .       gene_id "Rp1"; transcript_id "uc007aew.1";
chr1    mm9_knownGene   utr3    4333588 4334680 0       -       .       gene_id "Rp1"; transcript_id "uc007aex.2";
chr1    mm9_knownGene   utr3    4481009 4481796 0       -       .       gene_id "Sox17"; transcript_id "uc007aey.1";
chr1    mm9_knownGene   utr3    4481009 4481796 0       -       .       gene_id "Sox17"; transcript_id "uc007aez.1";
chr1    mm9_knownGene   utr3    4481009 4481796 0       -       .       gene_id "Sox17"; transcript_id "uc007afa.1";
chr1    mm9_knownGene   utr3    4481009 4481796 0       -       .       gene_id "Sox17"; transcript_id "uc007afb.1";
chr1    mm9_knownGene   utr3    4481009 4481796 0       -       .       gene_id "Sox17"; transcript_id "uc007afc.1";
chr1    mm9_knownGene   utr3    4763279 4766544 0       -       .       gene_id "Mrpl15"; transcript_id "uc007aff.2";
chr1    mm9_knownGene   utr3    4763279 4764532 0       -       .       gene_id "Mrpl15"; transcript_id "uc007afd.2";

I am changing columns 4 and 5 with this awk line:

awk '{FS=OFS="\t"} {if($5-$4<=200) print $0; else if($5-$4>200) print $1,$2,$3,$4,$4+200,$6,$7,$8,$9}'

Which gives this output:

chr1    mm9_knownGene   utr3    3204563 3204763 0       -       .       gene_id
chr1    mm9_knownGene   utr3    4280927 4281127 0       -       .       gene_id "Rp1"; transcript_id "uc007aew.1";
chr1    mm9_knownGene   utr3    4333588 4333788 0       -       .       gene_id "Rp1"; transcript_id "uc007aex.2";
chr1    mm9_knownGene   utr3    4481009 4481209 0       -       .       gene_id "Sox17"; transcript_id "uc007aey.1";
chr1    mm9_knownGene   utr3    4481009 4481209 0       -       .       gene_id "Sox17"; transcript_id "uc007aez.1";
chr1    mm9_knownGene   utr3    4481009 4481209 0       -       .       gene_id "Sox17"; transcript_id "uc007afa.1";
chr1    mm9_knownGene   utr3    4481009 4481209 0       -       .       gene_id "Sox17"; transcript_id "uc007afb.1";
chr1    mm9_knownGene   utr3    4481009 4481209 0       -       .       gene_id "Sox17"; transcript_id "uc007afc.1";
chr1    mm9_knownGene   utr3    4763279 4763479 0       -       .       gene_id "Mrpl15"; transcript_id "uc007aff.2";
chr1    mm9_knownGene   utr3    4763279 4763479 0       -       .       gene_id "Mrpl15"; transcript_id "uc007afd.2";

It handles columns 4 and 5 fine, but truncates column 9 for only the first line. If I use this awk line, the output is fine:

awk '{FS=OFS="\t"} {print $0}'

However, this awk line reproduces the column 9 truncation:

awk '{FS=OFS="\t"} {print $1,$2,$3,$4,$5,$6,$7,$8,$9}'

I can add more columns and get more of that first line, which indicates it is treating the first line differently than the rest. I have manually edited the input file to ensure column 9 of the first line does not have any tabs. I have also moved the first line to the end of the file and the new first line shows the same truncation.
Any suggestions? What am I doing wrong here?

vbe · June 29, 2012, 11:49am

I tried on an AIX box and it doesn't get truncated:

n12:/sm/bin/wks $ resize
COLUMNS=142;
LINES=32;
export COLUMNS LINES;
n12:/sm/bin/wks $ cat testfile001 |awk '{FS=OFS="\t"} {if($5-$4<=200) print $0; \
else if($5-$4>200) print $1,$2,$3,$4,$4+200,$6,$7,$8,$9}'
chr1    mm9_knownGene   utr3    3204563 3206102 0       -       .       gene_id "Xkr4"; transcript_id "uc007aeu.1";
chr1    mm9_knownGene   utr3    4280927 4283061 0       -       .       gene_id "Rp1"; transcript_id "uc007aew.1";
chr1    mm9_knownGene   utr3    4333588 4334680 0       -       .       gene_id "Rp1"; transcript_id "uc007aex.2";
chr1    mm9_knownGene   utr3    4481009 4481796 0       -       .       gene_id "Sox17"; transcript_id "uc007aey.1";
chr1    mm9_knownGene   utr3    4481009 4481796 0       -       .       gene_id "Sox17"; transcript_id "uc007aez.1";
chr1    mm9_knownGene   utr3    4481009 4481796 0       -       .       gene_id "Sox17"; transcript_id "uc007afa.1";
chr1    mm9_knownGene   utr3    4481009 4481796 0       -       .       gene_id "Sox17"; transcript_id "uc007afb.1";
chr1    mm9_knownGene   utr3    4481009 4481796 0       -       .       gene_id "Sox17"; transcript_id "uc007afc.1";
chr1    mm9_knownGene   utr3    4763279 4766544 0       -       .       gene_id "Mrpl15"; transcript_id "uc007aff.2";
chr1    mm9_knownGene   utr3    4763279 4764532 0       -       .       gene_id "Mrpl15"; transcript_id "uc007afd.2";

mayursingru · June 29, 2012, 11:56am

I think this will solve your problem. It's self-explanatory.

 awk '{FS=" "} {if($5-$4<=200) print $0; else if($5-$4>200) print $1,$2,$3,$4,$4+200,$6,$7,$8,$9,$10,$11,$12}' test.txt

pbluescript · June 29, 2012, 12:05pm

I apologize, I should have been more clear. This is what I meant when I said "I can add more columns and get more of that first line" in my original post. I'm more concerned about why the first line is being treated differently than all the others.

vbe:

I tried on an AIX box and it doesn't get truncated.

Thanks for letting me know. I'm using an LSF cluster running RHEL5.3. Now I have a bit more to go to my admins with.

elixir_sinari · June 29, 2012, 12:08pm

The problem is quite simple. You need to change your awk program to the following one:

awk 'BEGIN{FS=OFS="\t"} {if($5-$4<=200) print $0; else if($5-$4>200) print $1,$2,$3,$4,$4+200,$6,$7,$8,$9}' inputfile

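Without the BEGIN block, awk has already split the first line into fields using the default FS (runs of spaces and tabs) by the time your FS assignment runs, so $9 on that line stops at the first space inside column 9. Your FS only takes effect from the second line onward.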
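To see the difference for yourself, here is a small sketch that counts fields per line with and without BEGIN. The two-line sample is made up for illustration (it is not from the file above), and the behaviour shown is what I would expect from gawk or any POSIX-conforming awk; other implementations may differ. The last command is just one possible shorter way to write the fix: it changes $5 in place instead of listing every column.

```shell
# Hypothetical two-line sample: 9 tab-separated columns, with spaces inside column 9.
input=$(printf 'chr1\tsrc\tutr3\t100\t900\t0\t-\t.\tgene_id "A"; transcript_id "a.1";\nchr1\tsrc\tutr3\t100\t250\t0\t-\t.\tgene_id "B"; transcript_id "b.1";\n')

# FS assigned in the main block: line 1 was already split on whitespace
# (12 fields), and only line 2 is split on tabs (9 fields).
without_begin=$(printf '%s\n' "$input" | awk '{FS="\t"} {print NF}')

# FS assigned in BEGIN: every line is split on tabs (9 fields each).
with_begin=$(printf '%s\n' "$input" | awk 'BEGIN{FS="\t"} {print NF}')

# Same fix, shorter form: cap column 5 at column 4 + 200, then print the line.
# Assigning to $5 rebuilds the line with OFS; untouched lines print as they were.
fixed=$(printf '%s\n' "$input" | awk 'BEGIN{FS=OFS="\t"} $5-$4>200 {$5=$4+200} 1')

printf 'NF without BEGIN:\n%s\n' "$without_begin"
printf 'NF with BEGIN:\n%s\n' "$with_begin"
printf 'fixed output:\n%s\n' "$fixed"
```

Setting the separators on the command line, as in awk -F'\t' -v OFS='\t' '...', should have the same effect as the BEGIN block, since both happen before the first line is read.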

pbluescript · June 30, 2012, 9:58am

Thank you! I'm glad it was something so simple.