Removing columns with dashes

Xterra · July 7, 2010, 3:35pm

My files look like this

I need to remove the columns where dashes are the majority. If any of the sequences has a character in that particular position, that character should be removed too, so the whole column goes. The IDs and Freqs should be kept intact. Thus, the resulting file should look like this

Thanks in advance

joeyg · July 7, 2010, 3:51pm

what about using

tr -d "-" <file1 >file2

to remove the dash characters.
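
A quick check of what this suggestion does, using a made-up two-row alignment (the data and the helper name strip_dashes are assumptions for illustration only). tr -d deletes the dash characters themselves, row by row, so rows whose dashes sit in different positions shift against each other instead of losing a shared column:

```shell
# Hypothetical helper wrapping the suggested command.
strip_dashes() { tr -d '-'; }

# Row 1 has its dash in column 3, row 2 in column 2.
printf 'AC-T\nA-GT\n' | strip_dashes
# prints:
# ACT
# AGT
```

The C and G started in different columns and now line up, which suggests that a column-aware approach is needed here. Dashes inside ID lines would be deleted as well.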

vgersh99 · July 7, 2010, 3:51pm

What have you tried so far?
You've had over 70 posts with multiple solutions given to you in sed/awk/perl for similarly formatted files. You should be able to come up with at least an initial approach.

Xterra · July 7, 2010, 4:07pm

I have initiated 16 threads for different 'actions' that happen to be used for the same type of data (DNA). When I decide to start a thread, it is because I do not know how to go about it; otherwise, I would not post it. I have initiated 23 threads in total, most of them for shell scripting but not exclusively (Red Hat, Windows & DOS, etc). Is it against the rules to post questions that deal with the same type of data? I could always change the format and get a solution that will work, but I do not see a reason to do so.
Please advise.

joeyg · July 7, 2010, 4:15pm

If there is a - in position 32 for any record, then position 32 should be deleted for all records?

If so, then we need to:
1. determine which positions hold - characters
2. delete those columns from every record

Correct?
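
The first of those two steps could be sketched as follows. This is only an illustration under assumptions that the thread does not confirm: ID lines start with ">" (as the Perl code posted later also assumes), each sequence sits on one line, and dash_columns is a hypothetical helper name. It reports every position that holds a dash in at least one record, which is the "any record" reading of the question rather than the "majority" reading:

```shell
# dash_columns FILE: print the 1-based positions that hold "-" in at least
# one sequence line. Assumption: lines starting with ">" are ID lines.
dash_columns() {
  awk '
    /^>/ { next }                                   # skip ID lines
    {
      if (length($0) > width) width = length($0)    # remember the longest sequence
      for (i = 1; i <= length($0); i++)
        if (substr($0, i, 1) == "-") seen[i] = 1    # mark this column
    }
    END { for (i = 1; i <= width; i++) if (i in seen) print i }
  ' "$1"
}
```

It uses substr() instead of an empty field separator, so it should not depend on gawk-specific behaviour.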

bartus11 · July 7, 2010, 4:17pm

Try this (it is one big command):

perl -nla -F"" -e 'if (!/^>/){$n++;for ($i=0;$i<=$#F;$i++){$a{$i}{$F[$i]}++}}END{for ($i=0;$i<=$#F;$i++){if ($a{$i}{"-"}/$n>0.5)\
{print $i}}}' file | awk -vFS="" -vOFS="" 'NR==FNR{a[$0+1]++}{for (i=1;i<=NF;i++) if (i in a) $i=""}1' - file

Xterra · July 7, 2010, 4:29pm

Thank you very much, I owe you many at this stage.
I tried your code

$ perl -nla -F"" -e 'if (!/^>/){$n++;for ($i=0;$i<=$#F;$i++){$a{$i}{$F[$i]}++}}END{for ($i=0;$i<=$#F;$i++){if ($a{$i}{"-"}/$n>0.5)\
> {print $i}}}' Input.txt | awk -vFS="" -vOFS="" 'NR==FNR{a[$0+1]++}{for (i=1;i<=NF;i++) if (i in a) $i=""}1' - Input.txt > Output.txt

and this is what I got

Backslash found where operator expected at -e line 1, near ")\"
(Missing operator before \?)
syntax error at -e line 1, near ")\"
syntax error at -e line 2, near ";}"
Execution of -e aborted due to compilation errors.
awk: cmd. line:1: fatal: cannot open file `Input.txt' for reading (No such file or directory)

Am I missing something?

---------- Post updated at 04:29 PM ---------- Previous update was at 04:25 PM ----------

Joeyg,
Exactly! Not only the dashes should be gone but also the character from any record in that particular position. In other words, the entire column should be removed (the IDs and Freqs should be kept intact).

bartus11 · July 7, 2010, 4:29pm

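Remove the backslash at the end of the first line of the command, right after ($a{$i}{"-"}/$n>0.5). It sits inside the single-quoted Perl code, so it reaches Perl literally instead of continuing the shell line.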

Xterra · July 7, 2010, 4:42pm

It worked! I was just wondering why I am getting 2 different numbers at the very top of the file

I ran it in Cygwin, could that be the problem? I will try on my Linux (Red Hat) box.

bartus11 · July 7, 2010, 4:49pm

Right. My fault ;). The trailing "1" prints every line, including the column numbers that awk reads from the pipe. Replace "1" at the end of the awk statement with "FNR!=NR", so it looks like:

$ perl -nla -F"" -e 'if (!/^>/){$n++;for ($i=0;$i<=$#F;$i++){$a{$i}{$F[$i]}++}}END{for ($i=0;$i<=$#F;$i++){if ($a{$i}{"-"}/$n>0.5)
> {print $i}}}' Input.txt | awk -vFS="" -vOFS="" 'NR==FNR{a[$0+1]++}{for (i=1;i<=NF;i++) if (i in a) $i=""}FNR!=NR' - Input.txt > Output.txt

Xterra · July 7, 2010, 5:02pm

I got the following error

syntax error at -e line 2, near ")
>"
syntax error at -e line 2, near ";}"
Execution of -e aborted due to compilation errors.

bartus11 · July 7, 2010, 5:05pm

The ">" character is the shell's continuation prompt from your output, and it isn't part of the code:

perl -nla -F"" -e 'if (!/^>/){$n++;for ($i=0;$i<=$#F;$i++){$a{$i}{$F[$i]}++}}END{for ($i=0;$i<=$#F;$i++){if ($a{$i}{"-"}/$n>0.5)
 {print $i}}}' Input.txt | awk -vFS="" -vOFS="" 'NR==FNR{a[$0+1]++}{for (i=1;i<=NF;i++) if (i in a) $i=""}FNR!=NR' - Input.txt > Output.txt

Xterra · July 7, 2010, 5:16pm

It is working great on Cygwin.
The Output file is empty when I run it on Red Hat though.

bartus11 · July 7, 2010, 5:21pm

I tested it on Oracle Enterprise Linux, which is based on Red Hat, so it should work.

Xterra · July 7, 2010, 7:28pm

The input file had a problem! I fixed it and now everything is working!
However, I ran into a problem when testing the code with real data. If the file does not contain any dashes then the output file is empty. In reality, if the file does not contain any gaps in the sequences then it should be left intact.
Thanks once again!
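
A likely explanation for the empty output: when no column is flagged, the Perl step prints nothing, so awk's first input is empty. NR and FNR then stay equal for every line of Input.txt, the NR==FNR block treats those lines as column numbers, and FNR!=NR never becomes true. One way around it is to stop relying on NR==FNR and read the file twice in a single awk call, setting a variable between the two reads. The sketch below is not the code from this thread. It assumes ID lines start with ">" and each sequence sits on one line, and the names drop_gap_columns and pass are made up for illustration:

```shell
# drop_gap_columns FILE: remove every column in which "-" makes up more than
# half of the sequence lines. Assumption: lines starting with ">" are ID lines.
drop_gap_columns() {
  awk '
    /^>/ { if (pass == 2) print; next }   # ID lines: ignored in pass 1, copied unchanged in pass 2
    pass == 1 {                           # pass 1: count sequences and dashes per column
      n++
      for (i = 1; i <= length($0); i++)
        if (substr($0, i, 1) == "-") dash[i]++
      next
    }
    {                                     # pass 2: rebuild the line without flagged columns
      out = ""
      for (i = 1; i <= length($0); i++)
        if (dash[i] / n <= 0.5) out = out substr($0, i, 1)
      print out
    }
  ' pass=1 "$1" pass=2 "$1"
}
```

It would be called as drop_gap_columns Input.txt > Output.txt. With no dashes in the file, no column crosses the threshold and the sequences come out unchanged. Unlike the pipeline above, this version leaves the ID lines untouched rather than deleting the same character positions from them.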