I need to remove the columns in which dashes are the majority. If any sequence has a character at that position, that character should be removed too, so the entire column is dropped. The IDs and Freqs should be kept intact. Thus, the resulting file should look like this:
What have you tried so far?
You've had over 70 posts with multiple solutions given to you in sed/awk/perl for similarly formatted files. You should be able to come up with at least an initial approach.
I have initiated 16 threads for different 'actions' that happen to be used for the same type of data (DNA). When I decide to start a thread, it is because I do not know how to go about it; otherwise, I would not post it. I have initiated 23 threads in total, most of them for shell scripting but not exclusively (Red Hat, Windows & DOS, etc). Is it against the rules to post questions that deal with the same type of data? I could always change the format and get a solution that will work, but I do not see a reason to do so.
Please advise.
Thank you very much. I owe you many at this stage.
I tried your code
$ perl -nla -F"" -e 'if (!/^>/){$n++;for ($i=0;$i<=$#F;$i++){$a{$i}{$F[$i]}++}}END{for ($i=0;$i<=$#F;$i++){if ($a{$i}{"-"}/$n>0.5)\
> {print $i}}}' Input.txt | awk -vFS="" -vOFS="" 'NR==FNR{a[$0+1]++}{for (i=1;i<=NF;i++) if (i in a) $i=""}1' - Input.txt > Output.txt
and this is what I got
Backslash found where operator expected at -e line 1, near ")\"
(Missing operator before \?)
syntax error at -e line 1, near ")\"
syntax error at -e line 2, near ";}"
Execution of -e aborted due to compilation errors.
awk: cmd. line:1: fatal: cannot open file `Input.txt' for reading (No such file or directory)
Am I missing something?
---------- Post updated at 04:29 PM ---------- Previous update was at 04:25 PM ----------
Joeyg,
Exactly! Not only the dashes should be gone but also the character from any record in that particular position. In other words, the entire column should be removed (the IDs and Freqs should be kept intact).
The input file had a problem! I fixed it and now everything is working!
However, I ran into a problem when testing the code with real data. If the file does not contain any dashes then the output file is empty. In reality, if the file does not contain any gaps in the sequences then it should be left intact.
Thanks once again!
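For the no-dash case, one option is to do both steps in a single awk program that reads the file twice, so that nothing depends on an intermediate list of column numbers. The following is only a sketch, not the code posted earlier in the thread. It assumes that header lines start with `>` (as the Perl test `!/^>/` suggests) and carry the ID and Freq, that all sequence lines are aligned to the same length, and that "majority" means more than half of the sequences have a dash in that column. The function name `strip_gap_columns` is made up for illustration.

```shell
# Sketch: drop alignment columns in which '-' makes up more than half of the
# sequence lines. Header lines (starting with '>') pass through unchanged.
# Assumption: sequences are aligned, i.e. column i means the same position
# in every sequence line.
strip_gap_columns() {
    # The same file is given twice: pass 1 counts dashes, pass 2 prints.
    awk '
        NR == FNR {                       # pass 1: tally dashes per column
            if ($0 !~ /^>/) {
                nseq++
                len = length($0)
                for (i = 1; i <= len; i++)
                    if (substr($0, i, 1) == "-") dash[i]++
            }
            next
        }
        /^>/ { print; next }              # pass 2: headers are kept intact
        {
            out = ""
            len = length($0)
            for (i = 1; i <= len; i++)
                if (dash[i] * 2 <= nseq)  # keep unless dashes are the majority
                    out = out substr($0, i, 1)
            print out
        }
    ' "$1" "$1"
}
```

It would be called as `strip_gap_columns Input.txt > Output.txt`. If no column has a dash majority, every character passes the test in the second pass, so a file without gaps comes out unchanged rather than empty.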