Getting the non-homogeneous letter rows from a text file

Lucky_Ali · June 6, 2013, 12:38am

I have a large tab-delimited file with the following format:

CCCCCGCCCCCCCCCCcCCCCCCCCCCCCCCCC 23 65 3 4
AAAAAAAAAAAAAAAAaAAAAAAAAAAAAAAAA 24 6 89 90
TGTTTTTTTTTTTTGGtTTTTTTTTTTTTTTTT 2 4 8 90
TTTT-TTTTTTTTTTTtTTTTTTTTTTTTTTTT 1 34 89 50
GGGGGGGGGGGGGGGGTGGGGGGGGGGGGGGGG 87 6 78 66
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT 8 78 45 61
AAAAATAAAAAAGGGAAAAAAAAAAAAAAAAAA 78 8 9 23

Each row has 33 letters drawn from A, T, G, and C (upper or lower case), and some rows may also contain '-'. I would like to extract the rows whose letters are not all the same: if one or more letters differ from the rest, grab that entire row.

This is the desired output:

CCCCCGCCCCCCCCCCcCCCCCCCCCCCCCCCC 23 65 3 4
TGTTTTTTTTTTTTGGtTTTTTTTTTTTTTTTT 2 4 8 90
GGGGGGGGGGGGGGGGTGGGGGGGGGGGGGGGG 87 6 78 66
AAAAATAAAAAAGGGAAAAAAAAAAAAAAAAAA 78 8 9 23

Please let me know the best way to do this in awk.

balajesuri · June 6, 2013, 12:55am

Here's a Perl one-liner:

perl -ane '/^(.)/ && ($x = $1); print if ($F[0] !~ /^[$x-]+$/i)' file
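Since the question asked for awk, here is a minimal awk sketch of the same filter. It assumes the letters sit in the first whitespace-separated field and that case differences and '-' should be ignored, which is what the desired output above suggests. The wrapper name nonhomogeneous is made up for illustration.

```shell
# Hypothetical helper: print rows whose first field contains more than one
# distinct letter, ignoring case and '-' (an assumption read off the sample).
nonhomogeneous() {
    awk '{
        s = toupper($1)            # compare case-insensitively
        gsub(/-/, "", s)           # drop gap characters before comparing
        first = substr(s, 1, 1)
        for (i = 2; i <= length(s); i++)
            if (substr(s, i, 1) != first) { print; next }
    }' "$@"
}
```

It could be run as `nonhomogeneous file`, or fed from a pipe when no file name is given.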

Lucky_Ali · June 9, 2013, 9:30am

Thanks, that worked.
I would like to get another awk solution. I have a file with the following format:

CCCCCGCCCCCCCCCCcCCCCCCCCCCCCCCC
AAAATAAAAAAAAAAAaAAAAAAAAAAAAAAA
TGTTTTTTTTTTTTGGtTTTTTTTTTTTTTTT
TTTT-TTTTTTTTTCTtTTTTTTTTTTTTTTT

Each row has 32 letters, made up of two or more distinct letters from A, T, G, and C (upper or lower case); some rows may also contain '-'. I would like to find each letter that occurs in a line and output the position number(s) at which that letter occurs.

Desired output is

CCCCCGCCCCCCCCCCcCCCCCCCCCCCCCCC C 1 2 3 4 5 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 G 6
AAAATAAAAAAAAAAAaAAAAAAAAAAAAAAA A 1 2 3 4 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 T 5
TGTTTTTTTTTTTTGGtTTTTTTTTTTTTTTT T 1 3 4 5 6 7 8 9 10 11 12 13 14 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 G 2 15 16
TTTT-TTTTTTTTTCTtTTTTTTTTTTTTTTT T 1 2 3 4 6 7 8 9 10 11 12 13 14 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 C 15

Please let me know the best way to do this in awk.

---------- Post updated at 09:30 AM ---------- Previous update was at 05:41 AM ----------

Is there a way to do it in either Perl or awk? Looking forward to suggestions.

elixir_sinari · June 9, 2013, 10:22am

Try:

perl -F -lape '%posits=(); 
map { push @{$posits{uc($F[$_])}}, $_+1  unless $F[$_] eq "-" } 0..$#F;
$_ .= " " . join(" ", map { $_, @{$posits{$_}} } keys %posits)' file
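For an awk version of the same idea, a rough sketch follows. It assumes one sequence per line, treats upper and lower case as the same letter, and skips '-'. Unlike `keys %posits`, which returns the letters in no particular order, this sketch lists letters in order of first appearance. The function name letter_positions is hypothetical.

```shell
# Hypothetical helper: append each letter and the positions where it occurs.
letter_positions() {
    awk '{
        n = 0; split("", pos)              # reset per-line state
        s = toupper($1)                    # fold case so "c" counts as "C"
        for (i = 1; i <= length(s); i++) {
            c = substr(s, i, 1)
            if (c == "-") continue         # gaps are not reported
            if (!(c in pos)) order[++n] = c    # remember first-seen order
            pos[c] = pos[c] " " i
        }
        out = $0
        for (k = 1; k <= n; k++) out = out " " order[k] pos[order[k]]
        print out
    }' "$@"
}
```

Called as `letter_positions file`, it prints each original line followed by letter/position groups.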

Lucky_Ali · July 15, 2013, 8:20pm

How do I modify the above code so that it reports only the letters that differ from the first letter of each line, along with the position number(s) of those letters?

Desired output

CCCCCGCCCCCCCCCCcCCCCCCCCCCCCCCC G 6
AAAATAAAAAAAAAAAaAAAAAAAAAAAAAAA T 5
TGTTTTTTTTTTTTGGtTTTTTTTTTTTTTTT G 2 15 16
TTTT-TTTTTTTTTCTtTTTTTTTTTTTTTTT C 15

---------- Post updated at 08:20 PM ---------- Previous update was at 04:51 PM ----------

Can we do this using awk?

Don_Cragun · July 15, 2013, 9:12pm

How many times do we need to answer the same question for you?

What was wrong with the answer you got to this question a year and a half ago: alphabet counting?

Lucky_Ali · July 15, 2013, 10:05pm

That was based on the smallest occurrence. But here I need the count of the letters that are different from the first one. The differing letter will not necessarily be the one with the smallest occurrence in each line.

Don_Cragun · July 15, 2013, 11:36pm

But it should be obvious that the two problems are similar, and there is a good chance that the solutions might be similar.

I understand that you're working on genes and that computer science is not your area of expertise. But this forum is intended to help people learn how to effectively use Linux and UNIX systems, not to be an unpaid pool of programmers writing software for your genetic research projects. With well over 150 posts to these forums, our help in solving more than 75 issues you have raised, and your watching what goes on in the UNIX & Linux Forums, we would expect that by now you have started to learn something from all of our help.

What have you tried to do to solve this problem on your own?

Lucky_Ali · July 15, 2013, 11:48pm

Thanks. I will try to solve it. I have learned so many things from this forum.
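
For reference, the "different from the first letter" variant could be sketched in awk roughly as below. This is only one possible approach, not a solution posted in the thread. It assumes the first character of each line is a letter (not '-'), folds case, and ignores '-'. The function name diff_from_first is made up.

```shell
# Hypothetical helper: report only letters that differ from the first letter,
# together with their positions; homogeneous lines are printed unchanged.
diff_from_first() {
    awk '{
        n = 0; split("", pos)              # reset per-line state
        s = toupper($1)
        first = substr(s, 1, 1)            # assumed to be a letter, not "-"
        for (i = 2; i <= length(s); i++) {
            c = substr(s, i, 1)
            if (c == "-" || c == first) continue
            if (!(c in pos)) order[++n] = c
            pos[c] = pos[c] " " i
        }
        out = $0
        for (k = 1; k <= n; k++) out = out " " order[k] pos[order[k]]
        print out
    }' "$@"
}
```

Running `diff_from_first file` on the sample should give lines of the form shown in the desired output, for example the sequence followed by `G 2 15 16`.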