AWK or python??

Hi!

I'm a experimental biochemist, but I'm few time ago, I started to do bioinformatics stuffs.
I have to filer the output results of some programs that I've been used.

I've learned bash and a little bit of AWK. But, now I have to do filters more complicated, and I don't know how! :confused:

What I should learn? AWK or python?

For example, here are one of the filters, that I have to do:

Original Data

NAMES      Match     %Score          Signals       Path
--------------------------------------------------------------------------------------------------

AK154051    1612    83.5666148263    GTAG,GTAG    AK154051chrX155,398,170,105,317,40,1,34,38,71,36,95,136,95,27,4,14,22,42,71,0,158,556,726,1327,1644,1788,1789,1823,1861,1933,1969,2064,2203,2302,2330,2336,2352,2375,2432
AK154051    1918    99.4297563504    GTAG,GTAG,GTAG,GTAG    AK154051chr1284,128,208,192,619,696,0,11097,13260,19852,21314,21933
AK154154    378    12.3368146214    ACAC,CTGC,CTAC    AK154154chr10211,17,2,36,1,440,133,0,211,228,230,30134,119467,147210
AK154154    575    18.7663185379    AATG    AK154154chr10416,195,6,320,412,55,247,0,416,612,619,50578,50990,51045
AK154356    1082    35.3710362864    GCAG,GTAG,GTAG,ATAC,GTAG,ATCC,GTAG    AK154356chr617,2,99,4,102,33,3,23,75,56,176,154,17,112,67,108,186,30,40,46,17,1,105,99,4,2,36,0,3528,3530,3629,16150,16253,84877,84882,84908,95995,128988,129165,129320,129338,129451,129518,151565,151758,151788,151828,151874,151891,210498,210603,210708,210721,210726
AK154356    1110    36.2863680941    GTAG,GTAG,GTAG    AK154356chrX189,151,25,44,102,15,4,11,8,28,12,11,54,3,21,40,114,173,19,154,129,3,181,53,112,99,0,191,342,367,411,513,647,653,666,674,702,715,120143,120199,120207,120229,120269,120384,120558,120578,120733,131163,131170,131354,131408,131521
AK154356    2238    73.161163779    ATAC,CTTT,GTAG,GTAG,GTAG,GTAG,GTAG,ATCA,GTAG    AK154356chr1722,77,88,18,53,299,14,149,24,148,51,6,81,68,57,59,30,103,123,96,7,261,234,277,17,41,106,105,233,174,0,6819,6897,6985,7004,23372,23676,23691,27402,27427,64510,64561,93606,93687,106380,106439,106505,106535,106638,106761,107259,107267,107528,107763,108043,108061,108102,165246,165352,295436
AK154365    1932    85.0352112676    GTAG,GTAG,ATAC    AK154365chr12152,148,11,79,158,145,337,2,8,24,35,35,246,254,157,179,137,134,0,136504,136653,136665,136746,136904,137049,137386,137388,144711,144735,144771,144807,145054,145308,145641,145820,145964
AK154365    1970    86.7077464789    ATAC,GTAG,GTAG    AK154365chr12134,137,178,417,246,35,35,24,8,2,337,145,158,79,11,112,187,0,136,273,628,1046,1293,1329,1364,8533,8541,8543,8880,9025,9184,9264,9276,94145
AK154365    2093    92.1214788732    ACCC,CTAC    AK154365chr1224,276,11,332,64,341,34,38,30,213,32,602,137,134,0,25,302,314,646,719,60885,175711,175750,175781,175994,176027,176630,176777
AK154365    2109    92.8257042254    CTGC    AK154365chr12135,136,340,249,246,35,33,376,396,11,300,0,162,298,638,888,1135,1171,9541,9929,10326,10338
AK154365    2109    92.8257042254    CTGC    AK154365chr12300,11,396,376,33,35,246,249,340,136,135,0,301,313,721,9518,9552,9588,9835,10084,10424,10587
AK154650    2570    99.3428681871    GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG    AK154650chr15797,104,58,91,160,137,158,136,169,158,188,127,237,65,0,1397,2129,2390,3305,4242,6973,7253,7878,8125,9759,12393,13559,20995

Filter:
At column 1, there are groups of lines which have the same name. Inside of this groups, select the name with the highest perceptual score (column 3) and the names that have 10 or less percent of difference with the highest score.
Then, if has been selected more than one name and there are names which only have GTAG, ATAC or GCAG signals, select the one of this which have the best score.

Desired Output

NAMES      Match     %Score          Signals       Path
--------------------------------------------------------------------------------------------------

AK154051    1918    99.4297563504    GTAG,GTAG,GTAG,GTAG    AK154051chr1284,128,208,192,619,696,0,11097,13260,19852,21314,21933
AK154154    378    12.3368146214    ACAC,CTGC,CTAC    AK154154chr10211,17,2,36,1,440,133,0,211,228,230,30134,119467,147210
AK154154    575    18.7663185379    AATG    AK154154chr10416,195,6,320,412,55,247,0,416,612,619,50578,50990,51045
AK154356    2238    73.161163779     ATAC,CTTT,GTAG,GTAG,GTAG,GTAG,GTAG,ATCA,GTAG     AK154356chr1722,77,88,18,53,299,14,149,24,148,51,6,81,68,57,59,30,103,123,96,7,261,234,277,17,41,106,105,233,174,0,6819,6897,6985,7004,23372,23676,23691,27402,27427,64510,64561,93606,93687,106380,106439,106505,106535,106638,106761,107259,107267,107528,107763,108043,108061,108102,165246,165352,295436
AK154365    1970    86.7077464789    ATAC,GTAG,GTAG     AK154365chr12134,137,178,417,246,35,35,24,8,2,337,145,158,79,11,112,187,0,136,273,628,1046,1293,1329,1364,8533,8541,8543,8880,9025,9184,9264,9276,94145
AK154650    2570    99.3428681871    GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG    AK154650chr15797,104,58,91,160,137,158,136,169,158,188,127,237,65,0,1397,2129,2390,3305,4242,6973,7253,7878,8125,9759,12393,13559,20995

Code suggestion are welcome :b:

But, mainly I need to know the easier way to do this, AWK or python or another? Because I have to do many filters and operations like that.

Thanks for your time!!!

I don't know about Python;
I'd recommend awk or perl; perl would be a better choice if a lot of substitutions is to be done.
Also IMHO it's easier to write/maintain complex perl scripts, and perl is easier to debug.
They are both very fast processing line-by-line.

awk has an obnoxious line-length limit on many systems. perl never has any limit except available memory and address space. perl's also available more places than python is, I think, and biochemists have been using it for similar purposes for quite a long time.

So, really it's all about Horses for Courses, sledgehammers and nuts, and all that?

I suppose the real winner is Perl. It's more widely available than Python, and is more powerful than AWK.

So long as people don't start playing golf with it, it's a fine language!

the last step below I didn't understand.

Then, if has been selected more than one name and there are names which only have GTAG, ATAC or GCAG signals, select the one of this which have the best score.

So first, make groups of same name (1st column), then find out the highest score in the group, this entry should be in final result.

2nd step, compare other entries, within the same group, take a look how many entries with the score difference (with highest score) <=10

if the count >=2 (the 2 here doesn't include the highest score entry?) check the signals, entry which has higher score with certain signal wins.

is that correct?

therefore in the final result, for the same name, there would be max two entries. is that right?

and in your output

why AK154650 is not there?
and why

 AK154365    1970    86.7077464789

is there but

AK154365	2109	92.8257042254

not??

i make a python script, it may not exactly match what you need. but would be a good example that python can do it easily.

since I didn't understand your last requirement, the script didn't do the last step, for the same name, the script lists out the highest score entry and the entry with diff <=10 and only has GTAG, ATAC or GCAG signals

to make it simple, assume the column delimiter is "TAB" and the sample text has no header, that is, the first line is the data.

#!/usr/bin/python
def doFilter():
    f = open("t.txt")
    groups = {}
    result = []
    dict = set(['GTAG','ATAC','GCAG'])
    try:
        rows = [x.replace("\n","") for x in f.readlines()]
        for r in rows:
            key = r.split("\t")[0]
            if (not groups.has_key(key)):
                groups[key] = []
            groups[key].append(r)
        keys = groups.keys()
        for k in keys:
            maxl = max(groups[k], key=lambda x: x.split("\t")[2])
            result.append(maxl)
            for l in groups[k]:
                diff = float(l.split("\t")[2]) - float(maxl.split("\t")[2])
                if(diff >0 and diff <= 10 and len(set(l.split("\t")[3].split(","))-dict)==0):
                    result.append(l)
        for i in result:
            print i
                

    finally:
        f.close()

doFilter()



output:

AK154051	1918	99.4297563504	GTAG,GTAG,GTAG,GTAG	AK154051chr1284,128,208,192,619,696,0,11097,13260,19852,21314,21933
AK154154	575	18.7663185379	AATG	AK154154chr10416,195,6,320,412,55,247,0,416,612,619,50578,50990,51045
AK154365	2109	92.8257042254	CTGC	AK154365chr12135,136,340,249,246,35,33,376,396,11,300,0,162,298,638,888,1135,1171,9541,9929,10326,10338
AK154650	2570	99.3428681871	GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG	AK154650chr15797,104,58,91,160,137,158,136,169,158,188,127,237,65,0,1397,2129,2390,3305,4242,6973,7253,7878,8125,9759,12393,13559,20995
AK154356	2238	73.161163779	ATAC,CTTT,GTAG,GTAG,GTAG,GTAG,GTAG,ATCA,GTAG	AK154356chr1722,77,88,18,53,299,14,149,24,148,51,6,81,68,57,59,30,103,123,96,7,261,234,277,17,41,106,105,233,174,0,6819,6897,6985,7004,23372,23676,23691,27402,27427,64510,64561,93606,93687,106380,106439,106505,106535,106638,106761,107259,107267,107528,107763,108043,108061,108102,165246,165352,295436

Thanks sk1418!

Your code will guide me a lot!

I didn't explain the filter in simple words, sorry about that, but it isn't easy.

It's beacuse

AK154365    2109    92.8257042254

has non GTAG, ATAC or GCAG signal, instead has a CTGC signal.

If the score isn't 10 percent higher, I want to selects those names which have GTAG, ATAC or GCAG signals.

Thanks you anyway! :b: