Help in awk/bash

bioinfo · January 4, 2013, 1:08am

Hi, I have two files: atom.txt and g.txt
atom.txt has multiple patterns but I am showing only two patterns each ending with ENDMDL:

ATOM 1 N SER A 1 35.092 83.194 140.076 1.00 0.00 N 
ATOM 2 CA SER A 1 35.216 83.725 138.725 1.00 0.00 C 
TER
ENDMDL
ATOM 1 N SER A 1 35.683 81.326 139.778 1.00 0.00 N 
ATOM 2 CA SER A 1 35.422 82.736 139.929 1.00 0.00 C 
TER
ENDMDL

g.txt

Group   Centre      Branches              Id_of_Branches
 10       051          30            003, 007, 051, 034, .................. (30 values)   
 72       183          26            100,................................    
394       600          23             ...................................    
391       641          20             .....................................

For each value in Id_of_Branches in g.txt, I wish to retrieve the corresponding pattern from atom.txt.
So I need 4 output files, one for each of the 4 groups, plus a 5th file for the patterns corresponding to the Centre Ids:

(1) G10.txt
#Id 003
ATOM 1 N SER A 1 35.092 83.194 140.076 1.00 0.00 N 
ATOM 2 CA SER A 1 35.216 83.725 138.725 1.00 0.00 C 
TER
ENDMDL
#Id 007
ATOM 1 N SER A 1 35.092 83.194 140.076 1.00 0.00 N 
ATOM 2 CA SER A 1 35.216 83.725 138.725 1.00 0.00 C 
TER
ENDMDL
.....................
.....................

(2)G72.txt
#Id 100
ATOM 1 N SER A 1 37.092 83.194 140.076 1.00 0.00 N 
ATOM 2 CA SER A 1 37.216 83.725 138.725 1.00 0.00 C 
TER
ENDMDL
.....................
.....................

(3)G394.txt
..................
(4)G391.txt
...................
(5) Centre.txt
#Id 051
ATOM 1 N SER A 1 35.092 83.194 140.076 1.00 0.00 N 
ATOM 2 CA SER A 1 35.216 83.725 138.725 1.00 0.00 C 
TER
ENDMDL
.....................
.....................
.....................

Thanks

RudiC · January 4, 2013, 5:41am

Not clear. What do you want to do to which pattern, based on which rule or selection? I can only infer from your sample data that you want a .txt file for each group containing the id# and pattern #1, plus one for the centre containing ?# and pattern #1.
Please specify.

Don_Cragun · January 4, 2013, 7:04am

Note that this is the third thread you have started titled "Help in awk/bash". More descriptive titles would help readers find the right thread.

In the last few posts in your last thread: Help in awk/bash, you said that one of the input files for this project was the file 11.txt, which you said contained 10,000 entries. That seems to be the file atom.txt in this thread. You show that all IDs (which seem to be indices into that file) have three-digit values (with leading zero fill). Is atom.txt limited to fewer than 1,000 entries, or do some IDs have more than three digits?
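To make the indexing question concrete, here is a minimal sketch, assuming an ID is simply the 1-based position of an ENDMDL-terminated entry in atom.txt. That assumption, the temporary directory, and the variable name `want` are illustrative and not confirmed by the original poster.

```shell
# Assumption: ID nnn refers to the nnn-th ENDMDL-terminated entry in atom.txt.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Two sample entries copied from the first post.
cat > atom.txt <<'EOF'
ATOM 1 N SER A 1 35.092 83.194 140.076 1.00 0.00 N
ATOM 2 CA SER A 1 35.216 83.725 138.725 1.00 0.00 C
TER
ENDMDL
ATOM 1 N SER A 1 35.683 81.326 139.778 1.00 0.00 N
ATOM 2 CA SER A 1 35.422 82.736 139.929 1.00 0.00 C
TER
ENDMDL
EOF

# n counts entries; "want + 0" turns the zero-filled ID (002) into the number 2.
entry=$(awk -v want=002 '
BEGIN { n = 1 }
n == want + 0 { print }        # print every line of the wanted entry
$0 == "ENDMDL" { n++ }         # an ENDMDL line closes the current entry
' atom.txt)
printf '%s\n' "$entry"
```

With this sample, the command prints the four lines of the second entry. If IDs can exceed three digits, only the zero-fill width changes; the counting logic stays the same.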

You have way too many occurrences of ......... in your posting for us to determine what you want. In your description, you say:

Group   Centre      Branches              Id_of_Branches
 10       051          30            003, 007, 051, 034, .................. (30 values)   
 72       183          26            100,................................    
394       600          23             ...................................    
391       641          20             .....................................

What does "(30 values)" mean? Are there 30 fields (some with multiple spaces as separators, some with comma-space as separators [or terminators]) in every line? Are there 33 fields on every line (one for Group ID, one for Centre ID, one for Branches ID, and one for each of 30 branches)? Are there 3 + value_of_3rd_field fields? Give us real values at least for these 4 lines instead of making us guess what .......... means! Did you add commas between or after some fields just to make it harder to process the input?
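Why the separator question matters can be shown with a small sketch. The two sample lines are hypothetical (a shortened list of 4 branch IDs rather than the 30 mentioned above), and `count_fields` is just a helper name used here.

```shell
count_fields() {
    # Print awk's field count before and after turning commas into spaces.
    printf '%s\n' "$1" | awk '{ before = NF; gsub(/,/, " "); print before, NF }'
}

# Hypothetical g.txt lines: same data, different separators between branch IDs.
comma_space=$(count_fields ' 10       051          4            003, 007, 051, 034')
comma_only=$(count_fields ' 10       051          4            003,007,051,034')

echo "comma-space: $comma_space"
echo "comma-only:  $comma_only"
```

With default field splitting, the comma-space line already has 7 fields (the commas just stick to the IDs), while the comma-only line has 4 because the whole ID list is one field. After the commas are replaced by spaces, both lines have 3 + 4 = 7 fields.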

You say you want 5 output files. Does that mean that g.txt will always contain 4 data lines (plus the line of headings)?

You say: "...5th file for patterns (plural) corresponding to Id (singular) from Centre...". Does this mean that Centre.txt is supposed to contain all 120 (or 99, or ???) entries that will be stored in the four Gxxx files? Does it just contain Id051 as shown? Or, does it contain one entry for each data line in g.txt ?

I supplied several awk scripts with detailed explanations of how those scripts worked in your last thread on this subject (see link above). Can you show us the awk script you're writing to solve this problem? Or are you expecting us to figure out what you want done and do it for you? The purpose of The UNIX and Linux Forums is to help you learn how to write your own scripts; not to act as a place where you can get people to do your design and implementation work for you for free.

bioinfo · January 4, 2013, 10:16am

I am very thankful to you, Don Cragun, for helping me write scripts and for explaining them as well.
I am very new to shell scripting, and I cannot rely on another programming language because I am not an expert in any language. I have started reading shell scripting books, but it's difficult for me to figure out what to write in a script. Sometimes I don't know which functions or commands are available in shell scripting. But when you write a script, I learn a lot of things and I try to read it.
I know I should write my own script and post it here for help, but sometimes I can't even guess how to start. I have found this forum to be my best guide on the internet.

I am adding more information and real values for the last post.

atom.txt has fewer than 1000 entries, so Ids don't have more than 3 digits.
Group 10 has 30 comma-separated values, Group 72 has 26 values, and so on; Id_of_Branches lists that many values for each group.
There are not 33 fields on every line (one for Group ID, one for Centre ID, one for Branches ID, and one for each of 30 branches). There are not 3 + value_of_3rd_field fields either.

I have two files, and I combined them into one file, g.txt (using a comma separator for Id_of_Branches). The two files are:

First file:

Group: 0 Number of Branches: 1
0    001
Centre: 001 Branches: 1
Group: 1 Number of Branches: 1
0    002
Centre: 002 Branches: 1
Group: 2 Number of Branches: 1
0    003
Centre: 003 Branches: 1
Group: 3 Number of Branches: 6
0    009
1    004
2    008
3    007
4    005
5    006
Centre: 006 Branches: 6
Group: 4 Number of Branches: 2
0    010
1    011
Centre: 010 Branches: 2
Group: 5 Number of Branches: 2
0    012
1    013
Centre: 012 Branches: 2
Up to more than 600 groups


Second file:

Group No:
 10        Centre: 052 Branches: 31                   
 73        Centre: 184 Branches: 25                   
397        Centre: 607 Branches: 23                   
398        Centre: 640 Branches: 22                   
 86        Centre: 245 Branches: 19                   
 71        Centre: 167 Branches: 12                   
 78        Centre: 220 Branches: 11                  
 18        Centre: 084 Branches: 10                   
 09        Centre: 022 Branches: 10                   
400        Centre: 650 Branches: 9

I wish to have 10 files for the 10 groups (as per the second file), each with the patterns corresponding to the Id_of_Branches (from the first file) in that group.
Centre.txt (only one file) is supposed to contain the patterns corresponding to the Centre Id from each group.

Thanks.

jim_mcnamara · January 4, 2013, 11:42am

Rather than reading shell books, consider some awk tutorials. A lot of bioinformatics folks come here for help. 95% of their problems are resolved by awk. awk is a language on its own.

This is a great resource. Gawk is GNU awk, which is very probably what you have when you enter the word awk on the screen.
It has examples, explains the bizarre syntax, and program structure:

bioinfo · January 4, 2013, 3:27pm

Thanks, Jim McNamara.
It's great.

RudiC · January 5, 2013, 9:15am

As much as I want to help, I am sorry to say I can't. Thank you for the effort of explaining your input in detail, but post #4 does not relate to post #1 in any way. E.g. group No. 10 is centred at 052 here and 051 there and has 31 branches here and 30 there, groups show up here that do not show up there and vice versa, and groups in file 2 are not represented in file 1.
On top of that, I still can't see what pattern to fill in (see my post #2), where to get it, or based on what rule, even if I take file g.txt to be a distilled version of file 1 and file 2.
It would be helpful if you posted a minimum number of input files (e.g. atom.txt and g.txt) with interrelated data, an output file, and a set of understandable rules on how to get from one to the other.

bioinfo · January 5, 2013, 3:58pm

I have uploaded a part of first file and full second file. I have posted real values for second file, but first file is very big.

Don_Cragun · January 5, 2013, 4:45pm

You said you had two files: atom.txt and g.txt. I am assuming that atom.txt is in the same format as 11.txt in your last thread with the same title as this thread. You have not given us anything that includes even a single complete line (after the header line) from the file g.txt. And, you have not shown us what you want to appear in G10.txt, and any other G*.txt file that we can match against what you have shown us from atom.txt.

With the data you gave us in message #4 in this thread, the First file gives us an indication of what might appear in g.txt for groups 0 through 5, but none of them are listed in g.txt in message #1 nor in Second file in message #4 in this thread.

If you don't give us coherent sample data together with sample output that matches that data, it is EXTREMELY hard to figure out what you want. I think I'm close to figuring out what you want done and expect to post something later this afternoon. But, I have no confidence that it will be at all close to what you want because the specification of what you want is so vague. And, you haven't given us sample input and output that we can use to determine if a possible solution we might develop does what you want done.

bioinfo · January 5, 2013, 5:38pm

Yes, atom.txt is the same as 11.txt. When posting in the new thread I just used a new name. I am explaining my problem again with more details and concise data. I have two files: atom.txt (or 11.txt from the other thread) and g.txt (which I made using data from the raw files file 1 and file 2). If you feel that it will be easier to retrieve data directly from file 1 and file 2 rather than using g.txt for retrieving patterns from atom.txt, I will be happy to go for that too.

g.txt (made it more concise and short; in reality I have 10 groups for this file out of more than 600 groups from file 1. Based on decreasing number of branches they are grouped into 10 groups in g.txt but I am showing only 2 here)

Group   Centre      Branches       Id_of_Branches
 3       006          6         009,004,008,007,005,006
 5       012          2         012,013

file 1:

Group: 0 Number of Branches: 1
0    001
Centre: 001 Branches: 1
Group: 1 Number of Branches: 1
0    002
Centre: 002 Branches: 1
Group: 2 Number of Branches: 1
0    003
Centre: 003 Branches: 1
Group: 3 Number of Branches: 6
0    009
1    004
2    008
3    007
4    005
5    006
Centre: 006 Branches: 6
Group: 4 Number of Branches: 2
0    010
1    011
Centre: 010 Branches: 2
Group: 5 Number of Branches: 2
0    012
1    013
Centre: 012 Branches: 2
Up to more than 600 groups

file2:

Group No:
 3        Centre: 006 Branches: 6                   
 5        Centre: 012 Branches: 2
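The relationship between these three listings can be sketched in awk. This is only an illustration of one way g.txt could be derived from file 1 and file 2, not necessarily how it was actually built; the file names `file1.txt` and `file2.txt` and the single-space output layout are assumptions.

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Excerpt of file 1 (groups 3, 4 and 5 from the listing above).
cat > file1.txt <<'EOF'
Group: 3 Number of Branches: 6
0    009
1    004
2    008
3    007
4    005
5    006
Centre: 006 Branches: 6
Group: 4 Number of Branches: 2
0    010
1    011
Centre: 010 Branches: 2
Group: 5 Number of Branches: 2
0    012
1    013
Centre: 012 Branches: 2
EOF

cat > file2.txt <<'EOF'
Group No:
 3        Centre: 006 Branches: 6
 5        Centre: 012 Branches: 2
EOF

awk '
BEGIN { print "Group Centre Branches Id_of_Branches" }
FNR == NR {
    # First file on the command line (file2.txt): remember the selected
    # groups, skipping the "Group No:" header and any blank lines.
    if (FNR > 1 && NF) keep[$1 + 0] = 1
    next
}
$1 == "Group:"  { g = $2 + 0; ids = ""; next }     # start of a group in file 1
$1 == "Centre:" {                                  # end of a group in file 1
    if (g in keep) print g, $2, $4, ids
    next
}
NF == 2 { ids = (ids == "" ? $2 : ids "," $2) }    # "index  branch-id" line
' file2.txt file1.txt > g.txt
cat g.txt
```

Note that the rows come out in file 1 order. If the order from file 2 (decreasing number of branches) matters, the lines would need to be sorted afterwards.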

Required output:
For each value in Id_of_Branches in g.txt, I wish to retrieve the corresponding pattern from atom.txt.
Therefore, for this sample data, I need 3 output files: 2 files containing the patterns for all IDs in the 2 groups, and a 3rd file with the patterns corresponding to the Centre Id of every group:

(1) g3.txt
#009
ATOM 1 N SER A 1 35.092 83.194 140.076 1.00 0.00 N 
ATOM 2 CA SER A 1 35.216 83.725 138.725 1.00 0.00 C 
TER
ENDMDL
#004
ATOM 1 N SER A 1 34.092 83.194 140.076 1.00 0.00 N 
ATOM 2 CA SER A 1 35.216 83.725 138.725 1.00 0.00 C 
TER
ENDMDL
#008
ATOM 1 N SER A 1 45.092 83.194 140.076 1.00 0.00 N 
ATOM 2 CA SER A 1 35.216 83.725 138.725 1.00 0.00 C 
TER
ENDMDL
#007
ATOM 1 N SER A 1 50.092 83.194 140.076 1.00 0.00 N 
ATOM 2 CA SER A 1 65.216 83.725 138.725 1.00 0.00 C 
TER
ENDMDL
#005
ATOM 1 N SER A 1 90.092 83.194 140.076 1.00 0.00 N 
ATOM 2 CA SER A 1 89.216 83.725 138.725 1.00 0.00 C 
TER
ENDMDL
#006
ATOM 1 N SER A 1 67.092 83.194 140.076 1.00 0.00 N 
ATOM 2 CA SER A 1 23.216 83.725 138.725 1.00 0.00 C 
TER
ENDMDL

(2)g5.txt
#012
ATOM 1 N SER A 1 37.092 83.194 140.076 1.00 0.00 N 
ATOM 2 CA SER A 1 37.216 83.725 138.725 1.00 0.00 C 
TER
ENDMDL
#013
ATOM 1 N SER A 1 40.092 83.194 140.076 1.00 0.00 N 
ATOM 2 CA SER A 1 31.216 83.725 138.725 1.00 0.00 C 
TER
ENDMDL

(3) Centre.txt (For Id from centre of all groups)
#006
ATOM 1 N SER A 1 67.092 83.194 140.076 1.00 0.00 N 
ATOM 2 CA SER A 1 23.216 83.725 138.725 1.00 0.00 C 
TER
ENDMDL
#012
ATOM 1 N SER A 1 37.092 83.194 140.076 1.00 0.00 N 
ATOM 2 CA SER A 1 37.216 83.725 138.725 1.00 0.00 C 
TER
ENDMDL

Hope I am able to make my problem more clear.

Don_Cragun · January 5, 2013, 8:13pm

Hi bioinfo. The awk script I had been testing based on your earlier messages didn't work with the new details you provided in message #10 in this thread. (The output filenames changed from Gx to gx where x is a one to three digit string, the list of branches changed from comma and space separators to just comma separators, and I was guessing completely wrong about what you wanted in one of the output files.) I think the script below does what you want. It is LONG, but the vast majority of it is just comments. Hopefully it will help you figure out how it works:

awk '
# All data is assumed to meet the requirements stated below, so this script
# does not perform any data verification.  If any data fails to meet these
# assumptions, results are unspecified.
BEGIN {
    # Initialize variables that do not have default values set by awk.
    cf = "Centre.txt"
    rc = "001"
}
FNR == NR {
    # Process lines from atom.txt.  Assumed format is that each entry in this
    # file is a multiple line value with the final line of each entry matching
    # the ERE "^ENDMDL$".  Entries from this file are stored in array r with
    # the index being the entry number (starting with 001).  The variable rc is
    # the index for the value being accumulated.  I use a 3 digit string with
    # leading zero fill to match the format of the Centre-ID and Branch-ID
    # values that will be found in g.txt.
    r[rc] = r[rc] $0 "\n"
    if($0 == "ENDMDL")
        # End of entry found.  Set rc for the next entry to be processed.
        rc = sprintf("%03d", rc + 1)
    next
}
FNR == 1 {
    # Skip the header line on subsequent file(s).  The file g.txt is assumed to
    # be the first such file.  Any number of other files in the same format can
    # be used in addition to or instead of g.txt.
    next
}
{   # Process lines from subsequent files.  Assumed format is:
    #   Group   Centre      Branches              Id_of_Branches
    #   gid     cid         bcnt         bid[1],bid[2],...,bid[bcnt]
    # where gid is a 1-3 digit Group-ID, cid is a 3 digit (zero filled)
    # Centre-ID, bcnt is a count of the number of Branch-IDs to follow, and
    # each bid field is a 3 digit (zero filled) Branch-ID.  The header line
    # has already been discarded.  Commas will be converted to spaces so bid
    # values can be used directly.  It is assumed that each line contains
    # $3 + 3 fields.
    #
    # Create a file named gx.txt (where x is the Group-ID from this line):
    #   Note that it would seem logical to expand x to a 3 digit zero filled
    #   value so the created g* files would sort into Group-ID order, but that
    #   is not what was requested.
    #   One entry from atom.txt (with the entry number determined by the
    #   Branch-ID) will be written to this file for each Branch-ID on this
    #   line.
    #
    # Also create a file named Centre.txt that will contain one entry from
    #   atom.txt (with the entry number determined by the Centre-ID) for each
    #   line processed.
    #   Note: I assume that a Centre-ID is also a Branch-ID and that the value
    #   given as the cid should also appear as one of the Branch-IDs appearing
    #   on each line.
    #
    # Replace commas on input lines with spaces so the Branch-IDs can be used
    # directly without splitting $4 into another array and processing it in a
    # different loop (besides that some descriptions of this input file say
    # elements are comma separated and other say comma-space separated or
    # terminated; this works either way):
    gsub(/,/, " ")
    # Create the g*.txt file for this line.  Uncomment one of the following two
    # lines.  The 1st line provides requested names, the 2nd line creates names
    # that will sort correctly by Group-ID when looking at output by ls and
    # when having the shell match the patterns g*.txt and g???.txt and groups
    # in the list do not all contain the same number of digits.
    gf = "g" $1 ".txt"
    #gf = sprintf("g%03d.txt", $1)
    for(i = 4; i <= NF; i++) printf("#Id %s\n%s", $i, r[$i]) > gf
    close(gf)
    # Add entry to Centre.txt:
    printf("#Id %s\n%s", $2, r[$2]) > cf
}' atom.txt g.txt

As always, if you're running on a Solaris system, use /usr/xpg4/bin/awk or nawk instead of awk .
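Because the script above deliberately skips data verification, a separate pre-check can be useful. The following is a minimal sketch, assuming the g.txt layout described in the script's comments ($3 + 3 fields once commas become spaces) and assuming every ID must fall between 1 and the number of ENDMDL lines in atom.txt. The sample data and the message wording are made up for illustration.

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Tiny sample: atom.txt holds only two entries.
cat > atom.txt <<'EOF'
ATOM 1 N SER A 1 35.092 83.194 140.076 1.00 0.00 N
TER
ENDMDL
ATOM 1 N SER A 1 35.683 81.326 139.778 1.00 0.00 N
TER
ENDMDL
EOF

# Group 2 is deliberately broken: wrong branch count and an unknown ID.
cat > g.txt <<'EOF'
Group   Centre      Branches       Id_of_Branches
 1       001          2         001,002
 2       002          3         002,003
EOF

problems=$(awk '
FNR == NR { if ($0 == "ENDMDL") total++; next }   # count entries in atom.txt
FNR == 1  { next }                                # skip the g.txt header
{
    gsub(/,/, " ")                                # same trick as the main script
    if (NF != $3 + 3)
        printf("group %s: expected %d branch IDs, found %d\n", $1, $3, NF - 3)
    for (i = 2; i <= NF; i++) {
        if (i == 3) continue                      # field 3 is a count, not an ID
        if ($i + 0 < 1 || $i + 0 > total)
            printf("group %s: ID %s has no entry in atom.txt\n", $1, $i)
    }
}' atom.txt g.txt)
printf '%s\n' "$problems"
```

On this sample the check reports two problems for group 2 and none for group 1. Running a check like this before the main script avoids silently writing `#Id` headers with empty entries.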

bioinfo · January 5, 2013, 11:33pm

Thanks. I will try it and let you know.

---------- Post updated at 11:33 PM ---------- Previous update was at 08:18 PM ----------

Yippee. It's working.
Thanks a lot. You are a GENIUS