Help in awk/bash

Corona688 · January 1, 2013, 4:26pm

grep -v Following < inputfile > outputfile
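
As a rough illustration of what this command does: grep -v prints only the lines that do not match the pattern, so every line containing the word Following is dropped. The sample lines below are invented for the sketch and are not the real data from this thread.

```shell
# Hypothetical input: two marker lines containing "Following" and two data lines.
printf '%s\n' 'Following entry 1:' 'ATOM 1 N SER' \
              'Following entry 2:' 'ATOM 2 CA SER' > inputfile

# -v inverts the match: keep only lines that do NOT contain "Following".
grep -v Following < inputfile > outputfile

cat outputfile
```

Note that the match is case-sensitive, so a marker written as "following" would need grep -vi instead.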

bioinfo · January 1, 2013, 4:52pm

Thanks.

How should I go about learning shell scripting and awk programming better? Can you recommend a book?

Thanks again.

Don_Cragun · January 2, 2013, 12:25am

In addition to the grep Corona688 provided, you could also add another output file to the awk script I provided, or add an option to the script to control whether or not marker lines should be included in the tro.txt output file, or just always leave out the markers in the tro.txt output file.

bioinfo · January 3, 2013, 9:59am

Hi,
I have two files:

11.txt showing two patterns:

ATOM 1 N SER A 1 35.092 83.194 140.076 1.00 0.00 N 
ATOM 2 CA SER A 1 35.216 83.725 138.725 1.00 0.00 C 
TER
ENDMDL
ATOM 1 N SER A 1 35.683 81.326 139.778 1.00 0.00 N 
ATOM 2 CA SER A 1 35.422 82.736 139.929 1.00 0.00 C 
TER
ENDMDL

c.txt

Number of groups: 40  3.95
Group: 0 Branches: 1
0    001
Centre: 001 Nodes: 1
Group: 1 Branches: 1
0    002
Centre: 002 Nodes: 1
Group: 2 Branches: 6
0    009
1    004
2    008
3    007
4    005
5    006
Centre: 006 Nodes: 6

ENDMDL occurs many times in 11.txt. I wish to retrieve the pattern that corresponds to the value of Id. That is, if I give 004 (an Id from group 2) as input, it should output the fourth repeat from 11.txt, ending with ENDMDL.

Id004.txt

Group2: Id 004
ATOM 1 N SER A 1 35.092 83.194 140.076 1.00 0.00 N 
ATOM 2 CA SER A 1 35.216 83.725 138.725 1.00 0.00 C 
TER
ENDMDL

So, for each value of Id from c.txt, I want to retrieve the repeat with that number from 11.txt.

Please guide me on how I can retrieve the repeat from 11.txt whose number corresponds to the value of Id from c.txt.
Also, I wish to write these patterns to individual files based on their Id, group, and centre. For example:
group0.txt contains all patterns with Id
group1.txt contains all patterns with Id
group2.txt contains all patterns with Id
One file containing patterns with corresponding to centre ID

Id001.txt
Id002.txt
Id009.txt
............
............

Thanks

Don_Cragun · January 3, 2013, 11:32pm

This is the third or fourth problem you have posted to this thread. Reading through the thread it is getting hard to determine which problem is being addressed by some of the comments.

I have shown you how to read 11.txt, accumulate the entries in it for each set of lines ending with an ENDMDL line, and print selected entries from the accumulated list. You know what files you want to create and what you want in them, so why don't you try putting together an awk script to do that and let us know what isn't working.
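
A minimal sketch of that accumulate-and-select approach follows. It is not the earlier script from this thread: the variable name id and the output name Id002.txt are assumptions made for illustration, and the sample 11.txt is shortened from the post above.

```shell
# Sample 11.txt with two ENDMDL-terminated entries.
cat > 11.txt <<'EOF'
ATOM 1 N SER A 1 35.092 83.194 140.076 1.00 0.00 N
ATOM 2 CA SER A 1 35.216 83.725 138.725 1.00 0.00 C
TER
ENDMDL
ATOM 1 N SER A 1 35.683 81.326 139.778 1.00 0.00 N
ATOM 2 CA SER A 1 35.422 82.736 139.929 1.00 0.00 C
TER
ENDMDL
EOF

# id (hypothetical) is the number of the entry to print.
# id + 0 turns a zero-padded Id such as 002 into the number 2.
awk -v id="002" 'BEGIN { rc = 1 }
{   r[rc] = r[rc] $0 "\n"       # append this line to the current entry
    if ($0 == "ENDMDL") rc++    # an ENDMDL line closes the entry
}
END { printf("%s", r[id + 0]) }' 11.txt > Id002.txt

cat Id002.txt
```

If id is larger than the number of entries in 11.txt, r[id + 0] is empty and nothing is printed.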

From your description of groups, centres, and IDs, I have no idea how many files you want created nor what is supposed to be in each of them. I also don't see any use for the lines starting with Centre: in your c.txt file; they just have the characters Centre: followed by the Id of the last Branch in the Group that they follow, followed by the characters Nodes:, followed by the number of branches listed on the preceding Group: line. What is the difference between a Node and a Branch? What is the difference between a Group and a Centre?

If you can't do this awk script yourself, you're going to have to give us a lot more detail specifying the exact list of the files you want produced in response to the snippet from c.txt you provided, along with the data that you want written into those files.

bioinfo · January 4, 2013, 12:35am

Thanks
I will post it in a new thread with more detail.

bioinfo · January 7, 2013, 2:57pm

Hi,
The script in post #15 is working great.
I have two questions related to it.

(1) What if I only want the patterns from 11.txt selected by rows whose field 1 is divisible by 100 (that is, the file that gets no entry if $1%100 != 0), so only the file no.txt?
(2) Also, is it possible to number the rows (those whose 1st field is divisible by 100 and which are used for retrieving patterns from 11.txt) and also to number the patterns retrieved from 11.txt?

Shall I use the following code for (1):

no=${1:-no.txt}         # name of file for no entry if $1%100 != 0
awk -v no="$no" 'BEGIN {rc = 1}
FNR == NR {r[rc] = r[rc] $0 "\n"
    if($0 == "ENDMDL") rc++
    next}
{   # If we got to here, we are reading lines from the 2nd file.
    # Determine exact, truncated, and rounded entry numbers.
    if (substr($1, length($1) - 5) == "00.000") {
        # $1 ends in 00.000; no truncation or rounding needed.
        entry = substr($1, 1, length($1) - 6)
        round = trunc = 0
    } else {
	# $1 is not evenly divisible by 100; calculate rounded and truncated
        # values.
        entry = 0
        round = sprintf("%.0f", $1 / 100)
        trunc = substr($1, 1, length($1) - 6)
    }
          # Write the appropriate entry
        # to each output file.
        printf("%s", r[entry]) > no
       } 
    }'
11.txt o.txt

Thanks.

Don_Cragun · January 7, 2013, 10:03pm

No. I assume that you tried running this awk script and got an error saying that your opening "{"s didn't match your "}"s. Since you moved the filenames to be processed to a line of their own, if the awk script had run it would have tried to read its input from standard input (not from 11.txt and o.txt). And, instead of skipping over lines that had $1 that did not end in 00.000, it would have written an entry for the 0th element in 11.txt. In this case you would get what you want since r[0] is an empty string and writing it to the file no wouldn't have done anything.

A corrected and simplified version of this script would be something like:

awk -v no="no.txt" 'BEGIN {rc = 1}
# 1st file (11.txt): accumulate the lines of each entry; ENDMDL ends an entry.
FNR == NR {r[rc] = r[rc] $0 "\n"
    if($0 == "ENDMDL") rc++
    next}
{   # If we got to here, we are reading lines from the 2nd file (o.txt).
    if (substr($1, length($1) - 5) == "00.000") {
        # $1 ends in 00.000; write an entry corresponding to this line.
        # The entry number is $1 with the trailing 00.000 removed.
        entry = substr($1, 1, length($1) - 6)

        # Write the selected entry to the output file.
        printf("%s", r[entry]) > no
    }
}' 11.txt o.txt

Yes, it is possible to number entries from 11.txt and to number rows from o.txt, but you'll have to specify what you mean by that by showing the exact output that you want to appear in no.txt when using your 11.txt and the following instead of your version of o.txt:

100.000
2010.000
1000.000
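
For these three values, the following sketch shows which rows pass the 00.000 test used in the script above and which entry number each one maps to. The output wording and the file name check.txt are assumptions made for illustration.

```shell
# Feed the three sample values through the same substr() test as the script.
printf '%s\n' 100.000 2010.000 1000.000 |
awk '{
    if (substr($1, length($1) - 5) == "00.000")
        # The last 6 characters are 00.000; the rest is the entry number.
        printf("%s -> entry %s\n", $1, substr($1, 1, length($1) - 6))
    else
        printf("%s -> skipped\n", $1)
}' | tee check.txt
# 100.000 -> entry 1
# 2010.000 -> skipped
# 1000.000 -> entry 10
```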

If you're talking about adding a tag line to the output specifying the entry # from 11.txt and the line number from o.txt, you have seen examples of how to produce tag lines in earlier scripts I have provided (including the script you stripped down to produce the script above). The entry number from 11.txt being printed is specified by the variable entry, and the line number from o.txt producing an output line is specified by the variable FNR.

One way to add a tag doing this would be to change the last printf in the above script from:

        printf("%s", r[entry]) > no

to:

        printf("The following entry from line %d is for Branch %d:\n%s",
            FNR, entry, r[entry]) > no
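
To see the tag line in action, here is a small self-contained run of the script with that printf. The contents of 11.txt and o.txt are shortened, made-up samples, and the if test is written as an awk pattern, which behaves the same way.

```shell
# Hypothetical sample data: two entries in 11.txt, three rows in o.txt.
printf '%s\n' 'ATOM 1' 'ENDMDL' 'ATOM 2' 'ENDMDL' > 11.txt
printf '%s\n' '200.000' '150.000' '100.000' > o.txt

awk -v no="no.txt" 'BEGIN {rc = 1}
FNR == NR {r[rc] = r[rc] $0 "\n"
    if($0 == "ENDMDL") rc++
    next}
substr($1, length($1) - 5) == "00.000" {
    entry = substr($1, 1, length($1) - 6)
    # FNR is the current line number within o.txt.
    printf("The following entry from line %d is for Branch %d:\n%s",
        FNR, entry, r[entry]) > no
}' 11.txt o.txt

cat no.txt
# The following entry from line 1 is for Branch 2:
# ATOM 2
# ENDMDL
# The following entry from line 3 is for Branch 1:
# ATOM 1
# ENDMDL
```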

If you want each line of output in no.txt to include the Branch #, that is also easy to do, but it changes the code where entries are accumulated from 11.txt instead of changing the printf at the end of the script. If you want each line of output in no.txt to include the Branch # and the line # from o.txt, that can also be done, but it will involve changing the way the script accumulates and prints entries from 11.txt.
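
One possible way to put the Branch # on every output line is to add it while the entries are being accumulated, as in the sketch below. The "Branch N: " prefix format, the output name no_tagged.txt, and the sample data are assumptions made for illustration.

```shell
# Hypothetical sample data: two entries in 11.txt, one selecting row in o.txt.
printf '%s\n' 'ATOM 1' 'ENDMDL' 'ATOM 2' 'ENDMDL' > 11.txt
printf '%s\n' '200.000' > o.txt

awk -v no="no_tagged.txt" 'BEGIN {rc = 1}
FNR == NR {
    # Prefix every stored line with the number of the entry it belongs to.
    r[rc] = r[rc] "Branch " rc ": " $0 "\n"
    if ($0 == "ENDMDL") rc++
    next}
substr($1, length($1) - 5) == "00.000" {
    entry = substr($1, 1, length($1) - 6)
    printf("%s", r[entry]) > no
}' 11.txt o.txt

cat no_tagged.txt
# Branch 2: ATOM 2
# Branch 2: ENDMDL
```

Adding the line # from o.txt as well cannot be done this way, because that number is not known until the second file is read.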

bioinfo · January 7, 2013, 10:43pm

Thanks.
I will try and let you know.

bioinfo · January 9, 2013, 1:27pm

It's working:

 printf("The following entry from line %d is for Branch %d:\n%s",
            FNR, entry, r[entry]) > no

But what if I want to print the full line as well as the branch? I also want a serial number.
Required output:

(001) The following entry from entry 5 "print full line here" is for branch 2711:
# Branch 2711 is printed here
(002) The following entry from entry 9 "print full line here" is for branch 2716:
# Branch 2716 is printed here
(003) The following entry from entry 13 "print full line here" is for branch 2916:
# Branch 2916 is printed here

Then, using another file (2.txt, which has one column of serial numbers), I wish to retrieve the branches from the above output that correspond to the values in 2.txt. For example, I want to retrieve 002 from the above output:
Required output:

(002) The following entry from entry 9 "print full line here" is for branch 2716:
# Branch 2716 is printed here

Please guide.
Thanks

Don_Cragun · January 9, 2013, 4:31pm

With all of the examples I've provided you in both of the active threads you started titled "Help in awk/bash", you should be able to replace the awk printf statement:

 printf("The following entry from line %d is for Branch %d:\n%s",
            FNR, entry, r[entry]) > no

with one that will produce the output you want.

You know how to create a variable to count the number of lines you've written (e.g., outcnt), you know how to increment that variable before retrieving its value (++outcnt), you know how to use a printf format specifier to print a value as a 3 digit decimal value with leading zero fill (%03d), you know that in awk $0 is the contents of the current line, and you know how to use a printf format specifier to print a variable as a string (%s).

The only thing you might be missing is how to print a double quote character in a printf format string (since you want the full line to be printed between double quote characters). You do that by escaping each double quote you want to print with a backslash character. An example doing that is:
printf("Print a \"%s\" string\n", "quoted")
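
Putting those pieces together, one possible version of the printf is sketched below. The thread never shows the final code, so this is only an assumption about how it could look; the output name numbered.txt and the sample data are made up.

```shell
# Hypothetical sample data: two entries in 11.txt, three rows in o.txt.
printf '%s\n' 'ATOM 1' 'ENDMDL' 'ATOM 2' 'ENDMDL' > 11.txt
printf '%s\n' '200.000 first row' '150.000 ignored' '100.000 third row' > o.txt

awk -v no="numbered.txt" 'BEGIN {rc = 1}
FNR == NR {r[rc] = r[rc] $0 "\n"
    if($0 == "ENDMDL") rc++
    next}
substr($1, length($1) - 5) == "00.000" {
    entry = substr($1, 1, length($1) - 6)
    # ++outcnt increments the serial number before it is printed,
    # %03d zero-fills it to 3 digits, and \" prints a literal double quote.
    printf("(%03d) The following entry from entry %d \"%s\" is for branch %d:\n%s",
        ++outcnt, FNR, $0, entry, r[entry]) > no
}' 11.txt o.txt

cat numbered.txt
# (001) The following entry from entry 1 "200.000 first row" is for branch 2:
# ATOM 2
# ENDMDL
# (002) The following entry from entry 3 "100.000 third row" is for branch 1:
# ATOM 1
# ENDMDL
```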

Please show me that all of the time I've put into providing samples for you is helping you learn how to use awk by trying this one on your own and then showing us what you've done!

bioinfo · January 10, 2013, 11:39am

Thanks.
It's working. I modified the way of printing and got the required output.

Don_Cragun · January 10, 2013, 12:47pm

Great.

Have you figured out how to use awk, sed, or the shell to extract entries listed in 2.txt from the output you just produced?

bioinfo · January 10, 2013, 1:08pm

I modified the code in the printf statement for the first output, and that helped me in getting the second output. Yippee!

Thanks a lot to you.
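
The thread does not show the commands used for the second step. As a sketch of one way to do it, the awk script below reads the wanted serial numbers from 2.txt and prints only the matching blocks from the numbered output. The file names numbered.txt and selected.txt and the sample contents are assumptions made for illustration.

```shell
# Hypothetical numbered output from the first step.
cat > numbered.txt <<'EOF'
(001) The following entry from entry 5 "500.000" is for branch 5:
ATOM 5
ENDMDL
(002) The following entry from entry 9 "900.000" is for branch 9:
ATOM 9
ENDMDL
(003) The following entry from entry 13 "1300.000" is for branch 13:
ATOM 13
ENDMDL
EOF

# 2.txt lists the serial numbers to extract, one per line.
printf '%s\n' 002 > 2.txt

awk 'FNR == NR { want[$1 + 0]; next }    # 1st file: remember wanted serials
/^\([0-9]+\) / {                          # tag line: starts a new block
    serial = substr($1, 2, length($1) - 2) + 0   # "(002)" becomes 2
    show = (serial in want)
}
show' 2.txt numbered.txt > selected.txt

cat selected.txt
# (002) The following entry from entry 9 "900.000" is for branch 9:
# ATOM 9
# ENDMDL
```

The sketch assumes 2.txt is not empty; with an empty first file, the FNR == NR test would also match the lines of numbered.txt.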