Help in awk/bash

Hi, I am also a newbie in awk and am trying to find a solution to my problem.

I have one reference file, 1.txt, with 2 columns, and I want to search 10 other files (a.txt, b.txt, ..., h.txt, each with 5 columns) for the values in the 2nd column of 1.txt. If a value from the 2nd column of 1.txt matches a value in the 4th column of any of the 10 files, then print that row as well as the file name.
Also, in 1.txt the 1st value is, for example, -191.632 but in a.txt it is originally -191.6318, so I want values to be treated as matching if they agree up to two decimal places; the remaining decimal places can be any digits.

1.txt:

1.35732	-191.632
1.36229	-190.8716
1.35503	-191.3254
1.35597	-191.2652

a.txt:

271640.000	 0.49000	 -0.0000036574 -191.6318 -183.82380	
271650.000	 0.49155	 0.0000033909	 -198.30111	 -198.73140	
271660.000	 0.48775	 0.0000014657	 -191.3254 -199.84910	
271670.000	 0.48212	 -0.0000004152 -195.48446	 -193.15580	

Please guide.
Thanks

You can 'join' file 1.txt to each of the [a-h].txt files in a 'for' loop, and process the output by piping it to a shell 'while read'. The file name will be in the 'for' variable and the file columns will all be present in the 'read' variables. You have to 'sort' every file on the key column using a byte-order sort (export LC_ALL=C), not a numeric sort. Hopefully the original line order is not critical; if it is, number the lines in a new field first. While you could do the joining with a pile of awk or shell commands, 'join' is cleaner.

Man Page for join (opensolaris Section 1) - The UNIX and Linux Forums

Man Page for sort (all Section 1) - The UNIX and Linux Forums
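For what it's worth, a rough sketch of that approach (my own illustration, assuming the keys match exactly; the two-decimal matching you described would need the key columns rounded first, e.g. with awk and printf "%.2f"):

export LC_ALL=C                          # byte-order collation for sort and join
sort -k2,2 1.txt > 1.sorted              # sort the reference file on its key column
for f in [a-h].txt
do
        sort -k4,4 "$f" |                # sort each data file on its key column
        join -1 2 -2 4 1.sorted - |      # join 1.txt column 2 to the data file's column 4
        while read -r line
        do
                printf '%s %s\n' "$line" "$f"   # append the file name to every match
        done
done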

1 Like

Thanks for the reply.
Can you please help in writing the code, as I am not an expert in awk. :frowning:

Thanks again

I'm not sure whether you want the values in 1.txt column 2 and a-h.txt column 4 truncated to two decimal places or rounded to two decimal places (with your sample input, the results are the same), and I'm not sure why DGPickett thinks join and sort would be easier than awk, but here are ways to use awk to do what I think you're requesting...

echo "awk with rounded values"
awk ' FNR == NR {v[sprintf("%.2f", $2)]}
sprintf("%.2f", $4) in v {print $0, FILENAME}' 1.txt [a-h].txt

echo "awk with truncated values"
awk '
function trunc(val) {
        split(val, a, /[.]/)
        return a[1] "." substr(a[2] "00", 1, 2)
}
FNR == NR {v[trunc($2)]}
trunc($4) in v {print $0, FILENAME}' 1.txt [a-h].txt
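With the sample 1.txt and a.txt shown above, both versions should print the two a.txt rows whose 4th column matches a 2nd-column value from 1.txt to two decimal places, with the file name appended; something like:

271640.000	 0.49000	 -0.0000036574 -191.6318 -183.82380	 a.txt
271660.000	 0.48775	 0.0000014657	 -191.3254 -199.84910	 a.txt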
1 Like

Thanks for the reply.
Can you please explain it a bit?

Thanks again.

1  echo "awk with rounded values"
2  awk ' FNR == NR {v[sprintf("%.2f", $2)]; next}
3  sprintf("%.2f", $4) in v {print $0, FILENAME}' 1.txt [a-h].txt
4
5  echo "awk with truncated values"
6  awk '
7  function trunc(val) {
8          split(val, a, /[.]/)
9          return a[1] "." substr(a[2] "00", 1, 2)
10 }
11 FNR == NR {v[trunc($2)]; next}
12 trunc($4) in v {print $0, FILENAME}' 1.txt [a-h].txt

I have added line numbers to aid in this discussion, but note that the line numbers cannot appear in the script when you run it.

Also note that I have added an awk next command to lines 2 and 11. With the given sample data it won't make any difference, but with other data or with different fields being checked, it could be important.

In the suggestion on lines 1-3, sprintf("%.2f", arg) converts the string specified by arg to a floating point value and produces a string representing that value rounded to two digits after the decimal point. Line 2 uses that to create an array whose indices are the rounded floating point values of the second field ($2) in the first input file (i.e., lines where the record number within the current file [FNR] is equal to the total number of records read so far by awk [NR], which is only true while reading the first file).

(The next command I added here causes awk to skip to the next record instead of checking whether any remaining rules in the script should be executed. Without the next, line 3 would also be evaluated for lines from the first input file. It doesn't affect processing here because there is no field 4 in that file: the empty field 4 would be converted to "0.00", and none of the strings in the second field of 1.txt convert to "0.00", so no spurious match occurs.)

Line 3 tests whether the same conversion used in line 2 produces a string that is an index in the array v (index in array evaluates to TRUE if index is an index in the array named array). So, if $4 (rounded to two decimal places) in any of the files after the 1st file matches $2 (rounded to two decimal places) in the first file, the print command will be run, printing the current input line ($0) and the name of the file containing the line (FILENAME).

The 1.txt [a-h].txt on lines 3 and 12 specifies the eleven input files to be processed by these awk scripts.

The suggestion on lines 5-12 uses the same logic as the 1st suggestion but truncates the strings to two decimal places instead of rounding them. Since the truncation logic is more complex than the single call to sprintf() used to perform the rounding, I wrote a function (lines 7-10) to convert the string to a string representing a floating point value with two decimal places.

The split() on line 8 creates an array of one or two elements, with the first element containing all of the characters before the "." and the second element containing all of the characters after the ".". If there is no "." in the input value, the first element of the array will contain the entire input string and the second element will not be set (and, when referenced, will act as an empty string). The return command on line 9 returns a string that is the concatenation of the first element of the array, a decimal point, and the 1st two characters of the concatenation of the second element of the array followed by "00". (The concatenation with "00" takes care of cases where field 2 in the first file or field 4 in the remaining files is an integer with no decimal point, and cases where the input field has a decimal point but fewer than two digits after it.)
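If it helps to see what trunc() does with different inputs, here is a quick stand-alone check (my own illustration; the values are made up):

awk '
function trunc(val) {
        split(val, a, /[.]/)
        return a[1] "." substr(a[2] "00", 1, 2)
}
BEGIN {
        print trunc("-191.6318")        # -191.63  (extra digits dropped)
        print trunc("-191.6")           # -191.60  (the "00" pads a short fraction)
        print trunc("42")               # 42.00    (integer with no decimal point)
}'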

The logic on lines 11 and 12 is the same as the logic on lines 2 and 3.

1 Like

Hi,
Thanks a lot, I have done it. :slight_smile:
I have got the following output for all files (just showing it for one file and naming it o.txt):

100.000        0.51332	   0.0000001923	 -191.04738     a.txt
2000.000	   0.49573	   0.0000015512	 -191.40071     a.txt
1000.000	   0.51047	   0.0000028339	 -190.92254     a.txt	

Further, I need your help. I have 10 more files, all in the same format as the following file (11.txt); I am showing 2 repeats from this file:

ATOM      1  N    SER A   1      35.092  83.194 140.076  1.00  0.00           N  
ATOM      2  CA  SER A   1      35.216  83.725 138.725  1.00  0.00           C  
ATOM      3  C    SER A   1      36.530  84.485 138.538  1.00  0.00           C  
TER
ENDMDL
ATOM      1  N   SER A   1      35.683  81.326 139.778  1.00  0.00           N  
ATOM      2  CA  SER A   1      35.422  82.736 139.929  1.00  0.00           C  
ATOM      3  C   SER A   1      36.497  83.588 139.247  1.00  0.00           C  
TER
ENDMDL
ENDMDL appears around 10000 times in each file. If I give an input of 100 as $1 from o.txt, then it should output the first repeat from 11.txt, ending with ENDMDL:
ATOM      1  N    SER A   1      35.092  83.194 140.076  1.00  0.00           N  
ATOM      2  CA  SER A   1      35.216  83.725 138.725  1.00  0.00           C  
ATOM      3  C    SER A   1      36.530  84.485 138.538  1.00  0.00           C  
TER
ENDMDL

So, corresponding to the first column of o.txt, I want to retrieve the repeat at number $1/100 from 11.txt, i.e. if $1=2000, then I want to retrieve the pattern where ENDMDL is in the 20th place.

Please guide me.

Thanks again

---------- Post updated at 10:40 PM ---------- Previous update was at 09:52 PM ----------

Please guide me. It's urgent. :frowning:

Thanks

First, let me be very clear: I am a volunteer in this forum. Nothing that you ask me to do is urgent. If you need me to consider stuff that you'd like me to do for you urgent, you need to put me on your payroll!

I'm not sure I understand what you want. Am I correct in making the following assumptions:

  1. The input for this assignment is a file named o.txt .
  2. The first field of each line in o.txt is of the form x00.000 with 1 <= x <= 10000.
  3. For each line read from o.txt , the xth entry from file 11.txt is to be written to standard output where each entry in 11.txt is terminated by a line containing only ENDMDL .
  4. In addition to 11.txt , there are 9 more files like it in the same format as 11.txt that are to be ignored.
  5. You have already verified that the value in the first field of o.txt will correspond to an existing entry in 11.txt (i.e., I don't need to worry about negative values in the 1st field of o.txt , values in that field that don't end with "00.000", nor values before the "00.000" that identify a number greater than the number of times "ENDMDL" appears in 11.txt ).

Are these assumptions correct?

If the above assumptions are all correct, the following script should do what you want:

#!/bin/ksh
awk 'BEGIN {rc = 1}                       # rc is the number of the entry currently being collected
FNR == NR {r[rc] = r[rc] $0 "\n"          # while reading 11.txt, append this line to entry rc
        if($0 == "ENDMDL") rc++           # ENDMDL closes an entry; start collecting the next one
        next}                             # do not fall through to the o.txt rule below
{       # while reading o.txt, FS (set below) makes $1 the entry number
        printf("%s", r[$1])}' 11.txt FS='00[.]000' o.txt
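In case the FS='00[.]000' part looks odd: assignments given as operands on the awk command line are processed in order, so that one only takes effect for o.txt, and it makes awk split each o.txt line on the string 00.000, leaving the number of hundreds in $1. A quick way to see the effect (my own illustration, not part of the script):

echo '2000.000    0.49573' | awk -F '00[.]000' '{print $1}'    # prints 20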

As always, if you're using a Solaris system, use /usr/xpg4/bin/awk or nawk instead of awk .

On some awk implementations, setting the array r could be simplified by setting RS to "ENDMDL" before processing 11.txt, but the standards only define the behavior when RS is set to a single character or to the empty string. The awk on OS X (which I use for testing when I'm working on solutions for issues raised in this forum) is one of the implementations that only uses the first character of RS values as the record separator.
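For what it's worth, here is roughly what that RS-based variant could look like on an awk that does support a multi-character RS, such as gawk (an untested sketch, not a portable solution; the RS and FS assignments between the file operands take effect for the file that follows them):

gawk 'FNR == NR {r[FNR] = $0 "ENDMDL\n"; next}   # each record is one entry minus its ENDMDL line
{printf("%s", r[$1])}' RS='ENDMDL\n' 11.txt RS='\n' FS='00[.]000' o.txt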

1 Like

Hi,
Thanks again for guidance. :slight_smile:
Sorry, I did not mean to hurt anyone.

Most of your assumptions are correct and I wish to make some of them more clear:
2. The values of x are not in a sequence, but they are certainly positive, e.g. 2000, 7000, 3000, 1982480 (for bigger files), etc.
3. For each value of the first field (x) from o.txt, I wish to divide it by 100 and retrieve the corresponding entry from 11.txt ending with ENDMDL. That means, if the value of x is 1000.000, I wish to divide it by 100 and then retrieve the 10th entry from 11.txt.

Please explain the concept of rc. :frowning:

Thanks again.

rc means record counter.

1 Like

The script I provided in message #8 in this thread assumes that the first field in o.txt has values like 7000.000, 3000.000, and 1982400.000 (not 1982480.000 or 1982480) to get the 70th, 30th, and 19,824th entries from 11.txt. If the 1st field in o.txt does not end with 00.000, the current script won't print anything for that line in o.txt. If you have values like 1982480, which is not evenly divisible by 100, you need to explain whether the value is to be skipped, truncated, or rounded to determine which entry from 11.txt to print. (In other words, since there is no entry numbered 19,824.80, do you want nothing to be printed, do you want the result of the division truncated to return the 19,824th entry, or do you want it rounded to return the 19,825th entry?) Why did all entries in your sample o.txt file end with 00.000 if you are now saying that the values in that field are sometimes integers and are not always evenly divisible by 100?

The script I provided does not assume that the values in the first field from o.txt are in sequence; with the data you gave as a sample it will print the 1st, 20th, and 10th entries from 11.txt in that order.

In the script I provided, rc is the number of entries that have already been read from 11.txt, plus one. So when the script starts reading lines from 11.txt, the lines are accumulated into r[1] until after the line containing ENDMDL is added to that entry. Then rc is incremented so that subsequent lines are added to the next entry...
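If you want to watch rc in action, you could temporarily add a debugging printf (purely illustrative; remove it for the real run):

awk 'BEGIN {rc = 1}
FNR == NR {r[rc] = r[rc] $0 "\n"
        if($0 == "ENDMDL") {
                printf("entry %d ends at line %d of %s\n", rc, FNR, FILENAME)
                rc++
        }
        next}
{       printf("%s", r[$1])}' 11.txt FS='00[.]000' o.txt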

1 Like

The 1st field in o.txt does end with 00.000.
Sorry, I did not know that it's so critical to include 00.000 :smiley:

Now I have checked manually in all my files: except for one value, which is 5390.001, all others end with .000.

You raised a good point (which even I had not thought of :eek: ) regarding the value 19,824.80 and the three options:
(1) nothing to be printed,
(2) result of the division truncated to return the 19,824th entry, or
(3) rounded to return the 19,825th entry

So, I wish to retrieve both the 19,824th and 19,825th entries in one file, as well as 3 other files with the above options.

That means, for 1st field values that are not divisible by 100, I wish to have one file containing nothing for them, one file with the truncated entries, one file with the rounded entries, and a fourth file with both the truncated and rounded entries (but each of these 4 files must have the divisible entries too).

Thanks

It isn't critical, I just took advantage of it since your sample input was always in this form.

OK. So I have to do some arithmetic instead of letting awk treat 00.000 as a field separator.

In the file with both entries, if the truncated and rounded entries are the same, do you want that entry printed twice, or just once? (For example, 5310.000 truncates to the 53rd entry and rounds to the 53rd entry.)

  1. What names do you want for these four files?
  2. In the file that has both rounded and truncated entries, do you want any kind of marker added to the output entries indicating that there are two output records for a single input line? If so, what should the marker be?
  3. In the file with truncated and rounded entries, do you want any kind of marker added to the output entries indicating that the record from 11.txt was selected based on truncating a value or rounding a value, respectively? If so, what should the markers be?
  4. In the file with nothing for values that are not evenly divisible by 100, do you want any kind of marker in the output to show that an entry was skipped? If so, what should the marker be?
  5. Do you want one of these four files to be written to standard output, or do you want all output to be written directly to the four files?

If you want markers, it would be relatively easy to include markers of the form:

Following entry (%d) comes from %s truncated:
Following entry (%d) comes from %s rounded:
Entry skipped because %s is not evenly divisible by 100.

where the %d is replaced by the entry number of the following lines and %s is replaced by the 1st field in o.txt , if that is what you want.

1 Like

They can be printed once.

File names can be no.txt, trun.txt, round.txt, tro.txt

Yes, I wish to have markers. All output should be directly written to four files.

Thanks :slight_smile:

I believe the following script does what you want:

#!/bin/ksh
no=${1:-no.txt}         # name of file for no entry if $1%100 != 0
to=${2:-trun.txt}       # name of file for truncated $1 entries
ro=${3:-ro.txt}         # name of file for rounded $1 entries
bo=${4:-tro.txt}        # name of file for both rounded & truncated $1 entries
awk -v bo="$bo" -v no="$no" -v ro="$ro" -v to="$to" 'BEGIN {rc = 1}
FNR == NR {r[rc] = r[rc] $0 "\n"
    if($0 == "ENDMDL") rc++
    next}
{   # If we got to here, we are reading lines from the 2nd file.
    # Determine exact, truncated, and rounded entry numbers.
    if (substr($1, length($1) - 5) == "00.000") {
        # $1 ends in 00.000; no truncation or rounding needed.
        entry = substr($1, 1, length($1) - 6)
        round = trunc = 0
    } else {
        # $1 is not evenly divisible by 100; calculate rounded and truncated
        # values.
        entry = 0
        round = sprintf("%.0f", $1 / 100)
        trunc = substr($1, 1, length($1) - 6)
    }
    # Determine which markers and entries to print in each output file.
    if(entry) {
        # No rounding and no truncation involved.  Write the appropriate entry
        # to each output file.
        printf("%s", r[entry]) > bo
        printf("%s", r[entry]) > no
        printf("%s", r[entry]) > ro
        printf("%s", r[entry]) > to
    } else {
        # Rounding and truncation performed; Prepare shared markers.
        rm = sprintf("Following entry (%d) comes from %s rounded:", round, $1)
        tm = sprintf("Following entry (%d) comes from %s truncated:", trunc, $1)

        # Write appropriate markers and/or entries for each output file.
        printf("%s\n%s", tm, r[trunc]) > bo
        if(trunc != round) printf("%s\n%s", rm, r[round]) > bo
        printf("Entry skipped because %s is not evenly divisible by 100.\n",
            $1) > no 
        printf("%s\n%s", rm, r[round]) > ro 
        printf("%s\n%s", tm, r[trunc]) > to
    }
}' 11.txt o.txt

Note that it still assumes that the 1st field in o.txt is formatted as a floating point number with three digits after the radix point. If you save this script in a file (for example split4 ), make it executable:

chmod +x split4

edit it to change /bin/ksh in the first line of the script to be an absolute pathname to the Korn shell on your system (if it isn't in /bin/ksh ), and run it:

./split4

it will create your four output files no.txt , ro.txt , tro.txt , and trun.txt from the input files 11.txt and o.txt . Note that it will overwrite each of these four files each time you run the script; not append to them. If you want to use different file names for the output files, run it as:

./split4 no_round_or_trunc truncated rounded rounded_and_truncated

to specify alternative output file names (note that the order is important).

1 Like

Thanks. I will try to run and let you know.

Anyways, Wishing you and the members of the forum a very HAPPY NEW YEAR 2013. :slight_smile:

---------- Post updated at 08:51 PM ---------- Previous update was at 07:48 PM ----------

I am getting error:

'for reading <No such file or directory>  'o.txt

But, I am working in the same directory where all files are there.

Also, if I include comments at the end of the script using #, then I also get errors:

'for reading <No such file or directory>  'o.txt
split4.sh: : line 45: $'\r': command not found

line 45 (line where comments start)

From the looks of the error message (note the single quote at the start of the line, which seems like it should be at the end of the file name), you ended up with a <carriage-return> character at the end of "o.txt" in the last line of split4.sh, and awk is trying to open a file with the name <o><period><t><x><t><carriage-return>. To verify this, try running the command:

od -c split4

and look for a \r before the \n that should be at the end of the file. This suspicion is supported by the later error saying that the command $'\r' can't be found.
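For reference, this is roughly what a carriage-return before the newline looks like in od -c output (my own illustration):

printf 'o.txt\r\n' | od -c
0000000   o   .   t   x   t  \r  \n
0000007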

If you find one or more occurrences of \r in the output from od , run the commands:

cp split4.sh _split4.sh
tr -d '\r' < _split4.sh > split4.sh

and then try running split4 again.

1 Like

Errors with \r in them mean "stop editing your scripts in Microsoft Notepad".

1 Like

I am using Notepad ++ :smiley:

Thanks a lot Don Cragun and Corona688. I edited the script in vi and it's working. Yippie :slight_smile:

I have one more query. I am using the following tro.txt as my input file for a further program:

Following entry (2659) comes from 265920.000 truncated:
ATOM      1  N   SER A   1     117.041 155.383 146.906  1.00  0.00           N  
ATOM      2  CA  SER A   1     115.956 155.933 147.729  1.00  0.00           C  
ATOM      3  C   SER A   1     116.331 155.850 149.194  1.00  0.00           C  
TER
ENDMDL
Following entry (2703) comes from 270330.000 rounded:
ATOM      1  N   SER A   1     122.255 148.746 136.780  1.00  0.00           N  
ATOM      2  CA  SER A   1     122.237 147.748 137.846  1.00  0.00           C  
ATOM      3  C   SER A   1     121.916 148.457 139.169  1.00  0.00           C  
TER
ENDMDL
Following entry (2703) comes from 270360.000 rounded:
..........................................................................
..........................................................................

I wish to delete all lines like the following from this file:

Following entry (2659) comes from 265920.000 truncated:
Following entry (2703) comes from 270330.000 rounded:
Following entry (2703) comes from 270360.000 rounded:
..........................................................................
..........................................................................

Required output:

ATOM      1  N   SER A   1     117.041 155.383 146.906  1.00  0.00           N  
ATOM      2  CA  SER A   1     115.956 155.933 147.729  1.00  0.00           C  
ATOM      3  C   SER A   1     116.331 155.850 149.194  1.00  0.00           C  
TER
ENDMDL
ATOM      1  N   SER A   1     122.255 148.746 136.780  1.00  0.00           N  
ATOM      2  CA  SER A   1     122.237 147.748 137.846  1.00  0.00           C  
ATOM      3  C   SER A   1     121.916 148.457 139.169  1.00  0.00           C  
TER
ENDMDL

Please guide.
Thanks.
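A minimal way to strip those marker lines (a sketch, assuming every marker line begins with either "Following entry" or "Entry skipped" as in the split4 script above; tro_clean.txt is just an example name for the output):

grep -v -e '^Following entry' -e '^Entry skipped' tro.txt > tro_clean.txt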