Extract lines from text files

I have some files containing the following data

 # RESIDUE AA STRUCTURE BP1 BP2 ACC N-H-->O O-->H-N N-H-->O O-->H-N TCO KAPPA ALPHA PHI PSI X-CA Y-CA Z-CA 
1 196 A M 0 0 230 0, 0.0 2,-0.2 0, 0.0 0, 0.0 0.000 360.0 360.0 360.0 76.4 21.7 -6.8 11.3
2 197 A D + 0 0 175 1,-0.1 2,-0.1 0, 0.0 0, 0.0 -0.193 360.0 151.5 -46.2 99.1 23.2 -9.3 13.8
3 198 A E - 0 0 170 -2,-0.2 -1,-0.1 0, 0.0 0, 0.0 -0.622 29.3-158.9-134.6 66.9 26.9 -9.0 13.0
4 199 A K - 0 0 161 1,-0.1 0, 0.0 -2,-0.1 0, 0.0 0.037 18.7-134.6 -43.9 157.1 28.8 -9.8 16.3
5 200 A R + 0 0 174 3,-0.0 2,-1.6 2,-0.0 -1,-0.1 0.294 60.4 134.1 -97.8 0.9 32.4 -8.5 16.6
6 201 A R + 0 0 178 1,-0.1 -2,-0.1 2,-0.1 0, 0.0 -0.429 24.5 143.8 -54.0 86.9 33.5 -11.9 17.9
7 202 A A + 0 0 76 -2,-1.6 -1,-0.1 2,-0.1 -2,-0.0 -0.471 24.7 108.8-134.5 48.7 36.5 -11.8 15.5
8 203 A Q S S+ 0 0 149 3,-0.0 2,-0.1 4,-0.0 -2,-0.1 -0.694 77.8 88.8-115.4 54.1 39.3 -13.4 17.4
9 204 A H S >> S- 0 0 121 4,-0.0 3,-2.1 0, 0.0 4,-0.7 -0.341 88.3 -9.7-128.0-146.1 38.5 -16.0 14.8
10 205 A N H 3> S+ 0 0 145 1,-0.3 4,-0.8 2,-0.2 5,-0.2 0.673 125.2 50.8 -27.9 -50.8 39.4 -17.0 11.2
11 206 A E H 34 S+ 0 0 159 1,-0.2 4,-0.3 2,-0.1 -1,-0.3 0.843 106.1 59.4 -64.2 -34.5 41.5 -13.9 10.2
12 207 A V H X4 S+ 0 0 60 -3,-2.1 3,-0.5 2,-0.1 4,-0.4 0.982 107.8 32.9 -62.8 -61.2 43.7 -14.0 13.3
13 208 A E H >X S+ 0 0 78 -4,-0.7 3,-4.0 1,-0.2 4,-0.9 0.950 109.6 53.5 -70.0 -62.3 45.4 -17.4 13.2

Desired output

ASG  ILE A   99    2    C          Coil    -82.86    141.16      97.1      1N8W
ASG  LEU A  146   48    C          Coil    -68.82    158.46       0.0      1N8W
ASG  LEU A  302  167    E        Strand    -98.11    143.77      19.7      1N8W

I want to extract the lines only if the values in the phi and psi columns are in the ranges -67<=phi<=-99 and 100<=psi<=165.
I would like to save the output in another folder, f2, using the input file names. I would highly appreciate your suggestions.

Thanks a lot.

$ awk 'NR==1{print;next}$15>=-67 && $15<=-99 && $16>=100 && $16<=165' file
awk 'NR==1||($(NF-3)>=100&&$(NF-3)<=165&&$(NF-4)>=-67&&$(NF-4)<=-99)' file

There are several problems here. First, and most importantly, your specification requiring a value for PHI that is greater than -67 and simultaneously less than -99 (-67<=phi<= -99) always yields the empty set.

If we assume that you meant -99 <= PHI <= -67, your sample data still produces no output (except for the heading) because only the fifth line of your input file (the line numbered 4) has a PSI value between 100 and 165 (157.1), and the PHI value on that line is -43.9 (which is out of range).
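To make the empty-set point concrete, here is a small sketch that counts how many values fall inside a pair of bounds. The function name count_in_range is made up for illustration; the PHI values are copied from the first seven data lines of the sample above.

```shell
# Hypothetical helper: count_in_range LO HI VALUE...
# Prints how many VALUEs satisfy LO <= VALUE <= HI.
count_in_range() {
    lo=$1 hi=$2
    shift 2
    printf '%s\n' "$@" |
        awk -v lo="$lo" -v hi="$hi" '$1 >= lo && $1 <= hi { n++ } END { print n + 0 }'
}

# Bounds as written in the question: no number is both >= -67 and <= -99.
count_in_range -67 -99 360.0 -46.2 -134.6 -43.9 -97.8 -54.0 -134.5

# Bounds swapped: only -97.8 qualifies (and its PSI, 0.9, is out of range).
count_in_range -99 -67 360.0 -46.2 -134.6 -43.9 -97.8 -54.0 -134.5
```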

When Akshay provided his suggested code, he apparently didn't notice that the data under the heading "STRUCTURE" looks like 0, 1, 2, or 3 fields to awk (when using the default field delimiter). Yoda compensated for that problem by counting fields from the end of the line, but apparently didn't notice that sometimes there are no field delimiters between values under the headings KAPPA, ALPHA, PHI, and PSI (for example, "29.3-158.9-134.6" on the line numbered 3 and "-9.7-128.0-146.1" on the line numbered 9). So, rather than using field delimiters, any code processing these lines will have to be based on column positions in the input file, not field counts.
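A minimal sketch of the column-position approach described above, using awk's substr() instead of field numbers. The offsets are an assumption: PHI in columns 104-109 and PSI in columns 110-115 is the usual fixed-width DSSP layout, but the sample as pasted in this thread has had its spacing collapsed, so check them against a real file before relying on them. The function name filter_by_columns is invented for illustration, and the PHI range is the assumed corrected one (-99 <= PHI <= -67).

```shell
# Hypothetical helper: select lines by fixed column positions, not fields.
# phi_col, psi_col and width are ASSUMED offsets; adjust them to your files.
filter_by_columns() {
    awk -v phi_col=104 -v psi_col=110 -v width=6 '
    NR == 1 { print; next }            # keep the heading line
    {
        # substr() still works when two values touch, e.g. "-158.9-134.6".
        # Adding 0 forces a numeric comparison.
        phi = substr($0, phi_col, width) + 0
        psi = substr($0, psi_col, width) + 0
    }
    phi >= -99 && phi <= -67 && psi >= 100 && psi <= 165
    ' "$@"
}

# Usage: filter_by_columns file
```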

Are there ever any <tab> characters in your input files? Or, are all of the spaces between fields just sequences of <space> characters?

Please provide us with a specification that doesn't always produce an empty set, and provide us some sample input that includes some lines that will be selected as well as some lines that will be rejected. And, show us the sample output you expect to be produced for that sample input.

And, please tell us how the name of the directory to contain the new files will be passed to your script.


Yup! We didn't notice. Thank you Don

Hi Cragun,

Thank you very much for your suggestions. I have rephrased my question below and changed the data. Please have a look at the question.

I have a folder f1 that contains some files. The contents of the files are shown below.

REM  |---Residue---|    |--Structure--|   |-Phi-|   |-Psi-|  |-Area-|      1N8W
ASG  GLU A   98    1    C          Coil    360.00    145.18     236.2      1N8W
ASG  ILE A   99    2    C          Coil    -82.86    141.16      97.1      1N8W
ASG  ILE A  100    3    C          Coil   -115.85    140.04      33.4      1N8W
ASG  GLN A  101    4    E        Strand   -114.08    115.71      61.8      1N8W
ASG  GLY A  127   29    C          Coil    149.12    153.69      21.1      1N8W
ASG  GLU A  128   30    T          Turn    -81.07    168.08     150.8      1N8W
ASG  PHE A  129   31    T          Turn    -55.84    139.19      85.7      1N8W
ASG  CYS A  144   46    H    AlphaHelix    -67.95    -16.88       0.0      1N8W
ASG  GLN A  145   47    C          Coil    -86.59    -11.10      29.5      1N8W
ASG  LEU A  146   48    C          Coil    -68.82    158.46       0.0      1N8W
ASG  PRO A  147   49    C          Coil    -61.30    150.63      46.7      1N8W
ASG  ILE A  148   50    G      310Helix    -57.27    -35.92      84.1      1N8W
ASG  TYR A  301  166    E        Strand   -110.40    111.53      75.1      1N8W
ASG  LEU A  302  167    E        Strand    -98.11    143.77      19.7      1N8W

Desired output

ASG  ILE A   99    2    C          Coil    -82.86    141.16      97.1      1N8W
ASG  LEU A  146   48    C          Coil    -68.82    158.46       0.0      1N8W
ASG  LEU A  302  167    E        Strand    -98.11    143.77      19.7      1N8W

I want to extract the lines only if the values in the phi and psi columns are in the ranges -67<=phi<=-99 and 100<=psi<=165.
I would like to save the output in another folder, f2, using the input file names. I would highly appreciate your suggestions.

Thanks a lot.

$ awk '$8>=-99 && $8<=-67 && $9>=100 && $9<=165' file

This will give you the desired output:

$ awk '$8>=-99 && $8<=-67 && $9>=100 && $9<=169' file

Combining all the inputs of the posters in this thread, we would end up with something like this:

awk '$8>=-99 && $8<=-67 && $9>=100 && $9<=165 || NR==1' file

Note that the 2nd line in your output specification is out of range (>165).

----
Edit: I see Akshay has already posted the same suggestion.

Hi scrutinizer,

Thank you. I have corrected it.

Hi Edweena,
I repeat: The specification that PHI must be in the range "-67<=phi<=-99" means that there must be no output. There is no value of phi that is both greater than -67 and less than -99! The following code looks for PHI to be in the range -99<=PHI<=-67 and produces the output you requested. It reads from files in directory f1 and writes files into directory f2 with the final component of both paths being the same filename.

awk -v dest_dir=f2 '
FNR == 1 || ($8 >= -99 && $8 <= -67 && $9 >= 100 && $9 <= 165) {
        if(FNR == 1 && file != "")
                # Close previous output file.
                close(file)
        if(FNR == 1) {
                # Set output pathname for current file.
                # Start with input pathname.
                file = FILENAME
                # Strip off directories.
                sub(/.*\//, "", file)
                # Add destination directory.
                file = dest_dir "/" file
        }
        if(FNR != 1) # Delete this line to add header line to output files.
                print > file
}' f1/*

If you want to try this on a Solaris/SunOS system, use /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk instead of /usr/bin/awk.
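One thing to watch for with a script like the one above: awk's "print > file" redirection does not create missing directories, so f2 has to exist before the script runs. Below is a minimal wrapper sketch, assuming a POSIX shell. The function name extract_phi_psi and its two arguments are invented here for illustration, and it uses a simplified one-file-at-a-time variant of the filter that writes only the matching lines.

```shell
# Hypothetical wrapper: extract_phi_psi SRC_DIR DEST_DIR
extract_phi_psi() {
    src_dir=$1
    dest_dir=$2
    mkdir -p "$dest_dir" || return 1      # awk will not create it for us
    for f in "$src_dir"/*; do
        [ -f "$f" ] || continue           # skip subdirectories and an unmatched glob
        # Same range test as above; the output keeps the input file name.
        awk '$8 >= -99 && $8 <= -67 && $9 >= 100 && $9 <= 165' "$f" \
            > "$dest_dir/$(basename "$f")"
    done
}

# Usage: extract_phi_psi f1 f2
```

Unlike the awk-only version, this variant creates an (empty) output file even when an input file has no matching lines.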

Hi Don Cragun,

Thank you very much for your help.