Search field in text file and replace value

Hi there,

First of all this is my first post here. Thank you in advance for your help.

What I am trying to do is the following. I have a text file where each field of each row is separated by a tab.

Looks like this:

ATOM      1  N   HSE A  26       3.033 -10.429  -2.262  1.00 17.07           N1+
ATOM      2  CA  HSE A  26       3.226 -11.674  -3.040  1.00 14.73           C  
ATOM      3  CB  HSE A  26       4.705 -11.978  -3.127  1.00 15.52           C  
ATOM      4  CG  HSE A  26       5.055 -13.031  -4.057  1.00 15.51           C  
ATOM      5  ND1 HSE A  26       4.959 -14.364  -3.715  1.00 15.39           N  
ATOM      6  CE1 HSE A  26       5.349 -15.091  -4.746  1.00 17.55           C  
ATOM      7  NE2 HSE A  26       5.765 -14.285  -5.726  1.00 21.97           N  
ATOM      8  CD2 HSE A  26       5.577 -12.980  -5.296  1.00 18.48           C  
ATOM      9  C   HSE A  26       2.538 -12.795  -2.235  1.00 13.15           C  
ATOM     10  O   HSE A  26       2.537 -12.755  -1.031  1.00 13.11           O  
ATOM     11  H1  HSE A  26       3.422 -10.546  -1.337  1.00 17.07           H  
ATOM     12  H2  HSE A  26       2.046 -10.227  -2.189  1.00 17.07           H  
ATOM     13  H3  HSE A  26       3.499  -9.664  -2.729  1.00 17.07           H  
ATOM     14  HA  HSE A  26       2.818 -11.585  -4.047  1.00 14.73           H  
ATOM     15  HB2 HSE A  26       5.049 -12.273  -2.136  1.00 15.52           H  
ATOM     16  HB3 HSE A  26       5.221 -11.068  -3.435  1.00 15.52           H  
ATOM     17  HD2 HSE A  26       5.808 -12.085  -5.855  1.00 18.48           H  
ATOM     18  HE2 HSE A  26       6.146 -14.573  -6.616  1.00 21.97           H  
ATOM     19  HE1 HSE A  26       5.334 -16.170  -4.790  1.00 17.55           H  
ATOM     20  N   PRO A  27       1.965 -13.801  -2.950  1.00 14.19           N  
ATOM     21  CA  PRO A  27       1.227 -14.887  -2.217  1.00 14.75           C  
ATOM     22  CB  PRO A  27       0.797 -15.859  -3.316  1.00 17.54           C  
ATOM     23  CG  PRO A  27       0.763 -15.036  -4.490  1.00 19.69           C  
ATOM     24  CD  PRO A  27       1.755 -13.904  -4.376  1.00 16.62           C  
ATOM     25  C   PRO A  27       2.086 -15.623  -1.216  1.00 13.14           C  
ATOM     26  O   PRO A  27       1.601 -16.109  -0.212  1.00 13.57           O  
ATOM     27  HA  PRO A  27       0.404 -14.463  -1.642  1.00 14.75           H  
ATOM     28  HB2 PRO A  27      -0.187 -16.278  -3.104  1.00 17.54           H  
ATOM     29  HB3 PRO A  27       1.520 -16.668  -3.427  1.00 17.54           H  
ATOM     30  HG2 PRO A  27      -0.239 -14.623  -4.609  1.00 19.69           H  
ATOM     31  HG3 PRO A  27       1.011 -15.643  -5.360  1.00 19.69           H  
ATOM     32  HD2 PRO A  27       2.684 -14.143  -4.893  1.00 16.62           H  
ATOM     33  HD3 PRO A  27       1.343 -12.979  -4.779  1.00 16.62           H  

First, I find the last row that starts with ATOM and save the value of the 6th column in that row, plus one:

last=$(( $(grep '^ATOM' test.pdb | tail -n1 | awk '{ print $6 }') + 1 ))

Then I want to search the 6th column for a value I define and replace it with another value. This should be done for every row up to the last one that starts with ATOM. Is there a way to use AWK or sed to search specifically in that column? I am new to shell scripting, so sorry if the question is too trivial.

Thanks for help,

Max

---------- Post updated at 06:41 PM ---------- Previous update was at 06:18 PM ----------

Sorry, the fields are not separated by tabs. The file uses fixed-width columns:

COLUMNS        DATA  TYPE    FIELD        DEFINITION
-------------------------------------------------------------------------------------
 1 -  6        Record name   "ATOM  "
 7 - 11        Integer       serial       Atom  serial number.
13 - 16        Atom          name         Atom name.
17             Character     altLoc       Alternate location indicator.
18 - 20        Residue name  resName      Residue name.
22             Character     chainID      Chain identifier.
23 - 26        Integer       resSeq       Residue sequence number.
27             AChar         iCode        Code for insertion of residues.
31 - 38        Real(8.3)     x            Orthogonal coordinates for X in Angstroms.
39 - 46        Real(8.3)     y            Orthogonal coordinates for Y in Angstroms.
47 - 54        Real(8.3)     z            Orthogonal coordinates for Z in Angstroms.
55 - 60        Real(6.2)     occupancy    Occupancy.
61 - 66        Real(6.2)     tempFactor   Temperature  factor.
77 - 78        LString(2)    element      Element symbol, right-justified.
79 - 80        LString(2)    charge       Charge  on the atom.

I am interested in columns 23 - 26 (residue sequence number).

Try:

 awk '/^ATOM/ { v=substr($0,23,4) } END{print v}' file
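To get the value the first post stores in last (the last residue number plus one), the same command can do the arithmetic itself. A minimal sketch; the function name next_resid and the temporary sample file are assumptions for illustration only:

```shell
# Print the residue number (columns 23-26) of the last ATOM record, plus one.
# Hypothetical helper; $1 is the PDB file name.
next_resid() {
    awk '/^ATOM/ { v = substr($0, 23, 4) } END { print v + 1 }' "$1"
}

# Small sample file built from two records of the first post.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
ATOM     19  HE1 HSE A  26       5.334 -16.170  -4.790  1.00 17.55           H
ATOM     33  HD3 PRO A  27       1.343 -12.979  -4.779  1.00 16.62           H
EOF

last=$(next_resid "$tmp")
echo "$last"
rm -f "$tmp"
```

Because the value is read by character position, it does not matter whether neighbouring fields run together or contain the same digits.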

To replace values try:

awk -vF=27 -vT=28 '/^ATOM/&&substr($0,23,4)+0==F{$0=substr($0,1,22) sprintf("%4d",T) substr($0,27)}1' file

Hi,

thanks for your answer. I just figured out a way to do it:

#!/bin/bash

cat /dev/null > out.txt
cat $2 | while read line; do 
      id=`echo $line | awk '{ print $6 }'`
      atom=`echo $line | awk '{ print $1 }'`
      if ! [[ "$atom" == "ATOM" ]] ; then
      echo "$line" >> out.txt
      else
      if [ $id -gt $1 ]
      then
        rid=$[$id-2]
        echo "$line" | sed "s/$id/$rid/" >> out.txt
      else
        echo "$line" >> out.txt
      fi
      fi
  done

where $1 is the id above which the ids are changed and $2 is the input file name

---------- Post updated at 08:00 PM ---------- Previous update was at 07:51 PM ----------

The only thing I just realized is that if

ATOM   1181  N   ASN A 100      10.938  11.671  38.632  1.00  8.17           N

is to be changed to 99 I get

ATOM   1181  N   ASN A 99      10.938  11.671  38.632  1.00  8.17           N

but I need an additional space:

ATOM   1181  N   ASN A  99      10.938  11.671  38.632  1.00  8.17           N
ATOM   1181  N   ASN A 99      10.938  11.671  38.632  1.00  8.17           N
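The missing space comes from the residue number being a right-justified, four-character field (columns 23-26). One way to keep the width constant is to cut the line around that field and let printf pad the new number. This is only a sketch using the line from this post as sample data; the variable names left, right and fixed are made up:

```shell
# Sample line from the post above; residue number 100 sits in columns 23-26.
line='ATOM   1181  N   ASN A 100      10.938  11.671  38.632  1.00  8.17           N'
newid=99

left=$(printf '%s\n' "$line" | cut -c1-22)    # columns 1-22, kept as they are
right=$(printf '%s\n' "$line" | cut -c27-)    # column 27 to the end of the line

# %4d pads the new id with leading spaces to exactly four characters.
fixed=$(printf '%s%4d%s' "$left" "$newid" "$right")
echo "$fixed"
```

The rebuilt line has the same length as the original, so all following columns stay where they were.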

The danger with your code is that if the ID also appears in another field (like the serial number, location or name), it will replace that instead.
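A small illustration of that failure mode, using a record in which the serial number and the residue number are both 27. The sample line and the variable names unsafe and safe are assumptions for the demonstration:

```shell
# Sample record: serial number (columns 7-11) and residue number (23-26) are both 27.
line='ATOM     27  HA  PRO A  27       0.404 -14.463  -1.642  1.00 14.75           H'

# Unanchored substitution: sed replaces the first "27" it finds, the serial number.
unsafe=$(printf '%s\n' "$line" | sed 's/27/25/')

# Column-based rewrite: only characters 23-26 are replaced.
safe=$(printf '%s\n' "$line" |
    awk '{ print substr($0, 1, 22) sprintf("%4d", 25) substr($0, 27) }')

printf '%s\n%s\n' "$unsafe" "$safe"
```

The first output line has the serial number changed to 25 and the residue number untouched; the second has it the other way round, which is what was wanted.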

Yeah, you are right. I just worked this out too.

---------- Post updated 02-14-13 at 12:09 AM ---------- Previous update was 02-13-13 at 08:03 PM ----------

I am pretty sure that my code is horrible and that there are 1000x better ways to do what I did, but this works now. I took into account that the number could appear in another spot, so that is now ruled out. I thought I would share it; maybe somebody else gets inspired or wants to tell me how to solve it in a more appropriate manner.

while getopts ":f:o:c:" opt; do
  flags=1
  case $opt in
    f)
      inputfile=$OPTARG
      ;;
    o)
      outputfile=$OPTARG
      ;;
    c)
      if [[ $OPTARG == "" ]]
      then
        consec=0
      else
        consec=$OPTARG
      fi
      ;;
    \?)
      echo "Invalid option: -$OPTARG" >&2
      exit
      ;;
  esac
done


if [[ $flags == "" ]]
then
	echo "Usage: PDBid_change -f -o [-c]"
	echo "-f input file name"
	echo "-o output file name"
	echo "-c start id from 0 or value"
	exit
fi

if [[ $inputfile == "" ]]
then
	echo "Please provide an input file name (-f filename)"
	exit
else
	if ! [[ -e "$inputfile" ]]
	then
		echo "Input file does not exist"
		exit
	fi
fi

if [[ $outputfile == "" ]]
then
	echo "Please provide an output file name (-o filename)"
	exit
fi

cat /dev/null > $outputfile


function write_pdb() {
if  [[ ${#newid} == ${#position} ]]
then
	if  [[ ${#newid} == 1 ]]
	then
		echo "$line" | sed "s/A   $id/A   $newid/" >> $1
	elif [[ ${#newid} == 2 ]]
	then
		echo "$line" | sed "s/A  $id/A  $newid/" >> $1
	elif [[ ${#newid} == 3 ]]
	then
		echo "$line" | sed "s/A $id/A $newid/" >> $1
	else
		echo "$line" | sed "s/A$id/A$newid/" >> $1
	fi
elif [[ ${#newid} == 1 && ${#position} == 2 ]]
then
	echo "$line" | sed "s/A  $id/A   $newid/" >> $1
elif [[ ${#newid} == 1 && ${#position} == 3 ]]
then
	echo "$line" | sed "s/A $id/A   $newid/" >> $1
elif [[ ${#newid} == 1 && ${#position} == 4 ]]
then
	echo "$line" | sed "s/A$id/A   $newid/" >> $1
elif [[ ${#newid} == 2 && ${#position} == 3 ]]
then
	echo "$line" | sed "s/A $id/A  $newid/" >> $1
elif [[ ${#newid} == 2 && ${#position} == 4 ]]
then
	echo "$line" | sed "s/A$id/A  $newid/" >> $1
else
	echo "$line" | sed "s/A$id/A $newid/" >> $1
fi
}


cat $inputfile | while read line; do

	atom=`echo $line | awk '{ print $1 }'`
	if [[ "$atom" != "ATOM" && "$atom" != "TER" ]]
	then
		echo "$line" >> $outputfile
	else
		if [[ "$position" == "" ]]
		then
			position=`echo $line | awk '{ print $6 }'`
			if [[ "$consec" != "" ]]
			then
				previous=$consec
			else
				previous=`echo $line | awk '{ print $6 }'`-5
			fi
		fi

		id=`echo $line | awk '{ print $6 }'`
		if [[ "$position" == "$id" ]]
		then
			newid=$[$previous+1]
			write_pdb $outputfile
		elif [ $atom == "TER" ]
		then
			id=`echo $line | awk '{ print $5 }'`
			write_pdb $outputfile
		else
			previous=$newid
			position=$id
			newid=$[$previous+1]
			write_pdb $outputfile
		fi
	fi
done
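For comparison, the renumbering loop can probably be done in a single awk pass over the fixed columns. The sketch below is not the script above; it assumes the goal is simply to number residues consecutively from a given start value, and the function name renumber_pdb and the sample records are made up:

```shell
# Hypothetical single-pass alternative: renumber residues consecutively.
# $1 = number for the first residue, $2 = input file; output goes to stdout.
renumber_pdb() {
    awk -v start="$1" '
        /^(ATOM|TER)/ && length($0) >= 26 {
            old = substr($0, 23, 4)              # residue number, columns 23-26
            if (!seen) { cur = start; seen = 1 } # first record gets the start value
            else if (old != prev) cur++          # a new residue begins
            prev = old
            $0 = substr($0, 1, 22) sprintf("%4d", cur) substr($0, 27)
        }
        { print }' "$2"
}

# Sample input (assumed data): two residues followed by TER and END records.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
ATOM      1  N   HSE A  26       3.033 -10.429  -2.262  1.00 17.07           N
ATOM      2  CA  HSE A  26       3.226 -11.674  -3.040  1.00 14.73           C
ATOM     20  N   PRO A  27       1.965 -13.801  -2.950  1.00 14.19           N
TER      34      PRO A  27
END
EOF

renumbered=$(renumber_pdb 1 "$tmp")
printf '%s\n' "$renumbered"
rm -f "$tmp"
```

Lines that are not ATOM or TER records pass through unchanged. The sketch ignores chain identifiers and insertion codes, so files with several chains would need an extra check.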

You could try this for the write_pdb():

function write_pdb() {
    echo $line | awk -vF=$id -vT=$newid '
        substr($0,23,4)+0==F{$0=substr($0,1,22) sprintf("%4d",T) substr($0,27)}1' >> $1
}

Thank you very much, it works like a charm. Since I am new to awk, could you maybe explain your syntax? As a beginner I find the awk commands hard to read.

In addition, to make it do exactly what I want, I altered it to:

echo "$line" | awk -v F=$id -v T=$newid '
        substr($0,23,4)+0==F{$0=substr($0,1,22) sprintf("%4d",T) substr($0,27)}1' >> $1

with quotes around $line to keep the formatting, so the replacement lands in the right position, and a space after -v when setting a variable, because that is what my version of awk requires.

thank you again and have a nice day

Yes, sorry about the missing quotes around $line. I posted the suggestion "on the run" and didn't have time to try it out.

Here is a quick explanation of what is happening.

$0 = Current input line
substr(string, start, len) = Extracts a substring starting at character position start with a length of len. If len is not supplied, it runs to the end of the string.

substr($0,23,4)+0==F
This gets chars 23-26 of the input line.
Adds zero (this converts the value to a number, which is needed because " 21" <> 21).
Compares this value with variable F; if it's a match, the code in {} is processed.

{$0=substr($0,1,22) sprintf("%4d",T) substr($0,27)}
This rebuilds the input line with chars 1-22 + T (padded to 4 chars wide) + chars 27-end

1
This value is not zero, and in awk that is a true (not false) expression, so the code in {} that follows is executed. In this case no code follows, so the default action is performed, which is "print the input line". Note that we may have changed the input line in the statement above (or we might not have, depending on whether chars 23-26 of the input line matched F).
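Both points can be tried in isolation. A short sketch with made-up input (the variable names coerce and all are only for the demonstration):

```shell
# " 21" compared with the string "21" does not match; adding 0 forces a
# numeric comparison, which does.
coerce=$(printf '%s\n' ' 21' |
    awk '{ s = ($0 == "21"); n = ($0 + 0 == 21); print s, n }')
echo "$coerce"

# A bare 1 is a pattern that is always true and has no action,
# so every input line is printed unchanged.
all=$(printf 'a\nb\n' | awk '1')
echo "$all"
```

The first command prints 0 for the string comparison and 1 for the numeric one; the second prints both input lines.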

This could also be expanded to:

awk -v F="$id" -v T="$newid" '
{
  id=substr($0,23,4)
  sub(/^ */,"",id)
  if(id == F) print substr($0,1,22) sprintf("%4d",T) substr($0,27)
  else print $0
}'

Because you're using bash, you could also use bash substrings (remember the index starts at zero, not 1). This solution will be much faster, as it doesn't need to load and execute awk for each line processed:

function write_pdb() {
    ln=${line:22:4}
    ln=${ln##* }
    [ "$ln" = "$id" ] && line="${line:0:22}$(printf "%4d" $newid)${line:26}"
    echo "$line" >> $1
}
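A usage sketch for the bash version, assuming bash is the shell. The function is repeated here (with quoting added) so the example runs on its own; the sample input, the temporary file and the values of id and newid are assumptions. Reading with IFS= and -r keeps each line's spacing and backslashes intact:

```shell
write_pdb() {
    local ln=${line:22:4}      # columns 23-26 (bash offsets are 0-based)
    ln=${ln##* }               # strip leading spaces
    [ "$ln" = "$id" ] && line="${line:0:22}$(printf '%4d' "$newid")${line:26}"
    echo "$line" >> "$1"
}

id=27 newid=28                 # example values, chosen for illustration
out=$(mktemp)

while IFS= read -r line; do
    case $line in
        ATOM*|TER*) write_pdb "$out" ;;              # residue records
        *)          printf '%s\n' "$line" >> "$out" ;; # everything else as-is
    esac
done <<'EOF'
REMARK   sample input, assumed for illustration
ATOM     20  N   PRO A  27       1.965 -13.801  -2.950  1.00 14.19           N
EOF

result=$(cat "$out")
rm -f "$out"
printf '%s\n' "$result"
```

Like the original function, write_pdb works on the global variables line, id and newid, so they must be set before each call.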

you are awesome man! thank you!