Removal of extra spaces in *.log files to allow extraction of frequencies

wsuchem · April 21, 2013, 7:55pm

Our university has upgraded its version of a computational chemistry program that our group uses quite regularly. In the past we have been able to extract frequency spectra from log files that are generated. Since the upgrade, the viewing program errors out. I've been able to trace down the changes between the old and new log file formats. The new program adds two extra spaces to the following lines:

  Atom  AN X Y Z X Y Z X Y Z
     1 30 0.00 0.00 0.03 0.09 -0.02 0.04 0.10 -0.01 -0.03
     2 8 0.03 -0.01 -0.12 -0.14 -0.08 -0.06 -0.14 -0.05 0.06

One space before Atom and one of the two spaces between Atom and AN need to be removed, so that only a single space precedes each. On the lines that follow there are five spaces where there should be only three before the atom number that starts each line. The log file breaks the frequencies up into blocks of three columns, so depending on how complex the molecular system is, correcting this by hand becomes an impossible job across thousands of files. Once the format is corrected, everything opens fine and the frequencies can be extracted.

The software developer is aware of this issue but advises generating a special portable file that would normally be used to transport data across platforms.

The data above has lost the tabs between its columns, because the formatting was removed when I copied and pasted the contents. The header and atom lines shown above occur multiple times, each followed by a varying number of atom lines, and the entire data set ends with a blank line where the thermochemistry data starts. Since these job files are already processed by a bash script to extract thermodynamics and electronic energies, I thought it would be fairly simple to incorporate any new commands into the Torque execution script. Any help would be greatly appreciated.

zozoo · April 21, 2013, 8:05pm

Hi, can you post an exact sample of the data and the output you are looking for, using code tags? That would help in solving your problem.

wsuchem · April 21, 2013, 8:15pm

The following was copied and pasted directly out of the log file, but all of the tabs and spaces were removed in transit for some reason. I put the data in code blocks and added back the spaces that I need edited. If the whole data block needs to be reformatted manually, let me know.

1 2 3
A A A
Frequencies -- 73.6186 95.0148 177.9910
Red. masses -- 2.5506 3.7026 3.3055
Frc consts -- 0.0081 0.0197 0.0617
IR Inten -- 7.9374 9.9457 8.1890
  Atom  AN X Y Z X Y Z X Y Z
     1 30 0.00 0.00 0.02 -0.07 -0.08 0.00 0.04 -0.02 0.00
     2 8 0.00 0.00 -0.11 0.28 0.09 0.00 0.12 -0.01 0.00
     3 8 0.00 0.00 0.25 -0.06 0.00 0.00 -0.13 0.24 0.00
     4 1 0.00 0.00 -0.07 -0.12 -0.07 0.00 0.40 -0.10 0.00
     5 6 0.00 0.00 -0.19 0.00 0.24 0.00 -0.24 -0.14 0.00
     6 1 0.00 0.00 -0.23 0.32 0.36 0.00 0.14 0.14 0.00
     7 1 0.00 0.00 -0.89 0.21 0.30 0.00 -0.58 -0.24 0.00
     8 1 0.00 0.00 0.12 -0.15 0.39 0.00 0.02 -0.40 0.00
     9 1 0.00 0.00 0.18 0.53 -0.04 0.00 0.26 -0.08 0.00
4 5 6
A A A
Frequencies -- 231.0559 251.4928 255.6673
Red. masses -- 2.8839 1.1192 1.0754
Frc consts -- 0.0907 0.0417 0.0414
IR Inten -- 82.8162 113.2879 160.2404
  Atom  AN X Y Z X Y Z X Y Z
     1 30 0.10 -0.01 0.00 0.00 0.00 0.00 0.00 0.00 -0.01
     2 8 -0.03 0.13 0.00 0.00 0.00 0.06 0.00 0.00 0.06
     3 8 -0.18 -0.13 0.00 0.00 0.00 0.05 0.00 0.00 0.01
     4 1 -0.85 0.21 0.00 0.00 0.00 -0.29 0.00 0.00 0.09
     5 6 -0.14 0.01 0.00 0.00 0.00 -0.04 0.00 0.00 0.00
     6 1 -0.04 0.02 0.00 0.00 0.00 -0.75 0.00 0.00 0.01
     7 1 -0.01 0.05 0.00 0.00 0.00 0.31 0.00 0.00 -0.27
     8 1 -0.25 0.12 0.00 0.00 0.00 -0.46 0.00 0.00 0.27
     9 1 -0.15 0.19 0.00 0.00 0.00 -0.18 0.00 0.00 -0.92

hanson44 · April 21, 2013, 8:19pm

Does the original file have tabs? Or just space characters?

The space characters will transfer fine with copy / paste.

The tab characters I'm not sure.

wsuchem · April 21, 2013, 8:24pm

They are all spaces. There aren't any tabs that I can find. I have been grabbing the text with WinSCP's internal editor. WordPad also shows everything aligned via spaces.

hanson44 · April 21, 2013, 8:38pm

Take a look at this and see if I am following your logic correctly:

$ cat atoms.txt
  Atom  AN X Y Z X Y Z X Y Z
     1 30 0.00 0.00 0.02 -0.07 -0.08 0.00 0.04 -0.02 0.00
     2 8 0.00 0.00 -0.11 0.28 0.09 0.00 0.12 -0.01 0.00
     3 8 0.00 0.00 0.25 -0.06 0.00 0.00 -0.13 0.24 0.00
     4 1 0.00 0.00 -0.07 -0.12 -0.07 0.00 0.40 -0.10 0.00
     5 6 0.00 0.00 -0.19 0.00 0.24 0.00 -0.24 -0.14 0.00
     6 1 0.00 0.00 -0.23 0.32 0.36 0.00 0.14 0.14 0.00
     7 1 0.00 0.00 -0.89 0.21 0.30 0.00 -0.58 -0.24 0.00
     8 1 0.00 0.00 0.12 -0.15 0.39 0.00 0.02 -0.40 0.00
     9 1 0.00 0.00 0.18 0.53 -0.04 0.00 0.26 -0.08 0.00

$ sed -e "s/ Atom /Atom/" -e "s/^     /   /" atoms.txt
 Atom AN X Y Z X Y Z X Y Z
   1 30 0.00 0.00 0.02 -0.07 -0.08 0.00 0.04 -0.02 0.00
   2 8 0.00 0.00 -0.11 0.28 0.09 0.00 0.12 -0.01 0.00
   3 8 0.00 0.00 0.25 -0.06 0.00 0.00 -0.13 0.24 0.00
   4 1 0.00 0.00 -0.07 -0.12 -0.07 0.00 0.40 -0.10 0.00
   5 6 0.00 0.00 -0.19 0.00 0.24 0.00 -0.24 -0.14 0.00
   6 1 0.00 0.00 -0.23 0.32 0.36 0.00 0.14 0.14 0.00
   7 1 0.00 0.00 -0.89 0.21 0.30 0.00 -0.58 -0.24 0.00
   8 1 0.00 0.00 0.12 -0.15 0.39 0.00 0.02 -0.40 0.00
   9 1 0.00 0.00 0.18 0.53 -0.04 0.00 0.26 -0.08 0.00

The first substitution removes one space before and one space after "Atom".
The second substitution changes five blanks at the beginning of a line to three blanks.

wsuchem · April 21, 2013, 9:23pm

Your set of commands performs the necessary corrections perfectly. Now I need a command set that can be put into a bash script and will search through the log file and make the corrections automatically in the log file so that when it is opened it has the correct formatting.

hanson44 · April 21, 2013, 9:40pm

You can put these two lines in any shell script, and it will correct the formatting:

$ cat atoms.sh
sed -e "s/ Atom  AN/Atom AN/" -e "s/^     /   /" atoms.txt > /tmp/temp.x
mv /tmp/temp.x atoms.txt

I changed the first substitution a little to ensure that the correction is only applied once. So if the log file gets corrected once, and then new text is appended to the end, the previously corrected section will not get "corrected again".

Are you sure there are no other lines that start with five blanks? As the script is right now, any line that starts with five blanks will be switched to start with three blanks.
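
To see why that question matters, here is a minimal sketch that runs the same two substitutions on three made-up lines: a header, an atom row, and an unrelated row that also starts with five spaces. The sample lines and the variable names are assumptions for illustration only, not taken from a real log.

```shell
#!/bin/bash
# Made-up sample (not from a real log): header, atom row, unrelated row.
sample=$(mktemp)
printf '%s\n' \
  '  Atom  AN X Y Z' \
  '     1 30 0.00 0.00 0.02' \
  '     Eigenvalues --- 0.00208 0.00230' > "$sample"

# Same two substitutions as in atoms.sh, written to stdout only.
fixed=$(sed -e 's/ Atom  AN/Atom AN/' -e 's/^     /   /' "$sample")
printf '%s\n' "$fixed"

# Count rows that now start with exactly three spaces.
three=$(printf '%s\n' "$fixed" | grep -c '^   [^ ]')
echo "rows re-indented to three spaces: $three"
rm -f "$sample"
```

Both the atom row and the Eigenvalues row come out with three leading spaces, which is harmless only if the log contains no such unrelated lines.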

wsuchem · April 21, 2013, 10:12pm

Actually, there are lines throughout the log file that have five spaces prior to the data. I was wondering if there was a way to set the initial search string such as the __Atom__AN where the underscores are spaces and then successive line corrections for the atom numbers that have five spaces in front of them.

hanson44 · April 21, 2013, 10:24pm

What do the other lines look like that have five spaces at the start? If you are referring to the data lines you already sent, there is no problem (I think), because my understanding is that you want those lines to end up with three spaces. And that is exactly what the sed command does.

In other words, are there any lines that start with five spaces that you do NOT want to change? If so, then could you post what they look like?

There is a way to do what you ask (start the changes at the line with Atom AN), but it is more complex, and there is no point doing that if it is not needed.

wsuchem · April 21, 2013, 10:34pm

The following are just a few examples:

Integral symmetry usage will be decided dynamically.
     1530128 words used for storage of precomputed grid.
Keep R1 ints in memory in canonical form, NReq=44454758.

 ITU= 1 0 0
     Eigenvalues --- 0.00208 0.00230 0.01092 0.01339 0.01389
     Eigenvalues --- 0.02487 0.03769 0.03800 0.04351 0.10382
     Eigenvalues --- 0.14831 0.15999 0.16000 0.16012 0.16016
     Eigenvalues --- 0.34023 0.34720 0.48005 0.53596 0.85053
     Eigenvalues --- 1000.00000

     1 Zn 0.000000
     2 O 2.267745 0.000000
     3 O 3.157518 5.392657 0.000000
     4 H 2.183471 4.414552 0.979583 0.000000
     5 C 3.739606 5.989831 1.243183 1.863198 0.000000

hanson44 · April 21, 2013, 11:03pm

Here is an alternative that preserves the other lines starting with five spaces. The following two lines can be added to any shell script. The first substitution is the same as before (_Atom__AN -> Atom_AN, where each underscore stands for a space). The second substitution makes the 5->3 blank change only within a range of lines: the range starts at the Atom_AN line (where the substitution has no effect, because that line no longer starts with five spaces) and ends with the next line that does NOT start with a blank. "^     " (a caret followed by five spaces) matches five spaces at the start of a line. "[^ ]" matches a single character that is NOT a blank.

$ cat ./atoms.sh
sed "s/ Atom  AN /Atom AN /" atoms.txt > /tmp/t1.x
sed "/Atom AN /,/^[^ ]/ {s/^     /   /}" /tmp/t1.x > atoms.txt

$ cat atoms.txt
1 2 3
A A A
Frequencies -- 73.6186 95.0148 177.9910
Red. masses -- 2.5506 3.7026 3.3055
Frc consts -- 0.0081 0.0197 0.0617
IR Inten -- 7.9374 9.9457 8.1890
  Atom  AN X Y Z X Y Z X Y Z
     1 30 0.00 0.00 0.02 -0.07 -0.08 0.00 0.04 -0.02 0.00
     2 8 0.00 0.00 -0.11 0.28 0.09 0.00 0.12 -0.01 0.00
     3 8 0.00 0.00 0.25 -0.06 0.00 0.00 -0.13 0.24 0.00
4 5 6
A A A
Frequencies -- 231.0559 251.4928 255.6673
Red. masses -- 2.8839 1.1192 1.0754
Frc consts -- 0.0907 0.0417 0.0414
IR Inten -- 82.8162 113.2879 160.2404
  Atom  AN X Y Z X Y Z X Y Z
     1 30 0.10 -0.01 0.00 0.00 0.00 0.00 0.00 0.00 -0.01
     2 8 -0.03 0.13 0.00 0.00 0.00 0.06 0.00 0.00 0.06
     3 8 -0.18 -0.13 0.00 0.00 0.00 0.05 0.00 0.00 0.01
Integral symmetry usage will be decided dynamically.
     1530128 words used for storage of precomputed grid.
Keep R1 ints in memory in canonical form, NReq=44454758.
     1 Zn 0.000000
     2 O 2.267745 0.000000
     3 O 3.157518 5.392657 0.000000
     4 H 2.183471 4.414552 0.979583 0.000000
     5 C 3.739606 5.989831 1.243183 1.863198 0.000000
ITU= 1 0 0
     Eigenvalues --- 0.00208 0.00230 0.01092 0.01339 0.01389
     Eigenvalues --- 0.02487 0.03769 0.03800 0.04351 0.10382
     Eigenvalues --- 0.14831 0.15999 0.16000 0.16012 0.16016
     Eigenvalues --- 0.34023 0.34720 0.48005 0.53596 0.85053
     Eigenvalues --- 1000.00000

$ ./atoms.sh

$ cat atoms.txt
1 2 3
A A A
Frequencies -- 73.6186 95.0148 177.9910
Red. masses -- 2.5506 3.7026 3.3055
Frc consts -- 0.0081 0.0197 0.0617
IR Inten -- 7.9374 9.9457 8.1890
 Atom AN X Y Z X Y Z X Y Z
   1 30 0.00 0.00 0.02 -0.07 -0.08 0.00 0.04 -0.02 0.00
   2 8 0.00 0.00 -0.11 0.28 0.09 0.00 0.12 -0.01 0.00
   3 8 0.00 0.00 0.25 -0.06 0.00 0.00 -0.13 0.24 0.00
4 5 6
A A A
Frequencies -- 231.0559 251.4928 255.6673
Red. masses -- 2.8839 1.1192 1.0754
Frc consts -- 0.0907 0.0417 0.0414
IR Inten -- 82.8162 113.2879 160.2404
 Atom AN X Y Z X Y Z X Y Z
   1 30 0.10 -0.01 0.00 0.00 0.00 0.00 0.00 0.00 -0.01
   2 8 -0.03 0.13 0.00 0.00 0.00 0.06 0.00 0.00 0.06
   3 8 -0.18 -0.13 0.00 0.00 0.00 0.05 0.00 0.00 0.01
Integral symmetry usage will be decided dynamically.
     1530128 words used for storage of precomputed grid.
Keep R1 ints in memory in canonical form, NReq=44454758.
     1 Zn 0.000000
     2 O 2.267745 0.000000
     3 O 3.157518 5.392657 0.000000
     4 H 2.183471 4.414552 0.979583 0.000000
     5 C 3.739606 5.989831 1.243183 1.863198 0.000000
ITU= 1 0 0
     Eigenvalues --- 0.00208 0.00230 0.01092 0.01339 0.01389
     Eigenvalues --- 0.02487 0.03769 0.03800 0.04351 0.10382
     Eigenvalues --- 0.14831 0.15999 0.16000 0.16012 0.16016
     Eigenvalues --- 0.34023 0.34720 0.48005 0.53596 0.85053
     Eigenvalues --- 1000.00000
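
The two sed calls can also be combined into a single pass, which avoids the intermediate file. This is only a sketch under the same assumptions as the sample above: each block is closed by a line that starts with a non-blank character. fix_freq_spacing is a hypothetical helper name, and the demo lines are trimmed from the thread's sample. An empty line does not match ^[^ ], so if a block in the real log is followed by an empty line instead, the end address would need adjusting.

```shell
#!/bin/bash
# fix_freq_spacing FILE: print FILE with the frequency blocks re-spaced.
# (Hypothetical helper name; the two expressions mirror atoms.sh.)
fix_freq_spacing() {
  # 1st expression: " Atom  AN " -> "Atom AN " (drops one space on each side).
  # 2nd expression: from the header line to the next line that starts with a
  #                 non-blank character, turn five leading spaces into three.
  sed -e 's/ Atom  AN /Atom AN /' \
      -e '/Atom AN /,/^[^ ]/s/^     /   /' "$1"
}

# Small demo in the same shape as the data above.
demo=$(mktemp)
printf '%s\n' \
  'Frequencies -- 73.6186 95.0148 177.9910' \
  '  Atom  AN X Y Z X Y Z X Y Z' \
  '     1 30 0.00 0.00 0.02 -0.07 -0.08 0.00 0.04 -0.02 0.00' \
  '4 5 6' \
  '     Eigenvalues --- 0.00208 0.00230 0.01092 0.01339 0.01389' > "$demo"

fix_freq_spacing "$demo"
rm -f "$demo"
```

Within one sed invocation the range address is tested after the first substitution has already rewritten the header, which is why the pattern "Atom AN " with a single space still finds it.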

wsuchem · April 21, 2013, 11:29pm

When I put both lines of code into a script file, the part of the code after 'sed' shows up in a red font. I was under the impression that if the code displays red then it is 'broken'? I may not understand the color schemes very clearly. My script starts with #!/bin/bash, and when I source the script it doesn't do anything. I put the actual log file name into the code and I also tried *.log, but neither produced the modified log file.

hanson44 · April 21, 2013, 11:56pm

"if the code displays red then it is 'broken'?"

Don't worry about the editor color schemes. Your editor doesn't know anything about sed.

"when I source the script it doesn't do anything"

You need to be more clear. Please copy / paste exactly what you did, with the code tags.

"I also tried *.log"

Do not do that. Just use the name of the log file.

"neither produced the modified log file"

Again, you need to copy / paste what you did. That's the only way I can tell anything.

A word of advice: Make a copy of the log file, and work on the copy. If you make a mistake, you can damage the log file.
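
A sed command line with *.log and a single output redirection would write every file's lines into one output, so a glob does not fit there directly. For many files, one possibility is a loop that handles each name in turn and keeps a copy first, as advised above. This is a hedged sketch: fix_logs_in_dir and the .save suffix are hypothetical choices, and it assumes the jobs have finished writing their logs.

```shell
#!/bin/bash
# fix_logs_in_dir DIR (hypothetical helper): correct every DIR/*.log,
# keeping the untouched original next to it as NAME.log.save.
fix_logs_in_dir() {
  for log in "$1"/*.log; do
    [ -e "$log" ] || continue           # the glob matched nothing
    [ -e "$log.save" ] && continue      # already processed earlier, skip
    cp -p "$log" "$log.save" || return 1
    # Read the saved copy and rewrite the log itself.
    sed -e 's/ Atom  AN /Atom AN /' \
        -e '/Atom AN /,/^[^ ]/s/^     /   /' "$log.save" > "$log" || return 1
  done
}
```

Calling it as fix_logs_in_dir . from the job directory would process the logs there. Because files that already have a .save copy are skipped, a second run leaves them alone.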

wsuchem · April 22, 2013, 12:18am

I found the issue. I had made a mistake between the first and second lines. There was supposed to be a space in front of Atom, and I made the correction on the first line but not the second. I was also missing the #!. The other question I have concerns this temporary file. Is there going to be a problem with read/write access for users other than root?

hanson44 · April 22, 2013, 12:31am

Yes, you could put rm /tmp/t1.x after using the temporary file, to allay any concerns about some file cluttering up the tmp directory.
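
Another option for the temporary file is to let mktemp choose a unique name, so that two users running the script at the same time do not share /tmp/t1.x. The sketch below assumes mktemp is installed (it is common on Linux but not required by POSIX); fix_log_in_place is a hypothetical helper name.

```shell
#!/bin/bash
# fix_log_in_place FILE (hypothetical helper): same two sed expressions,
# but with a private temporary file instead of the fixed name /tmp/t1.x.
fix_log_in_place() {
  tmp=$(mktemp) || return 1    # unique file, created for this user only
  # Only overwrite FILE if sed succeeded; cat keeps FILE's owner and mode.
  sed -e 's/ Atom  AN /Atom AN /' \
      -e '/Atom AN /,/^[^ ]/s/^     /   /' "$1" > "$tmp" &&
    cat "$tmp" > "$1"
  status=$?
  rm -f "$tmp"                 # always clean up the temporary file
  return "$status"
}
```

Writing the result back with cat rather than mv is a design choice here: the log keeps its original ownership and permissions, at the cost of one extra copy.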

Here is something else I would suggest. Save a copy of the log file first, for example with the command cp atoms.log atoms.save. Then run the script to make the changes, create the corrected atoms.log file, and do the processing. Finally, run diff atoms.log atoms.save and scroll through the output to verify that it only changed the lines you wanted it to change and that there was no unexpected side effect.
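
For thousands of files, scrolling through diff output by eye may not be practical. As a rough sketch, the check could be scripted: diff -b treats runs of blanks as equal, so it should report nothing if only spacing changed. check_fix is a hypothetical helper name, and the message texts are made up.

```shell
#!/bin/bash
# check_fix ORIGINAL CORRECTED (hypothetical helper)
check_fix() {
  # With -b, runs of blanks compare equal, so a non-zero status here
  # means something other than the amount of spacing differs.
  if ! diff -b "$1" "$2" > /dev/null; then
    echo "non-whitespace differences found" >&2
    return 1
  fi
  # Without -b, every '>' line is a line whose spacing was changed.
  echo "re-spaced lines: $(diff "$1" "$2" | grep -c '^>')"
}
```

A call such as check_fix atoms.save atoms.log would then print a count on success and return a non-zero status otherwise. This does not show which lines changed, so a plain diff on a few files is still worth a look.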

wsuchem · April 22, 2013, 12:45am

Everything is spot on with the 'diff' between the original and corrected file. Thank you for your time and effort. Hopefully our efforts can help others with the same problem.