Our university has upgraded its version of a computational chemistry program that our group uses regularly. In the past we were able to extract frequency spectra from the log files it generates. Since the upgrade, the viewing program errors out. I have tracked down the difference between the old and new log file formats. The new program adds two extra spaces to the following lines:
Atom AN X Y Z X Y Z X Y Z
1 30 0.00 0.00 0.03 0.09 -0.02 0.04 0.10 -0.01 -0.03
2 8 0.03 -0.01 -0.12 -0.14 -0.08 -0.06 -0.14 -0.05 0.06
In the header line there are two spaces before Atom and two spaces between Atom and AN; there should be only one space in each place. On each following line there are five spaces before the atom number, and there should be only three. The log file breaks the frequencies up into columns of three, so for a complex molecular system, fixing this by hand across thousands of files is not practical. Once the format is corrected, everything opens fine and the frequencies can be extracted. The software developer is aware of the issue but advises generating a special portable file that would normally be used to transport data across platforms.
The data above lost the tabs between its columns when I copied and pasted it. There are multiple instances of these lines, each followed by a varying number of atom lines under each column of three, and the entire data set ends with a blank line where the thermochemistry data starts. Since these job files are already processed by a bash script to extract thermodynamics and electronic energies, I thought it would be fairly simple to incorporate any new commands into the Torque execution script. Any help would be greatly appreciated.
The following was copied and pasted directly out of the log file, but all of the tabs and spaces were removed in transit for some reason. I put the data in code blocks and added the spaces that I need edited. If the whole data block needs to be reformatted manually, let me know.
1 2 3
A A A
Frequencies -- 73.6186 95.0148 177.9910
Red. masses -- 2.5506 3.7026 3.3055
Frc consts -- 0.0081 0.0197 0.0617
IR Inten -- 7.9374 9.9457 8.1890
Atom AN X Y Z X Y Z X Y Z
1 30 0.00 0.00 0.02 -0.07 -0.08 0.00 0.04 -0.02 0.00
2 8 0.00 0.00 -0.11 0.28 0.09 0.00 0.12 -0.01 0.00
3 8 0.00 0.00 0.25 -0.06 0.00 0.00 -0.13 0.24 0.00
4 1 0.00 0.00 -0.07 -0.12 -0.07 0.00 0.40 -0.10 0.00
5 6 0.00 0.00 -0.19 0.00 0.24 0.00 -0.24 -0.14 0.00
6 1 0.00 0.00 -0.23 0.32 0.36 0.00 0.14 0.14 0.00
7 1 0.00 0.00 -0.89 0.21 0.30 0.00 -0.58 -0.24 0.00
8 1 0.00 0.00 0.12 -0.15 0.39 0.00 0.02 -0.40 0.00
9 1 0.00 0.00 0.18 0.53 -0.04 0.00 0.26 -0.08 0.00
4 5 6
A A A
Frequencies -- 231.0559 251.4928 255.6673
Red. masses -- 2.8839 1.1192 1.0754
Frc consts -- 0.0907 0.0417 0.0414
IR Inten -- 82.8162 113.2879 160.2404
Atom AN X Y Z X Y Z X Y Z
1 30 0.10 -0.01 0.00 0.00 0.00 0.00 0.00 0.00 -0.01
2 8 -0.03 0.13 0.00 0.00 0.00 0.06 0.00 0.00 0.06
3 8 -0.18 -0.13 0.00 0.00 0.00 0.05 0.00 0.00 0.01
4 1 -0.85 0.21 0.00 0.00 0.00 -0.29 0.00 0.00 0.09
5 6 -0.14 0.01 0.00 0.00 0.00 -0.04 0.00 0.00 0.00
6 1 -0.04 0.02 0.00 0.00 0.00 -0.75 0.00 0.00 0.01
7 1 -0.01 0.05 0.00 0.00 0.00 0.31 0.00 0.00 -0.27
8 1 -0.25 0.12 0.00 0.00 0.00 -0.46 0.00 0.00 0.27
9 1 -0.15 0.19 0.00 0.00 0.00 -0.18 0.00 0.00 -0.92
They are all spaces. There aren't any tabs that I can find. I have been grabbing the text with WinSCP's internal editor. WordPad also shows everything aligned via spaces.
Your set of commands performs the necessary corrections perfectly. Now I need a command set that can be put into a bash script and will search through the log file and make the corrections automatically in the log file so that when it is opened it has the correct formatting.
You can put these two lines in any shell script, and it will correct the formatting:
$ cat atoms.sh
sed -e "s/ Atom  AN/Atom AN/" -e "s/^     /   /" atoms.txt > /tmp/temp.x
mv /tmp/temp.x atoms.txt
I changed the first substitution a little to ensure that the correction is only applied once. So if the log file gets corrected once, and then new text is appended to the end, the previously corrected section will not get "corrected again".
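The "applied only once" behaviour can be checked on a single header line. This is a minimal sketch, assuming the pattern is one space, Atom, two spaces, AN, and that the uncorrected header has two spaces before Atom and two before AN; the fix_header name and the sample header are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: apply only the header substitution to stdin.
# Pattern: one space, "Atom", two spaces, "AN" (spacing is an assumption).
fix_header() {
    sed -e 's/ Atom  AN/Atom AN/'
}

# Assumed uncorrected header: two spaces before "Atom", two before "AN".
header='  Atom  AN      X      Y      Z'

once=$(printf '%s\n' "$header" | fix_header)
twice=$(printf '%s\n' "$once" | fix_header)

# Brackets make the leading blanks visible.
printf '[%s]\n' "$once"
printf '[%s]\n' "$twice"
```

Because the pattern requires two spaces between Atom and AN, it no longer matches a header that has already been corrected, so the second pass should leave the line alone.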
Are you sure there are no other lines that start with five blanks? As the script is right now, any line that starts with five blanks will be switched to start with three blanks.
Actually, there are lines throughout the log file that have five spaces prior to the data. Is there a way to anchor the search on a string such as __Atom__AN (where the underscores are spaces) and then correct only the atom-number lines that follow it, which have five spaces in front of them?
What do the other lines look like that have five spaces at the start? If you are referring to the data lines you already sent, there is no problem (I think), because my understanding is that you want those lines to end up with three spaces. And that is exactly what the sed command does.
In other words, are there any lines that start with five spaces that you do NOT want to change? If so, then could you post what they look like?
There is a way to do what you ask (start the changes at the line with Atom AN), but it is more complex, and there is no point doing that if it is not needed.
Integral symmetry usage will be decided dynamically.
1530128 words used for storage of precomputed grid.
Keep R1 ints in memory in canonical form, NReq=44454758.
Here is an alternative that preserves the other lines starting with five spaces. The following two lines can be added to any shell script. The first substitution is the same as before (_Atom__AN -> Atom_AN, where the underscores are spaces). The second substitution makes the 5->3 blank change only within a range of lines: the range starts at the Atom_AN line (where nothing happens, since that line does not begin with five blanks) and ends at the next line that does NOT start with a blank. "^     " means five spaces at the start of a line. "[^ ]" means a single character that is NOT a blank.
$ cat ./atoms.sh
sed "s/ Atom  AN /Atom AN /" atoms.txt > /tmp/t1.x
sed "/Atom AN /,/^[^ ]/ {s/^     /   /}" /tmp/t1.x > atoms.txt
$ cat atoms.txt
1 2 3
A A A
Frequencies -- 73.6186 95.0148 177.9910
Red. masses -- 2.5506 3.7026 3.3055
Frc consts -- 0.0081 0.0197 0.0617
IR Inten -- 7.9374 9.9457 8.1890
Atom AN X Y Z X Y Z X Y Z
1 30 0.00 0.00 0.02 -0.07 -0.08 0.00 0.04 -0.02 0.00
2 8 0.00 0.00 -0.11 0.28 0.09 0.00 0.12 -0.01 0.00
3 8 0.00 0.00 0.25 -0.06 0.00 0.00 -0.13 0.24 0.00
4 5 6
A A A
Frequencies -- 231.0559 251.4928 255.6673
Red. masses -- 2.8839 1.1192 1.0754
Frc consts -- 0.0907 0.0417 0.0414
IR Inten -- 82.8162 113.2879 160.2404
Atom AN X Y Z X Y Z X Y Z
1 30 0.10 -0.01 0.00 0.00 0.00 0.00 0.00 0.00 -0.01
2 8 -0.03 0.13 0.00 0.00 0.00 0.06 0.00 0.00 0.06
3 8 -0.18 -0.13 0.00 0.00 0.00 0.05 0.00 0.00 0.01
Integral symmetry usage will be decided dynamically.
1530128 words used for storage of precomputed grid.
Keep R1 ints in memory in canonical form, NReq=44454758.
1 Zn 0.000000
2 O 2.267745 0.000000
3 O 3.157518 5.392657 0.000000
4 H 2.183471 4.414552 0.979583 0.000000
5 C 3.739606 5.989831 1.243183 1.863198 0.000000
ITU= 1 0 0
Eigenvalues --- 0.00208 0.00230 0.01092 0.01339 0.01389
Eigenvalues --- 0.02487 0.03769 0.03800 0.04351 0.10382
Eigenvalues --- 0.14831 0.15999 0.16000 0.16012 0.16016
Eigenvalues --- 0.34023 0.34720 0.48005 0.53596 0.85053
Eigenvalues --- 1000.00000
$ ./atoms.sh
$ cat atoms.txt
1 2 3
A A A
Frequencies -- 73.6186 95.0148 177.9910
Red. masses -- 2.5506 3.7026 3.3055
Frc consts -- 0.0081 0.0197 0.0617
IR Inten -- 7.9374 9.9457 8.1890
Atom AN X Y Z X Y Z X Y Z
1 30 0.00 0.00 0.02 -0.07 -0.08 0.00 0.04 -0.02 0.00
2 8 0.00 0.00 -0.11 0.28 0.09 0.00 0.12 -0.01 0.00
3 8 0.00 0.00 0.25 -0.06 0.00 0.00 -0.13 0.24 0.00
4 5 6
A A A
Frequencies -- 231.0559 251.4928 255.6673
Red. masses -- 2.8839 1.1192 1.0754
Frc consts -- 0.0907 0.0417 0.0414
IR Inten -- 82.8162 113.2879 160.2404
Atom AN X Y Z X Y Z X Y Z
1 30 0.10 -0.01 0.00 0.00 0.00 0.00 0.00 0.00 -0.01
2 8 -0.03 0.13 0.00 0.00 0.00 0.06 0.00 0.00 0.06
3 8 -0.18 -0.13 0.00 0.00 0.00 0.05 0.00 0.00 0.01
Integral symmetry usage will be decided dynamically.
1530128 words used for storage of precomputed grid.
Keep R1 ints in memory in canonical form, NReq=44454758.
1 Zn 0.000000
2 O 2.267745 0.000000
3 O 3.157518 5.392657 0.000000
4 H 2.183471 4.414552 0.979583 0.000000
5 C 3.739606 5.989831 1.243183 1.863198 0.000000
ITU= 1 0 0
Eigenvalues --- 0.00208 0.00230 0.01092 0.01339 0.01389
Eigenvalues --- 0.02487 0.03769 0.03800 0.04351 0.10382
Eigenvalues --- 0.14831 0.15999 0.16000 0.16012 0.16016
Eigenvalues --- 0.34023 0.34720 0.48005 0.53596 0.85053
Eigenvalues --- 1000.00000
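The range behaviour described above can be tried on a tiny input where the leading blanks are unambiguous. This is only a sketch: it combines the two substitutions into a single sed invocation instead of two passes through a temporary file, the fix_atom_rows name is made up, and the sample lines (including the "End of block" line that closes the range) are invented to show the effect rather than copied from a real log.

```shell
#!/bin/bash
# Hypothetical helper: fix the header, then trim five leading blanks to three,
# but only between an "Atom AN" header and the next line that does not start
# with a blank. Exact spacing in the patterns is an assumption.
fix_atom_rows() {
    sed -e 's/ Atom  AN /Atom AN /' \
        -e '/Atom AN /,/^[^ ]/ s/^     /   /'
}

# Invented sample: a header, two atom rows, a line that ends the range,
# and an unrelated line that also starts with five blanks.
sample=$(printf '%s\n' \
    '  Atom  AN      X      Y      Z' \
    '     1  30     0.00   0.00   0.02' \
    '     2   8     0.00   0.00  -0.11' \
    'End of block' \
    '     1530128 words used for storage of precomputed grid.')

printf '%s\n' "$sample" | fix_atom_rows
```

The two atom rows should come out with three leading blanks, while the last line keeps its five because it falls outside the range. If no line in a real log ever starts with a non-blank character, the range would simply run to the end of the file, so it is worth checking the result with diff.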
When I put both lines of code into a script file, the part after 'sed' is displayed in a red font. I was under the impression that code displayed in red is 'broken', but I may not understand the color scheme very well. My script starts with #!/bin/bash, and when I source it, it doesn't do anything. I put the actual log file name into the code and also tried *.log, but neither produced the modified log file.
I found the issue. I had made a mistake between the first and second lines. There was supposed to be a space in front of Atom, and I made the correction on the first line but not the second. I was also missing the #!. The other question I have concerns the temporary file. Is there going to be a problem with read/write access for users other than root?
Yes, you could put rm /tmp/t1.x after using the temporary file, to allay any concerns about some file cluttering up the tmp directory.
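On the access question, one possible variation is to let mktemp create the temporary file, since it picks a unique name and creates the file readable and writable only by the user running the script. The sketch below assumes that approach; fix_log and the fixlog.XXXXXX template are hypothetical names, and the sed patterns assume the spacing discussed earlier in the thread.

```shell
#!/bin/bash
# Hypothetical wrapper: correct one log file in place via a private temp file.
fix_log() {
    local log=$1 tmp status=0
    # mktemp creates a uniquely named file owned by the calling user.
    tmp=$(mktemp "${TMPDIR:-/tmp}/fixlog.XXXXXX") || return 1
    if sed -e 's/ Atom  AN /Atom AN /' \
           -e '/Atom AN /,/^[^ ]/ s/^     /   /' "$log" > "$tmp"; then
        # Copy the contents back instead of using mv, so the log keeps its
        # original owner and permissions.
        cat "$tmp" > "$log" || status=1
    else
        status=1
    fi
    rm -f "$tmp"   # always remove the temp file
    return "$status"
}
```

It would be called as fix_log atoms.log from the job script, once per log file.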
Here is something else I would suggest. Save a copy of the log file first, for example with cp atoms.log atoms.save. Then run the script to make the changes and create the corrected atoms.log file, and do the processing. Finally, run diff atoms.log atoms.save and scroll through the output to verify that it changed only the lines you wanted to change and that there was no unexpected side effect.
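With thousands of log files, the save, fix, and diff steps could be wrapped in a loop. The following is a rough sketch under the same spacing assumptions as before; fix_and_check and the .save suffix are illustrative names, and it reads from the saved copy so no separate temporary file is needed.

```shell
#!/bin/bash
# Hypothetical helper: back up each log, write the corrected version over the
# original, and report how many lines differ from the backup.
fix_and_check() {
    local log changed
    for log in "$@"; do
        cp -p "$log" "$log.save" || return 1
        sed -e 's/ Atom  AN /Atom AN /' \
            -e '/Atom AN /,/^[^ ]/ s/^     /   /' "$log.save" > "$log" || return 1
        # In diff's default output, lines from the first file start with "<".
        changed=$(diff "$log.save" "$log" | grep -c '^<')
        printf '%s: %s line(s) changed\n' "$log" "$changed"
    done
}
```

Something like fix_and_check *.log would then process every log in the current directory, and any file whose count looks wrong can be inspected with a plain diff against its .save copy.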
Everything is spot on with the 'diff' between the original and corrected file. Thank you for your time and effort. Hopefully our efforts can help others with the same problem.