I am using Cygwin under Windows, but I also run this under openSUSE 13.2.
This is the entire script and it is run with something like,
./_reformat.sh input_file output_file CompoundName Identifier InChI= 015_
It looks for certain conditions and, when they are found, makes some modifications to the record. Most of the files I am processing contain thousands to tens of thousands of records. This is the slow version, which appends each line to the output file as it is processed.
#!/bin/bash
# file to be processed
input_file=$1
# file to write the processed records to
output_file=$2
# sdf tag with name field
name_tag=$3
# sdf tag with substitution field
sub_tag=$4
# string to check for on line following name tag line
check_for=$5
# prefix to add to firstline
prefix=$6
# create output file
touch "$output_file"
# location of line to replace with modified name
replace_line=0
# value collected to build replacement name
sub_value=''
# flag to check next line
check_next=0
# flag to do replacement
replace=0
# flag indicating that the next line should be saved for the sub name
save_next=0
# initialize line counter
i=0
# to preserve spaces
IFS=""
# read file by lines
while IFS= read -r line
do
# store line in array
line_array[$i]="$line"
# increment counter
i=$((i+1))
# if check next was set to 1 above, the next line is the one that needs to be evaluated
if [[ $check_next == "1" ]]; then
# reset check next, do this here so we reset even if the next line is not a match
check_next=0
# check for check_for as part of line
if [[ $line =~ .*$check_for.* ]]; then
# save line number
replace_line=$i
# set flag to do replacement of name
replace=1
fi
fi
# find name tag line and check if value on next line includes check_for string
# check for name_tag as part of line
if [[ $line =~ .*$name_tag.* ]]; then
# set flag to check next line
check_next=1
fi
# save the value in the line after sub tag has been found
# this must come before save_next is set
if [[ $save_next == "1" ]]; then
# save the value from this line to use for substitute name
sub_value=$line
# reset flag
save_next=0
fi
# look for the line with the sub tag
if [[ $line =~ .*$sub_tag.* ]]; then
# set flag to save next line
save_next=1
fi
# when we get to the end of the record
if [[ $line == '$$$$' ]]; then
# if replace has been set, make replacements
if [[ $replace == "1" ]]; then
# create new first line value from stored substitute value
new_firstline=$prefix'PubChem_CID_'$sub_value
# create new name value from stored substitute value
new_name='PubChem_CID_'$sub_value
# decrement replace line value by one
replace_line=$(($replace_line-1))
# decrement line counter value by one
i=$(($i-1))
# loop through stored file
for ((j=0; j <= $i ; j++)) ; do
# for the first line, add the new firstline value
if [[ $j == "0" ]]; then
echo "$new_firstline" >> "$output_file"
# when the replace line is found, use the substitute value
elif [[ $j == "$replace_line" ]]; then
echo "$new_name" >> "$output_file"
# output all other lines as normal
else
echo "${line_array[$j]}" >> "$output_file"
fi
done
# if replace is not set, output unmodified record
else
for ((j=0; j < $i ; j++)) ; do
echo "${line_array[$j]}" >> "$output_file"
done
fi
# reset for next record
# line array
unset line_array
# line counter
i=0
# location of line to replace with modified name
replace_line=0
# value collected to build replacement name
sub_value=''
# flag to check next line
check_next=0
# flag to do replacement
replace=0
# flag indicating that the next line should be saved for the sub name
save_next=0
fi
done < "$input_file"
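One likely contributor to the slowness is that every echo with >> opens and closes the output file again. A minimal sketch of the alternative, where the redirection is attached to the loop so the file is opened only once. The function name copy_lines is hypothetical and used only for illustration, and the per-record logic is left out:

```shell
#!/bin/bash
# Sketch only: copy_lines is an illustrative name, not part of the script above.
copy_lines() {
    local in_file=$1 out_file=$2 line
    # IFS= keeps leading spaces; -r keeps backslashes (SMILES strings can contain them)
    while IFS= read -r line; do
        printf '%s\n' "$line"
    done < "$in_file" > "$out_file"   # output file is opened once for the whole loop
}
```

Note that > truncates the output file rather than appending to it, so the touch step would not be needed with this form.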
This is an example of input with one record that meets the conditions to be changed,
015_InChI=1S/C16H9N3O5/c20-16-11-5-14-13(23-7-24-14)4-10(11)15-17-12-2-1-9(19(21)22)3-8(12)6-18(15)16/h1-5H,6-7H2
OpenBabel05051721102D
24 28 0 0 0 0 0 0 0 0999 V2000
-1.3288 3.5365 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-1.3006 2.0368 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.4974 1.1324 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.9702 1.4167 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-4.9528 0.2833 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-4.4626 -1.1343 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.9897 -1.4185 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.0071 -0.2852 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.5074 -0.2570 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.5170 -1.3527 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
1.9781 -1.0133 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.0026 -2.1090 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.4637 -1.7696 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.9004 -0.3346 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.3615 0.0048 0.0000 N 0 3 0 0 0 0 0 0 0 0 0 0
6.7981 1.4398 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
7.3859 -1.0909 0.0000 O 0 5 0 0 0 0 0 0 0 0 0 0
3.8759 0.7611 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.4148 0.4217 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.3904 1.5174 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.0708 1.1780 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-5.6593 -2.0386 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-6.8892 -1.1799 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-6.4525 0.2552 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1 2 2 0 0 0 0
2 3 1 0 0 0 0
3 4 2 0 0 0 0
4 5 1 0 0 0 0
5 6 2 0 0 0 0
6 7 1 0 0 0 0
6 22 1 0 0 0 0
7 8 2 0 0 0 0
8 9 1 0 0 0 0
8 3 1 0 0 0 0
9 10 2 0 0 0 0
10 11 1 0 0 0 0
11 12 2 0 0 0 0
12 13 1 0 0 0 0
13 14 2 0 0 0 0
14 15 1 0 0 0 0
14 18 1 0 0 0 0
15 16 2 0 0 0 0
15 17 1 0 0 0 0
18 19 2 0 0 0 0
19 20 1 0 0 0 0
19 11 1 0 0 0 0
20 21 1 0 0 0 0
21 2 1 0 0 0 0
21 9 1 0 0 0 0
22 23 1 0 0 0 0
23 24 1 0 0 0 0
24 5 1 0 0 0 0
M CHG 2 15 1 17 -1
M END
> <order>
281
> <CompoundName>
InChI=1S/C16H9N3O5/c20-16-11-5-14-13(23-7-24-14)4-10(11)15-17-12-2-1-9(19(21)22)3-8(12)6-18(15)16/h1-5H,6-7H2
> <Identifier>
101651482
> <InChI>
InChI=1S/C16H9N3O5/c20-16-11-5-14-13(23-7-24-14)4-10(11)15-17-12-2-1-9(19(21)22)3-8(12)6-18(15)16/h1-5H,6-7H2
> <InChIKey>
ZIOMULGFTPCQIY-UHFFFAOYSA-N
> <MolecularFormula>
C16H9N3O5
> <MonoisotopicMass>
323.0542
> <SMILES>
C1C2=C(C=CC(=C2)[N+](=O)[O-])N=C3N1C(=O)C4=CC5=C(C=C43)OCO5
$$$$
I was trying to dump the lines of the file to a new array with the code I first posted, but that didn't work.
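For reference, a minimal sketch of copying one bash array into another. The data and the name copy are made up for illustration, and this is not the code from my earlier attempt:

```shell
#!/bin/bash
# Illustrative data only; these values are not taken from a real record.
line_array=("  leading spaces" "a b  c" "*")

# Quoted [@] expansion copies each element as one word,
# so embedded spaces and glob characters survive unchanged.
copy=("${line_array[@]}")

# Forms that do NOT copy the whole array faithfully:
#   copy=$line_array          # assigns only element 0
#   copy=(${line_array[@]})   # unquoted: subject to word splitting and globbing

printf '%s\n' "${#copy[@]}"
printf '[%s]\n' "${copy[@]}"
```

The copy is re-indexed from 0, which matches how line_array is filled in the script.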
In short, when the value on the line after <CompoundName> contains InChI=, the name value is too long for some of the tools in the chain. I address this by making a new name from the value read from the line following <Identifier> and rewriting the record with the substitute name in the required places. If the line following <CompoundName> does not contain InChI=, then the record is written unmodified.
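The rule described above could also be sketched with awk, buffering one record at a time and printing it when the $$$$ line is reached. This is only a sketch under assumptions: reformat_sdf is a hypothetical function name, the tags are matched as literal substrings, and the output goes to stdout rather than to a named file:

```shell
#!/bin/bash
# Sketch only: reformat_sdf is an illustrative name, not an existing tool.
# args: input_file name_tag sub_tag check_for prefix
reformat_sdf() {
    awk -v name_tag="$2" -v sub_tag="$3" -v check_for="$4" -v prefix="$5" '
    {
        n++
        rec[n] = $0                      # buffer the current record line
        if (check_next) {                # line right after the name tag
            check_next = 0
            if (index($0, check_for)) name_line = n
        }
        if (save_next) {                 # line right after the sub tag
            save_next = 0
            sub_value = $0
        }
        # index() is a literal substring test, so "=" is not treated as a regex
        if (index($0, name_tag)) check_next = 1
        if (index($0, sub_tag))  save_next = 1
    }
    $0 == "$$$$" {                       # end of record
        if (name_line) {
            rec[1] = prefix "PubChem_CID_" sub_value
            rec[name_line] = "PubChem_CID_" sub_value
        }
        for (j = 1; j <= n; j++) print rec[j]
        n = 0; name_line = 0; check_next = 0; save_next = 0; sub_value = ""
    }
    ' "$1"
}
```

It would be called along the lines of reformat_sdf input_file CompoundName Identifier InChI= 015_ > output_file, so the output file is written in a single pass.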
This is what the properly modified version of the record would look like,
015_PubChem_CID_101651482
OpenBabel05051721102D
24 28 0 0 0 0 0 0 0 0999 V2000
-1.3288 3.5365 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-1.3006 2.0368 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.4974 1.1324 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.9702 1.4167 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-4.9528 0.2833 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-4.4626 -1.1343 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.9897 -1.4185 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.0071 -0.2852 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.5074 -0.2570 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.5170 -1.3527 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
1.9781 -1.0133 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.0026 -2.1090 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.4637 -1.7696 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.9004 -0.3346 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.3615 0.0048 0.0000 N 0 3 0 0 0 0 0 0 0 0 0 0
6.7981 1.4398 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
7.3859 -1.0909 0.0000 O 0 5 0 0 0 0 0 0 0 0 0 0
3.8759 0.7611 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.4148 0.4217 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.3904 1.5174 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.0708 1.1780 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-5.6593 -2.0386 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-6.8892 -1.1799 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-6.4525 0.2552 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1 2 2 0 0 0 0
2 3 1 0 0 0 0
3 4 2 0 0 0 0
4 5 1 0 0 0 0
5 6 2 0 0 0 0
6 7 1 0 0 0 0
6 22 1 0 0 0 0
7 8 2 0 0 0 0
8 9 1 0 0 0 0
8 3 1 0 0 0 0
9 10 2 0 0 0 0
10 11 1 0 0 0 0
11 12 2 0 0 0 0
12 13 1 0 0 0 0
13 14 2 0 0 0 0
14 15 1 0 0 0 0
14 18 1 0 0 0 0
15 16 2 0 0 0 0
15 17 1 0 0 0 0
18 19 2 0 0 0 0
19 20 1 0 0 0 0
19 11 1 0 0 0 0
20 21 1 0 0 0 0
21 2 1 0 0 0 0
21 9 1 0 0 0 0
22 23 1 0 0 0 0
23 24 1 0 0 0 0
24 5 1 0 0 0 0
M CHG 2 15 1 17 -1
M END
> <order>
281
> <CompoundName>
PubChem_CID_101651482
> <Identifier>
101651482
> <InChI>
InChI=1S/C16H9N3O5/c20-16-11-5-14-13(23-7-24-14)4-10(11)15-17-12-2-1-9(19(21)22)3-8(12)6-18(15)16/h1-5H,6-7H2
> <InChIKey>
ZIOMULGFTPCQIY-UHFFFAOYSA-N
> <MolecularFormula>
C16H9N3O5
> <MonoisotopicMass>
323.0542
> <SMILES>
C1C2=C(C=CC(=C2)[N+](=O)[O-])N=C3N1C(=O)C4=CC5=C(C=C43)OCO5
$$$$
Sorry for the overly long post. I was trying to solve this myself and thought I just made some syntax error in making a copy of the array.
LMHmedchem