Hello,
I have some text data in the form of multi-line records. Each record ends with the string $$$$ and the next record starts on the next line.
RDKit 2D
15 14 0 0 0 0 0 0 0 0999 V2000
5.4596 2.1267 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.5214 0.6279 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.2543 -0.1749 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.3161 -1.6737 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.0491 -2.4765 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.9255 0.5209 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.6585 -0.2819 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
0.3296 0.4139 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.9374 -0.3889 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.2662 0.3069 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.3280 1.8057 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-3.5333 -0.4959 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-4.8621 0.1999 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-6.1291 -0.6029 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-7.4580 0.0929 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0
2 3 1 0
3 4 1 0
4 5 1 0
3 6 1 0
6 7 1 0
7 8 1 0
8 9 1 0
9 10 1 0
10 11 1 0
10 12 1 0
12 13 1 0
13 14 1 0
14 15 1 0
M END
> <id>
1
> <name>
N1-(2-ethylbutyl)hexane-1,3,6-triamine
> <ID>
118903148
$$$$
What I need to do is find the value of the name field and copy it to the first line of the record. In the case above, I would pass "name" to the script, and the script would find the value on the line after > <name> and write it to the first line of the record.
N1-(2-ethylbutyl)hexane-1,3,6-triamine
RDKit 2D
15 14 0 0 0 0 0 0 0 0999 V2000
5.4596 2.1267 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.5214 0.6279 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.2543 -0.1749 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.3161 -1.6737 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.0491 -2.4765 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.9255 0.5209 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.6585 -0.2819 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
0.3296 0.4139 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.9374 -0.3889 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.2662 0.3069 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.3280 1.8057 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-3.5333 -0.4959 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-4.8621 0.1999 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-6.1291 -0.6029 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-7.4580 0.0929 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0
2 3 1 0
3 4 1 0
4 5 1 0
3 6 1 0
6 7 1 0
7 8 1 0
8 9 1 0
9 10 1 0
10 11 1 0
10 12 1 0
12 13 1 0
13 14 1 0
14 15 1 0
M END
> <id>
1
> <name>
N1-(2-ethylbutyl)hexane-1,3,6-triamine
> <ID>
118903148
$$$$
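For illustration, the lookup described above can be sketched on its own, separate from the full script. This is only a minimal sketch: get_field_value is a hypothetical helper name, and it assumes the tag line is exactly "> <FIELD>" with a single space and that the value sits on the very next line.

```bash
#!/bin/bash
# Hypothetical helper (not part of the script in this post).
# Reads one record on stdin and prints the line that follows "> <FIELD>".
# Assumes the tag line matches exactly, including case and spacing.
get_field_value() {
    local field=$1 line grab=0
    local tag="> <${field}>"
    while IFS= read -r line; do
        if [ "$grab" = 1 ]; then
            # this is the line right after the tag: the field value
            printf '%s\n' "$line"
            return 0
        fi
        if [ "$line" = "$tag" ]; then
            grab=1
        fi
    done
    return 1   # tag not found, or no line after it
}
```

It could be called as `get_field_value name < one_record.sdf`. The match is case-sensitive, so "id" and "ID" are treated as different fields, which matters for the record shown above.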
I wrote the script below to do that and it does work. It takes about 30 seconds to process a file of 500 records and that is a bit slow.
This script reads through the input file, adding each line to an array until $$$$ is found. Along the way, it checks each line to see if it is > <name>. If it is, the next line is saved.
When the end of the record is reached, the name is printed to the output file, followed by the lines of data that were stored in the array. The first line in the array is skipped to bypass writing the blank line at the start of the record. The $$$$ terminator is also added. The array and name are then cleared and the next record is processed.
#!/bin/bash
# input file name
input_file=$1
# attribute field tag to use for name line
name_field=$2
# output file name
output_file=$3
# create empty output file
touch $output_file
# declare array for individual sdf record
declare -a sdf_record
# create both possible versions of attribute tag value
name_string_1='> <'$name_field'>'
name_string_2='> <'$name_field'>'
# initialize
temp_name=''; save_next_line='0'
# set input field separator to empty so that read preserves leading and trailing spaces
IFS=''
# loop through input file
while read line; do
    # test if line is last line of record, if not add line to temp record array
    if [ "$line" != "\$\$\$\$" ]; then
        # add each line to sdf record
        sdf_record=("${sdf_record[@]}" "$line")
        # check if this line has been marked to save for the name string
        if [ "$save_next_line" == "1" ]; then
            # save name and reset indicator
            temp_name=$line; save_next_line='0'
        # check if this is the name tag line, check both versions of the tag
        elif [[ "$line" == "$name_string_1" ]] || [[ "$line" == "$name_string_2" ]]; then
            # set marker to collect the next line for the name string
            save_next_line='1'
        fi
    # when the $$$$ record terminator is reached, print the record adding the name line
    else
        # add the record termination string $$$$ as the last line of the temp record
        sdf_record=("${sdf_record[@]}" "\$\$\$\$")
        # add the name field to the start of the record
        echo -e $temp_name >> $output_file
        # append the rest of the record lines stored in the array to the output file
        # this skips the first line, which is replaced by the name above
        for record_line in "${sdf_record[@]:1}"
        do
            echo -e $record_line >> $output_file
        done
        # clear the current sdf record and name
        unset sdf_record; temp_name=''
    fi
done < $input_file
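Two things in the loop above are likely to cost time: every `echo ... >> $output_file` reopens the output file, and `sdf_record=("${sdf_record[@]}" "$line")` copies the whole array for each line added. A minimal sketch of the same logic without those two costs is shown here. The function name add_name_line is hypothetical, and it assumes the output is redirected once by the caller instead of being appended to line by line.

```bash
#!/bin/bash
# Sketch only: add_name_line is a hypothetical name, not the original script.
# Usage: add_name_line FIELD < input.sdf > output.sdf
add_name_line() {
    local tag="> <$1>"
    local -a record=()
    local name='' grab=0 line
    # IFS= and -r keep leading spaces and backslashes intact
    while IFS= read -r line; do
        if [ "$line" = '$$$$' ]; then
            # print the name, the record minus its original first line,
            # and the terminator, all with a single printf call
            printf '%s\n' "$name" "${record[@]:1}" '$$$$'
            record=(); name=''; grab=0
            continue
        fi
        record+=("$line")      # append in place, no array copy
        if [ "$grab" = 1 ]; then
            name=$line; grab=0
        elif [ "$line" = "$tag" ]; then
            grab=1
        fi
    done
}
```

With this shape the call would look like `add_name_line name < input.sdf > output.sdf`, so the output file is opened once for the whole run. Using printf with '%s\n' also avoids `echo -e` interpreting backslashes that might appear in a data line.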
It is possible that there could already be a name on the first line, and the solution above takes care of that by replacing it. This also allows any available field to be used for the "name". At this point, the script does not handle the case where the name field is not found in a record.
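One way the missing-field case might be trapped is with a separate checking pass before any rewriting. The following is a sketch under that assumption; report_missing_field is a hypothetical helper name, and it only looks for an exact "> <FIELD>" tag line in each record.

```bash
#!/bin/bash
# Sketch only: report_missing_field is a hypothetical helper.
# Reads SDF text on stdin, prints the 1-based number of each record
# that has no "> <FIELD>" line, and returns 1 if any were found.
report_missing_field() {
    local tag="> <$1>"
    local line found=0 missing=0 n=1
    while IFS= read -r line; do
        if [ "$line" = '$$$$' ]; then
            # end of record: report it if the tag was never seen
            if [ "$found" = 0 ]; then
                printf '%s\n' "$n"
                missing=1
            fi
            found=0
            n=$((n + 1))
        elif [ "$line" = "$tag" ]; then
            found=1
        fi
    done
    return "$missing"
}
```

A caller could run `report_missing_field name < input.sdf` first and stop, or fall back to the existing first line, when it returns non-zero.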
As with most of the things I write on my own, it works but is very slow.
Any suggestions that could speed this up, make it more sound, etc, would be very much appreciated.
LMHmedchem