Hello,
I have some large text files (SDF format) that look like this:
putrescine
Mrv1583 01041713302D
6 5 0 0 0 0 999 V2000
2.0928 -0.2063 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
5.6650 0.2063 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3.5217 -0.2063 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.2361 0.2063 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.8072 0.2063 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.9504 -0.2063 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 5 1 0 0 0 0
2 6 1 0 0 0 0
3 4 1 0 0 0 0
3 5 1 0 0 0 0
4 6 1 0 0 0 0
M END
> <num>
1
> <name>
putrescine
$$$$
bis(hexamethylene)triamine.mol
Mrv1583 01041713302D
15 14 0 0 0 0 999 V2000
6.4898 1.0450 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
7.2042 1.4575 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
7.9187 1.0450 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
8.6332 1.4575 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
9.3477 1.0450 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
10.0621 1.4575 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
10.7766 1.0450 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
11.4911 1.4575 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
12.2055 1.0450 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
12.9200 1.4575 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
13.6345 1.0450 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
14.3490 1.4575 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
15.0634 1.0450 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
15.7779 1.4575 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
16.4924 1.0450 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
2 3 1 0 0 0 0
3 4 1 0 0 0 0
4 5 1 0 0 0 0
5 6 1 0 0 0 0
6 7 1 0 0 0 0
7 8 1 0 0 0 0
8 9 1 0 0 0 0
9 10 1 0 0 0 0
10 11 1 0 0 0 0
11 12 1 0 0 0 0
12 13 1 0 0 0 0
13 14 1 0 0 0 0
14 15 1 0 0 0 0
M END
> <num>
2
> <name>
bis(hexamethylene)triamine
$$$$
There can be thousands of records, and records have no fixed length: the number of lines and the number of tag fields between M END and $$$$ both vary. Each record ends with the $$$$ terminator. I am trying to divide a large file into a number of smaller files, each with the same number of records.
This code attempts to do this,
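#! /bin/sh
# input file name
input_file=${1:-input.txt}
# output file name
output_file=${2:-output.txt}
# number of compounds per sdf file
split_number=${3:-6}
# note: "split" is an awk built-in function name, so the variable is called per_file
awk -v per_file="$split_number" '
# buffer every line of the current batch of records
{ OUT[++CNT] = $0 }
# count a record each time the $$$$ terminator is seen
$0 == "$$$$" { ++MOLS }
# batch is full: print the buffered lines in order, then reset
MOLS == per_file { for (i = 1; i <= CNT; i++) print OUT[i]; delete OUT; CNT = 0; MOLS = 0 }
# print any partial batch left at end of file
END { for (i = 1; i <= CNT; i++) print OUT[i] }
' "$input_file" > "$output_file"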
by storing rows in OUT[] until a counter reaches the desired number of records per subfile, then printing the rows, clearing the array, and resetting the counter. The END block also tries to handle the case where EOF is reached before the counter hits that number.
The obvious problem is that there is no way to change the output file name for each subsequent write, so all of the output lands in a single file. I thought about changing the value of $output_file from inside the awk code, but awk runs as a separate process from the shell, so I don't think that will work.
If I could run awk on only specific lines of the file, I think I could call awk from a shell loop and make that work, but I am guessing there is an easier way. I am running this in 32-bit Cygwin, so I have everything available from that kit.
Suggestions would be appreciated.
LMHmedchem
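One possible approach, offered as a minimal sketch rather than a definitive answer: awk can redirect print to a file whose name is built inside the script, so the shell never needs to change $output_file. The sketch below writes each line straight to the current subfile and moves to the next file name after every N terminators, which also removes the need to buffer rows in an array. The function name split_sdf, the prefix argument, and the PREFIX_1.sdf, PREFIX_2.sdf naming scheme are assumptions made for illustration, not anything from the original script. It also assumes Unix line endings; with CRLF files the comparison against "$$$$" would not match.

```shell
#!/bin/sh
# split_sdf INPUT PREFIX N   (hypothetical helper, for illustration only)
# Writes PREFIX_1.sdf, PREFIX_2.sdf, ... with N records each;
# the last file holds whatever records remain.
split_sdf() {
    awk -v per_file="$3" -v prefix="$2" '
        BEGIN { part = 1; out = prefix "_" part ".sdf" }
        # send every line directly to the current output file
        { print > out }
        # a $$$$ line closes one record
        $0 == "$$$$" {
            if (++mols == per_file) {
                close(out)      # release the file before opening the next one
                mols = 0
                out = prefix "_" (++part) ".sdf"
            }
        }
    ' "$1"
}
```

For example, split_sdf input.sdf part 6 would produce part_1.sdf, part_2.sdf, and so on, with six records in each file except possibly the last. Because a file is only created when the first line is printed to it, no empty file is left behind when the record count is an exact multiple of N.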