Hello,
I have a relatively large text file (25,000K) consisting of records of data. For each record, I need to create a new line based on what is already there.
Every record has a block that looks like,
M END
> <ID>
1
> <SOURCE>
KEGG
> <SOURCE_ID>
C00002
> <NAME>
ATP; Adenosine 5'-triphosphate
> <SMILES>
Nc1ncnc2n(cnc12)[C@@H]1O[C@H](COP(O)(=O)OP(O)(=O)OP(O)(O)=O)[C@@H](O)[C@H]1O
> <MIMW>
506.995745159
> <FORMULA>
C10H16N5O13P3
$$$$
The data tag lines (> <ID>, etc.) are the same for each record (or should be). The data on the line below each tag varies. I need to make a new field called > <SOURCE_SOURCE_ID>, which holds the data from > <SOURCE> concatenated with the data from > <SOURCE_ID>, separated by an underscore.
The record above would look like,
M END
> <ID>
1
> <SOURCE>
KEGG
> <SOURCE_ID>
C00002
> <SOURCE_SOURCE_ID>
KEGG_C00002
> <NAME>
ATP; Adenosine 5'-triphosphate
> <SMILES>
Nc1ncnc2n(cnc12)[C@@H]1O[C@H](COP(O)(=O)OP(O)(=O)OP(O)(O)=O)[C@@H](O)[C@H]1O
> <MIMW>
506.995745159
> <FORMULA>
C10H16N5O13P3
$$$$
This is quite a bit beyond the things I normally do with shell scripts and I'm not sure where to start. I presume this would be some kind of while read line loop that looks for > <SOURCE> and captures the next line, looks for > <SOURCE_ID> and captures the next line, makes up the new variable, and makes an insert. All other lines would just be printed. This seems like manipulating an output stream, which I know how to do in cpp, but not in bash.
Suggestions would be greatly appreciated.
LMHmedchem
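The line-by-line idea described above could be sketched in awk roughly as follows. This is only an illustration, not one of the answers from the thread: the variable names `grab` and `src` are made up, and it assumes the layout exactly as posted, with each value on the line directly below its tag and no blank lines in between. A file that has a blank line after each value would need that blank line emitted before the new tag.

```shell
# Small sample in the layout shown in the post (no blank lines).
sample=$(mktemp)
printf '%s\n' \
  '> <ID>' \
  '1' \
  '> <SOURCE>' \
  'KEGG' \
  '> <SOURCE_ID>' \
  'C00002' \
  '> <NAME>' \
  'ATP' \
  '$$$$' > "$sample"

out=$(awk '
  # Line right after the SOURCE tag: remember the value.
  grab == "src" { src = $0; grab = "" }
  # Line right after the SOURCE_ID tag: print it, then the new field.
  grab == "id"  {
    print
    print "> <SOURCE_SOURCE_ID>"
    print src "_" $0
    grab = ""
    next
  }
  # Tag lines: note which value comes next.
  /^>[ \t]*<SOURCE>/    { grab = "src" }
  /^>[ \t]*<SOURCE_ID>/ { grab = "id" }
  # Everything else is passed through unchanged.
  { print }
' "$sample")

printf '%s\n' "$out"
rm -f "$sample"
```

Nothing is held in memory beyond the current line and the last SOURCE value, so the size of the file should not matter much.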
$ awk '!f{f=/>[ \t]+<SOURCE>/}!s{s=/>[ \t]+<SOURCE_ID>/} f && s && !NF {print insert; f=s=""}1' insert="\n> <SOURCE_SOURCE_ID>\nKEGG_C00002" file
Try this too:
awk '/<SOURCE>/{a=$NF};/<SOURCE_ID>/{$0=$0"\n\n> <SOURCE_SOURCE_ID>\n"a"_"$NF}1' ORS="\n\n" RS= file
I decided to try this first. I ran this from the command line adding my file name at the end.
awk '/<SOURCE>/{a=$NF};/<SOURCE_ID>/{$0=$0"\n\n> <SOURCE_SOURCE_ID>\n"a"_"$NF}1' ORS="\n\n" RS=KHHscaffolds_7108.sdf
I let it run for a while and it doesn't seem to do anything. There is no change to the file KHHscaffolds_7108.sdf and no output to the terminal. Should I be redirecting to a new file or something like that? Is it just taking a long time to run since it is processing the entire file in one pass?
I also tried,
awk '/<SOURCE>/{a=$NF};/<SOURCE_ID>/{$0=$0"\n\n> <SOURCE_SOURCE_ID>\n"a"_"$NF}1' ORS="\n\n" RS=file KHHscaffolds_7108.sdf > KHHscaffolds_7108_r3.sdf
and this finishes quickly, but the output file is the same as the input.
LMHmedchem
---------- Post updated at 03:21 PM ---------- Previous update was at 03:12 PM ----------
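Alright, after reading a bit about awk's RS variable, I see the meaning of RS= here: it sets RS to an empty value, so the file name has to follow as a separate argument.
This is the correct usage,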
awk '/<SOURCE>/{a=$NF};/<SOURCE_ID>/{$0=$0"\n\n> <SOURCE_SOURCE_ID>\n"a"_"$NF}1' ORS="\n\n" RS= KHHscaffolds_7108.sdf > KHHscaffolds_7108_r3.sdf
The empty space after RS= fooled me a bit there. This worked very well. I am always amazed at how fast these things can work, even on a large file. I am sure that would have taken me a few hundred lines in cpp and I doubt it would have run nearly as fast.
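A quick way to see both effects, as a rough sketch on made-up input: an empty RS switches awk to paragraph mode, and any command-line operand of the form name=value is taken as a variable assignment rather than a file, which leaves awk waiting on standard input when no other file is given. The variable name `dummy` below is just a placeholder.

```shell
# Seven lines forming three blank-line-separated paragraphs.
sample=$(mktemp)
printf 'a\nb\n\nc\nd\n\ne\n' > "$sample"

# Default RS: every line is a record.
default_count=$(awk 'END { print NR }' "$sample")

# Empty RS: every blank-line-separated paragraph is a record.
paragraph_count=$(awk 'END { print NR }' RS= "$sample")

# dummy=KHHscaffolds_7108.sdf is an assignment, not a file name,
# so awk reads the two lines supplied on standard input instead.
stdin_count=$(printf 'x\ny\n' | awk 'END { print NR }' dummy=KHHscaffolds_7108.sdf)

echo "default: $default_count  paragraph: $paragraph_count  stdin: $stdin_count"
rm -f "$sample"
```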
LMHmedchem
Glad it helped
Yep that is why I love awk!