Adding lines to a large file

LMHmedchem · September 20, 2014, 1:30pm

Hello,

I have a relatively large text file (25,000K) consisting of records of data. For each record, I need to create a new line based on what is already there.

Every record has a block that looks like,

M  END
>  <ID>
1

>  <SOURCE>
KEGG

>  <SOURCE_ID>
C00002

>  <NAME>
ATP; Adenosine 5'-triphosphate

>  <SMILES>
Nc1ncnc2n(cnc12)[C@@H]1O[C@H](COP(O)(=O)OP(O)(=O)OP(O)(O)=O)[C@@H](O)[C@H]1O

>  <MIMW>
506.995745159

>  <FORMULA>
C10H16N5O13P3

$$$$

The data tag lines (> <ID>, etc.) are the same for each record (or should be). The data on the line below the tag varies. I need to make a new field called

> <SOURCE_SOURCE_ID>

It holds the data from > <SOURCE> concatenated with the data from > <SOURCE_ID>, separated by an underscore.

The record above would look like,

M  END
>  <ID>
1

>  <SOURCE>
KEGG

>  <SOURCE_ID>
C00002

>  <SOURCE_SOURCE_ID>
KEGG_C00002

>  <NAME>
ATP; Adenosine 5'-triphosphate

>  <SMILES>
Nc1ncnc2n(cnc12)[C@@H]1O[C@H](COP(O)(=O)OP(O)(=O)OP(O)(O)=O)[C@@H](O)[C@H]1O

>  <MIMW>
506.995745159

>  <FORMULA>
C10H16N5O13P3

$$$$

This is quite a bit beyond the things I normally do with shell scripts and I'm not sure where to start. I presume this would be some kind of while read line that looks for > <SOURCE> and captures the next line, looks for > <SOURCE_ID> and captures the next line, makes up the new variable, and makes an insert. All other lines would just be printed. This seems like manipulating an output stream, which I know how to do in cpp, but not in bash.

Suggestions would be greatly appreciated.

LMHmedchem

Akshay_Hegde · September 20, 2014, 1:49pm

awk '!f{f=/>[ \t]+<SOURCE>/}!s{s=/>[ \t]+<SOURCE_ID>/} f && s && !NF {print insert; f=s=""}1' insert="\n>  <SOURCE_SOURCE_ID>\nKEGG_C00002"  file
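
Note that the one-liner above inserts a fixed string, so every record would receive KEGG_C00002. Below is a minimal line-by-line sketch that builds the value per record, closer to the "look for the tag and capture the next line" approach described in the first post. The function name add_source_source_id and the awk variable names are made up for illustration. It assumes each tag is followed by exactly one data line and then a blank line, as in the sample record.

```shell
# Hypothetical helper: remember the line under each tag, then emit the new
# field after the blank line that closes the <SOURCE_ID> block.
add_source_source_id() {
  awk '
    want == "src" { src = $0; want = "" }               # data line under > <SOURCE>
    want == "id"  { id = $0; want = ""; pending = 1 }   # data line under > <SOURCE_ID>
    /^>[ \t]+<SOURCE>/    { want = "src" }              # next line holds the source
    /^>[ \t]+<SOURCE_ID>/ { want = "id" }               # next line holds the source id
    { print }                                           # every input line is passed through
    pending && !NF {                                    # blank line after the id value
      print ">  <SOURCE_SOURCE_ID>"
      print src "_" id
      print ""
      pending = 0
    }
  ' "$@"
}

# Small made-up fragment of a record to show the effect.
printf '>  <SOURCE>\nKEGG\n\n>  <SOURCE_ID>\nC00002\n\n>  <NAME>\nATP\n\n$$$$\n' |
  add_source_source_id
```

On a real file this would be run as add_source_source_id KHHscaffolds_7108.sdf > KHHscaffolds_7108_r3.sdf, writing to a new file rather than changing the input.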

pilnet101 · September 20, 2014, 2:41pm

Try this too:

awk '/<SOURCE>/{a=$NF};/<SOURCE_ID>/{$0=$0"\n\n>  <SOURCE_SOURCE_ID>\n"a"_"$NF}1' ORS="\n\n" RS= file
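
For readers who want to see what the one-liner above is doing, here is the same logic spread out with comments. This is only a sketch: the function name paragraph_mode_insert and the variable name src are made up, and RS and ORS are set in a BEGIN block instead of as trailing operands so they cannot be confused with the file name. With an empty RS, awk reads blank-line-separated paragraphs as records and treats newlines as field separators, so $NF is the last word of the paragraph. Two caveats follow from that: a value containing spaces would contribute only its last word, and runs of several blank lines in the input are collapsed to one.

```shell
# Hypothetical wrapper around the paragraph-mode approach.
paragraph_mode_insert() {
  awk '
    BEGIN { RS = ""; ORS = "\n\n" }   # paragraphs in, one blank line after each record out
    /<SOURCE>/    { src = $NF }       # last field of the <SOURCE> paragraph, e.g. KEGG
    /<SOURCE_ID>/ {                   # append the new tag and value to this paragraph
      $0 = $0 "\n\n>  <SOURCE_SOURCE_ID>\n" src "_" $NF
    }
    { print }                         # same job as the trailing 1 in the one-liner
  ' "$@"
}

# Small made-up fragment of a record to show the effect.
printf '>  <SOURCE>\nKEGG\n\n>  <SOURCE_ID>\nC00002\n\n>  <NAME>\nATP\n\n$$$$\n' |
  paragraph_mode_insert
```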

LMHmedchem · September 20, 2014, 3:21pm

I decided to try this first. I ran this from the command line adding my file name at the end.

awk '/<SOURCE>/{a=$NF};/<SOURCE_ID>/{$0=$0"\n\n>  <SOURCE_SOURCE_ID>\n"a"_"$NF}1' ORS="\n\n" RS=KHHscaffolds_7108.sdf

I let it run for a while and it doesn't seem to do anything. There is no change to the file KHHscaffolds_7108.sdf and no output to the terminal. Should I be redirecting to a new file or something like that? Is it just taking a long time to run since it is processing the entire file in one pass?

I also tried,

awk '/<SOURCE>/{a=$NF};/<SOURCE_ID>/{$0=$0"\n\n>  <SOURCE_SOURCE_ID>\n"a"_"$NF}1' ORS="\n\n" RS=file KHHscaffolds_7108.sdf > KHHscaffolds_7108_r3.sdf

and this finishes quickly, but the output file is the same as the input.

LMHmedchem

---------- Post updated at 03:21 PM ---------- Previous update was at 03:12 PM ----------

Alright, after reading a bit about awk RS, I see the meaning of RS=

This is the correct usage,

awk '/<SOURCE>/{a=$NF};/<SOURCE_ID>/{$0=$0"\n\n>  <SOURCE_SOURCE_ID>\n"a"_"$NF}1' ORS="\n\n" RS=  KHHscaffolds_7108.sdf > KHHscaffolds_7108_r3.sdf

The empty space after RS= fooled me a bit there. This worked very well. I am always amazed at how fast these things can work, even on a large file. I am sure that would have taken me a few hundred lines in cpp and I doubt it would have run nearly as fast.
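
In case it helps anyone else who trips on this: awk treats any operand of the form var=value as an assignment. Written without the space, RS=KHHscaffolds_7108.sdf is a single word that sets RS to the file name and leaves no input file, so awk waits on standard input, which is why it seemed to do nothing. RS=file sets RS to the literal text "file". Only an empty RS switches awk to paragraph mode. A tiny demonstration on made-up input (the variable names sample, paragraphs and lines are just for illustration):

```shell
# Made-up sample: two paragraphs separated by one blank line.
sample=$(mktemp)
printf 'a\nb\n\nc\n' > "$sample"

# RS is assigned the empty string and "$sample" is the input file,
# so awk counts blank-line-separated paragraphs.
paragraphs=$(awk 'END { print NR }' RS= "$sample")

# Default RS (newline) for comparison: one record per line.
lines=$(awk 'END { print NR }' "$sample")

echo "paragraphs=$paragraphs lines=$lines"   # prints paragraphs=2 lines=4
rm -f "$sample"
```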

LMHmedchem

pilnet101 · September 20, 2014, 3:54pm

Glad it helped.

Yep, that is why I love awk!