awk to extract multiple values from file and add two additional fields

cmccabe · October 10, 2016, 1:24pm

In the attached file I am trying to use awk to extract multiple values and create the tab-delimited desired output.
In the output, R_Index is a sequential number and Pre-Enrichment defaults to a dot (.).
I can extract the values that sit beside their keywords, but most values are above the keyword and I cannot extract those. There is most likely a better way to do this, but I have included my attempt as well. Thank you :).

The first part of the awk adds R_Index; the second part of the awk defaults Pre-Enrichment to a dot (.).

awk -F'\t' -v OFS='\t' '{$0=((NR==1) ? "R_Index" : (NR - 1)) OFS $0} 1' | awk -F'\t' 'NR==1{Q=NF;print} NR>1{for(i=1;i<=Q;i++){if(!$i){$i="."}};print}' OFS="\t" | awk '{for (I=1;I<=NF;I++) if ($I == "Live") {print $(I+2)};}' test.txt|
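The first two stages of that pipeline (adding R_Index and filling empty fields) could be folded into one awk, roughly as sketched below. The input here is a hypothetical two-column, tab-delimited sample, not the real test.txt, and the variable names are made up for illustration. Note that a test like `!$i` is also true for a field containing 0, so the sketch compares against the empty string instead.

```shell
# Hypothetical tab-delimited sample (NOT the real test.txt).
sample=$(printf 'Name\tValue\nISPLoading\t84%%\nPre-Enrichment\t\nKeySignal\t0\n')

result=$(printf '%s\n' "$sample" | awk -F'\t' -v OFS='\t' '
    NR == 1 { nf = NF; print "R_Index", $0; next }   # header: remember column count, add label
    {
        for (i = 1; i <= nf; i++)
            if ($i == "") $i = "."                    # only truly empty fields become "."
        print NR - 1, $0                              # sequential index for each data row
    }')
printf '%s\n' "$result"
```

With this sample the KeySignal value of 0 is kept, and only the empty Pre-Enrichment field is replaced by a dot.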

desired output (the --- notes do not exist in test.txt; they are added only for clarification)

R_Index     1   --- not in test.txt added hopefully in awk
ISPLoading     84%
Pre-Enrichment     .      -- not in test.txt defaulted to .
TotalReads     75,130,408
ReadLength     203 bp
KeySignal     80
UsableSequence     61%
Enrichment     99.2%   --- this is called Live in test.txt
Polyclonal     30.0%
LowQuality     09.0%
TestFragment     88%
AlignedBases     99.1%
UnalignedBases     0.9%

RavinderSingh13 · October 10, 2016, 1:49pm

Hello cmccabe,

Could you please be clearer about your requirements? It is not clear which conditions have to be met to get your expected output. It is good that you are showing us your attempts; please also let us know all of the conditions/requirements needed to produce the expected output.

Thanks,
R. Singh

cmccabe · October 10, 2016, 3:16pm

In the attached test.txt, each of the $1 strings below can be found, and each has a value above it that I am trying to include as $2.

          (the --- are the location of the strings and values)
ISP Loading     84%      ---- row 3 $1
TotalReads     75,130,408  ---row 2 $2
ReadLength     203 bp    ---- row 3 $3, the mean value is used
KeySignal     80     ---  row 2 $2
UsableSequence     61%  ---- row 3 $2
Polyclonal     30.0%    --- row 10 $3
LowQuality     09.0%   --- row 11 $3
TestFragment     88%   --- row 20 $3
AlignedBases     99.1%   --- row 29 $3
UnalignedBases     0.9%    ---- row 30 $3
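One way to pull values by the row/field locations listed above is a small lookup table keyed on "row,field". This is only a sketch: the sample lines are hypothetical stand-ins (the real test.txt is not shown here), it assumes the file is tab-separated, and only a few of the locations are filled in.

```shell
# Hypothetical 3-line, tab-separated sample (NOT the real test.txt).
sample=$(printf 'Run Summary\nreads\t75,130,408\n84%%\t61%%\t203 bp\n')

result=$(printf '%s\n' "$sample" | awk -F'\t' -v OFS='\t' '
    BEGIN {
        # label wanted at each "row,field" position
        want["3,1"] = "ISP Loading"
        want["2,2"] = "Total Reads"
        want["3,3"] = "Read Length"
        want["3,2"] = "Usable Sequence"
    }
    {
        for (i = 1; i <= NF; i++) {
            key = NR "," i
            if (key in want) print want[key], $i      # label, then the value found there
        }
    }')
printf '%s\n' "$result"
```

The pairs come out in file order (row by row, field by field), so they would still need reordering to match the desired output.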

The first portion of the awk before the first | adds R_Index in $1 and numbers it sequentially in $2 as the first row in the desired output.

The second portion of the awk after the first | is an attempt at defaulting Pre-Enrichment to . in $2 , but I am unsure how to put that label in $1

Enrichment is called Live and has a value of 99.2% . The third portion of the awk after the | was an attempt to extract the value from test.txt . Since this is the only value that is after the keyword (not above), I think I am close.
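A minimal sketch of that keyword lookup on its own, assuming whitespace-separated fields with the value two fields after Live (as in the attempt). The sample line is hypothetical. One thing worth knowing about the pipeline: an awk that is given test.txt as an operand reads that file and ignores whatever is piped into it, so the third stage never sees the output of the first two.

```shell
# Hypothetical line (NOT copied from test.txt): the value sits two fields after "Live".
line='Live 7,513 99.2%'

result=$(printf '%s\n' "$line" | awk '
    {
        for (i = 1; i + 2 <= NF; i++)                 # guard so $(i + 2) always exists
            if ($i == "Live") { print "Enrichment\t" $(i + 2); exit }
    }')
printf '%s\n' "$result"
```

The `exit` stops at the first match, which assumes Live appears only once in the file.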

The final output is tab-delimited and looks like this:

R_Index     1
ISP Loading     84%
Pre-Enrichment     .
Total Reads     75,130,408
Read Length     203 bp
Key Signal     80
UsableSequence     61%
Enrichment     99.2%
Polyclonal     30.0%
Low Quality     09.0%
Test Fragment     88%
Aligned Bases     99.1%
Unaligned Bases     0.9%

I hope this helps and thank you very much :).

I need to update this post as my desired output has changed. I am not in my office and it is too hard to do from my phone, so I will update it from the office in about 2 hours. Thank you :).

here is the new edit:
new desired output

R_Index ISP Loading Pre-Enrichment Total Reads Read Length Key Signal Usable Sequence Enrichment Polyclonal Low Quality Test Fragment Aligned Bases Unaligned Bases
1 84 . 75130408 203 80 61 99.2 30 9 88 99.1 0.9

Description:
The tab-delimited output has a header in row 1. The header names are the keywords in the txt file from which data is extracted, plus the two additional fields R_Index and Pre-Enrichment . Below is the data, with each line commented only for clarification. I hope it helps, and thank you :).

R_Index 1 -- sequential #
ISP Loading     84% -- % removed
Pre-Enrichment     . -- always a dot
Total Reads     75,130,408 -- commas removed
Read Length     203 bp -- bp removed
Key Signal     80 -- just extracted as is
Usable Sequence     61% -- % removed
Enrichment     99.2% -- called Live in the txt, % removed
Polyclonal     30.0% -- trailing .0 and % removed
Low Quality     09.0% -- leading 0, trailing .0 and % removed
Test Fragment     88% -- % removed
Aligned Bases     99.1% -- % removed
Unaligned Bases     0.9% -- % removed
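
The clean-up rules in the list above could be sketched as one awk function plus a transpose into a header row and a data row. This assumes the label/value pairs have already been extracted (here they are simply typed in from the list above, tab-separated); `clean` is a made-up helper name. The numeric coercion `v + 0` is a shortcut that drops leading zeros and a trailing .0, but it would round values with more than six significant digits after a decimal point.

```shell
# Label/value pairs as listed in the post, tab-separated; R_Index is added by awk.
pairs=$(printf '%s\t%s\n' \
    'ISP Loading' '84%' \
    'Pre-Enrichment' '.' \
    'Total Reads' '75,130,408' \
    'Read Length' '203 bp' \
    'Key Signal' '80' \
    'Usable Sequence' '61%' \
    'Enrichment' '99.2%' \
    'Polyclonal' '30.0%' \
    'Low Quality' '09.0%' \
    'Test Fragment' '88%' \
    'Aligned Bases' '99.1%' \
    'Unaligned Bases' '0.9%')

result=$(printf '%s\n' "$pairs" | awk -F'\t' -v OFS='\t' '
    function clean(v) {
        if (v == ".") return v            # placeholder stays as is
        gsub(/[,%]/, "", v)               # drop thousands separators and percent signs
        sub(/[ \t]*bp$/, "", v)           # drop the "bp" unit
        return v + 0                      # numeric coercion: 09.0 -> 9, 30.0 -> 30
    }
    { header = header OFS $1; row = row OFS clean($2) }   # collect labels and cleaned values
    END { print "R_Index" header; print 1 row }')         # header row, then data row 1
printf '%s\n' "$result"
```

The second line of the result is the tab-delimited data row with 1 as the R_Index; for more than one input file the constant 1 would have to become a counter.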