Extract text from html using perl or awk

I am trying to extract the text that follows certain keywords from an html file. The keywords are reportLink": , "barcodedSamples": {" , and {"barcodeSampleInfo": {" . Both the perl and the awk commands run, but the output is just the entire index.html, not the desired output. Also, for reportLink": only the text after the second / until the third / is needed, but I do not think I accounted for that. Thank you :).

index.html

"reportLink": "/output/Home/Auto_user_S5-00580-5-Medexome_65_030/"", "status": "Completed", "timeStamp": "2016-09-01T18:32:18.000371+00:00"}], {"meta": {"limit": 20, "next": null, "offset": 0, "previous": null, "total_count": 6}, "objects": [{"barcodeId": "IonXpress", "barcodedSamples": {"MEV45": {"barcodeSampleInfo": {"IonXpress_007": {"controlSequenceType": "", "barcodedSamples": {"MEV46": {"barcodeSampleInfo": {"IonXpress_008"

commands tried

perl -ne 'print if /reportLink":/ /"barcodedSamples": {"/  /{"barcodeSampleInfo": {"/' index.html > out
awk -v RS='' '/reportLink":/ /"barcodedSamples": {"/  /{"barcodeSampleInfo": {"/' index.html > out

desired output

Auto_user_S5-00580-4-Medexome_65_30
IonXpress_007 MEV45
IonXpress_008 MEV46

We'll need to see the HTML, not just the bit you want.

I have attached the full file as it is quite large. Thank you :).

By no stretch of the imagination will your awk script run flawlessly. Even if the "patterns" were connected with OR operators, the current line/record would be printed as soon as any one of them turned out TRUE (printing is the default action, which you selected by not supplying one). As your file is just ONE line/record, the entire file is printed.
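To illustrate the point, here is a small sketch. The `sample` string is a made-up one-line stand-in for index.html, not the real file, and the patterns are joined with `||` so the script is at least syntactically valid:

```shell
# Hypothetical one-line stand-in for index.html (an assumption, not the real data).
sample='"reportLink": "/output/Home/run_1/", "status": "Completed"'

# The input is a single record, so as soon as one pattern matches,
# the default action prints that record, i.e. the whole line.
out=$(printf '%s\n' "$sample" | awk '/reportLink/ || /barcodedSamples/')
printf '%s\n' "$out"
```

This prints the sample line unchanged, which is why filtering whole records cannot isolate the wanted values here; the line has to be split into fields first.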

Try (as a starting point)

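awk -F"[]\":{}, ]*" '
# Build a lookup table of the keywords (split uses FS, which includes the comma).
BEGIN   {for (n=split ("reportLink,barcodedSamples,barcodeSampleInfo", T); n>0; n--) SRCH[T[n]] = n
        }
# Print the field that follows each keyword.
        {for (i=1; i<NF; i++) if ($i in SRCH) print $(i+1)
        }
' /tmp/6784d1473958785-extract-text-html-using-perl-awk-index-html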
MEV45
IonXpress_007
IonXpress_008
IonXpress_009
/output/Home/Auto_user_S5-00580-5-Medexome_66_030/
/output/Home/Auto_user_S5-00580-5-Medexome_66_tn_031/
MEV42
IonXpress_004
IonXpress_005
IonXpress_006
/output/Home/Auto_user_S5-00580-4-Medexome_65_028/
/output/Home/Auto_user_S5-00580-4-Medexome_65_tn_029/
MEC1
IonXpress_001
IonXpress_002
IonXpress_003
/output/Home/medex60_8.13.16_027/
/output/Home/reanlzemedex60_023/
/output/Home/Auto_user_S5-00580-2-Medical_Exome_60_014/
/output/Home/Auto_user_S5-00580-2-Medical_Exome_60_tn_015/
MEC1
IonXpress_001
IonXpress_002
IonXpress_003
/output/Home/Medex59_8.11.2016_026/
/output/Home/MEDEX59_8.11-2016_025/
/output/Home/reanalyze59_8.10.16_024/
/output/Home/Auto_user_S5-00580-3-Medical_Exome_59_016/
chipDescription
/output/Home/Auto_user_S5-00580-1-IQOQ_RUN_Sample_2_51_012/
/output/Home/Auto_user_S5-00580-1-IQOQ_RUN_Sample_2_51_tn_013/
chipDescription
/output/Home/Auto_user_S5-00580-0-Test_Fragment_Run_49_010/
/output/Home/Auto_user_S5-00580-0-Test_Fragment_Run_49_tn_011/
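
Building on that starting point, one possible refinement towards the desired layout is sketched below. It rests on several assumptions: the `sample` string is a shortened stand-in modelled on the line posted in the question, the run name is taken to be the last non-empty component of the reportLink path, and each barcode is paired with the most recent barcodedSamples name seen before it. The field separator uses `+` instead of `*` so that it never matches the empty string. Adjust as needed for the real file.

```shell
# Shortened stand-in modelled on the posted line (an assumption, not the full file).
sample='"reportLink": "/output/Home/Auto_user_S5-00580-5-Medexome_65_030/", "status": "Completed", "barcodedSamples": {"MEV45": {"barcodeSampleInfo": {"IonXpress_007": {"controlSequenceType": "", "barcodedSamples": {"MEV46": {"barcodeSampleInfo": {"IonXpress_008"'

out=$(printf '%s\n' "$sample" | awk -F'[]":{}, ]+' '
{
    for (i = 1; i < NF; i++) {
        if ($i == "reportLink") {
            n = split($(i + 1), part, "/")   # path ends in "/", so part[n] is empty
            print part[n - 1]                # last non-empty path component
        } else if ($i == "barcodedSamples") {
            name = $(i + 1)                  # remember the sample name
        } else if ($i == "barcodeSampleInfo") {
            print $(i + 1), name             # barcode followed by its sample name
        }
    }
}')
printf '%s\n' "$out"
```

For this sample line the sketch prints the run name first, then "IonXpress_007 MEV45" and "IonXpress_008 MEV46". If one sample name in the real file covers several barcodes, each of those barcodes is printed with that same name.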

Thank you very much, that gives me a good start :).