awk to parse an HTML file

Is it possible to parse a webpage (EDAR Gene Sequencing - Genetic Testing Company | The DNA Diagnostic Experts | GeneDx) in awk? The source code is attached.

<title> EDAR Gene Sequencing
<dt>Test Code:</dt>
    <dd>156 </dd>
 
    <dt>Turnaround Time:</dt>
    <dd>6-8 weeks </dd>
 
    <dt>Preferred Specimen:</dt>
    <dd>2-5 mL Blood - Lavender Top Tube </dd>
 
<dt>CPT Codes:</dt>
    <dd>81479x1</dd>
 
<ul id="clinical-utility">
    <li>Confirmation of a clinical diagnosis </li>
    <li>Differentiation between X-linked and autosomal forms of the disease </li>
    <li>Prenatal diagnosis in at-risk pregnancies</li>
 
<ol id="references">
    <li>Bal, E et al. Hum Mutat. 28:703-709, 2007.</li>
    <li>Headon et al. Nature. 414:913-916, 2001.</li>
    <li>Monreal et al. Nat Genet 22:366-369, 1999.</li>
    <li>Chassaing et al. Hum Mutat. 27(3):255-259, 2006</li>

The <...> tags are not needed, only the text is, if that is possible. Thanks :).

Parsing XML is not trivial. This awk parser is not perfect at it, but it may do.

BEGIN {
        FS=">"; OFS=">";
        RS="<"; ORS="<"
}

# These should be special variables for match() but aren't.
function rbefore(STR)   { return(substr(STR, 1, RSTART-1)); }# before match
function rmid(STR)      { return(substr(STR, RSTART, 1)); }  # First char match
function rall(STR)      { return(substr(STR, RSTART, RLENGTH)); }# Entire match
function rafter(STR)    { return(substr(STR, RSTART+RLENGTH)); }# after match

function aquote(OUT, A, PFIX, TA) { # Turns Q SUBSEP R into A[PFIX":"Q]=R
        if(OUT)
        {
                if(PFIX) PFIX=PFIX":"
                split(OUT, TA, SUBSEP);
                A[toupper(PFIX) toupper(TA[1])]=TA[2];
        }

        return("");
}

# Intended to be less stupid about quoted text in XML/HTML.
# Splits a='b' c='d' e='f' into A[PFIX":"a]=b, A[PFIX":"c]=d, etc.
function qsplit(STR, A, PFIX, X, OUT) {
        while(STR && match(STR, /([ \n\t]+)|[\x27\x22=]/))
        {
                OUT = OUT rbefore(STR);
                RMID=rmid(STR);

                if((RMID == "'") || (RMID == "\""))     # Quote characters
                {
                        if(!Q)          Q=RMID;         # Begin quote section
                        else if(Q == RMID)      Q="";   # End quote section
                        else                    OUT = OUT RMID; # Quoted quote
                } else if(RMID == "=") {
                        if(Q)   OUT=OUT RMID; else OUT=OUT SUBSEP;
                } else if((RMID=="\r")||(RMID=="\n")||(RMID=="\t")||(RMID==" ")) {
                        if(Q)   OUT = OUT rall(STR); # Literal quoted whitespace
                        else    OUT = aquote(OUT, A, PFIX); # Unquoted WS, next block
                }
                STR=rafter(STR); # Strip off the text we've processed already.
        }

        aquote(OUT STR, A, PFIX); # Process any text we haven't already.
}


{ SPEC=0 ; TAG="" }

NR==1 {
        if(ORS == RS) print;
        next } # The first "line" is blank when RS=<

/^[!?]/ { SPEC=1    }   # XML specification junk

# Handle open-tags
match($1, /^[^\/ \r\n\t>]+/) {
        TAG=substr(toupper($1), RSTART, RLENGTH);
        if((!SPEC) && !($1 ~ /\/$/))
        {
                TAGS=TAG "%" TAGS;
                DEP++;
                LTAGS=TAGS
        }

        for(X in ARGS) delete ARGS[X];

        qsplit(rafter($1), ARGS);
}

# Handle close-tags
(!SPEC) && /^[\/]/ {
        sub(/^\//, "", $1);
        LTAGS=TAGS
        sub("^.*" toupper($1) "%", "", TAGS);
        $1="/"$1
        DEP=split(TAGS, TA, "%")-1;
        if(DEP < 0) DEP=0;
}
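
To make the record layout easier to picture, here is a minimal sketch of what the RS="<" and FS=">" settings do on their own, without the rest of the parser. The one-line input and the name split_demo are made up for illustration and are not taken from the attached page. Each record starts right after a "<", so $1 is the tag and $2 is the text that follows it.

```shell
# Illustration only: split_demo and its input are hypothetical.
split_demo() {
    # No trailing newline, so the last record is exactly "/dd>".
    printf '%s' '<dt>Test Code:</dt><dd>156 </dd>' |
    awk 'BEGIN { RS="<"; FS=">" }
         # Record 1 is the empty string before the first "<"; skip it.
         NR > 1 { printf "tag=[%s] text=[%s]\n", $1, $2 }'
}

split_demo
# tag=[dt] text=[Test Code:]
# tag=[/dt] text=[]
# tag=[dd] text=[156 ]
# tag=[/dd] text=[]
```

This is why the usage examples test TAGS and print $2: the text belonging to a tag arrives in the second field of the record that opened it.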

Use it like this (the -e option is not part of POSIX awk; GNU awk supports it):

$ awk -f xml.awk -e 'TAGS ~ /^TITLE/ { print $2 }
        TAGS ~ /^H4/ { P=/ORDERING|BILLING|REFERENCES/ ; next }
        {       gsub(/[\r\n\t ]+/, " ", $2);
                sub(/^ $/, "", $2);
                if(P && $2) print $2 }' ORS="\n" index.html

  EDAR Gene Sequencing - Genetic Testing Company | The DNA Diagnostic Experts | GeneDx
Test Code:
156
Turnaround Time:
6-8 weeks
Preferred Specimen:
2-5 mL Blood - Lavender Top Tube
CPT Codes:
81479x1
New York Approved:
Yes
ABN Required:
Yes
Billing Information:
View Billing Policy
ICD Codes:
757.31
: Congenital ectodermal dysplasia
*
 For price inquiries please email
zebras@genedx.com
Bal, E et al. Hum Mutat. 28:703-709, 2007.
Headon et al. Nature. 414:913-916, 2001.
Monreal et al. Nat Genet 22:366-369, 1999.
Chassaing et al. Hum Mutat. 27(3):255-259, 2006
Back To Top
Contact Us
Site Map
Terms of Service
Privacy Statement
© GeneDx
207 Perry Parkway Gaithersburg, MD 20877
Phone: +1 301 519 2100, Fax: +1 301 519 2892
Email:
genedx@genedx.com
Stay Connected:

$

Can you post what the desired output should look like?

In cases where you don't have quoted > characters in tags (and I didn't see any of them in your samples, but didn't do an exhaustive search in your attachment), the following much simpler script might work:

awk -F '<[^>]*>' '{$1=$1}1' OFS='' file

With the sample data you posted in the 1st message in this thread, it produces the output:

 EDAR Gene Sequencing
Test Code:
    156 
 
    Turnaround Time:
    6-8 weeks 
 
    Preferred Specimen:
    2-5 mL Blood - Lavender Top Tube 
 
CPT Codes:
    81479x1
 

    Confirmation of a clinical diagnosis 
    Differentiation between X-linked and autosomal forms of the disease 
    Prenatal diagnosis in at-risk pregnancies
 

    Bal, E et al. Hum Mutat. 28:703-709, 2007.
    Headon et al. Nature. 414:913-916, 2001.
    Monreal et al. Nat Genet 22:366-369, 1999.
    Chassaing et al. Hum Mutat. 27(3):255-259, 2006

I didn't see any problems processing your attached sample either, but due to the length (since this preserves all input lines and just removes tags), I won't post the results here. It would also be easy to get rid of empty lines after removing tags if that is what you want.
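
As a rough sketch of that last point, the variant below removes the tags with gsub(), trims leading and trailing whitespace, and skips lines that end up empty. It is not the script from this post but a different formulation of the same idea, and the function name strip_tags is made up for illustration. It shares the same limits: a quoted > inside a tag, or a tag that spans several lines, will not be handled correctly.

```shell
# Illustration only: strip_tags is a hypothetical helper name.
strip_tags() {
    awk '{
        gsub(/<[^>]*>/, "")               # remove every <...> tag on the line
        gsub(/^[ \t\r]+|[ \t\r]+$/, "")   # trim leading and trailing whitespace
        if ($0 != "") print               # skip lines that are now empty
    }' "$@"
}

# Reads the named files, or standard input when none are given:
printf '%s\n' '<dt>Test Code:</dt>' '    <dd>156 </dd>' ' ' | strip_tags
# Test Code:
# 156
```

Called as strip_tags file, it should print only the non-empty text lines of the file.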

Thank you all :).