Parse text file using specific tags

cmccabe · December 11, 2014, 1:09pm

 awk -F "[<>]" '/<href=>|<href=>|<top>|<top>/ {print $3, OFS=\t}' source.txt > output.txt

I'm not quite sure how to parse the attached file. What I am trying to do is produce an output file with the link (href=), name (after the <), and count (<top>) in 3 separate columns.

My attempt is the script above. It creates output.txt, but the file is empty.

The desired output is:

http://geneticslab.emory.edu/tests/MM021     Autism Spectrum Disorders     61
http://geneticslab.emory.edu/tests/MM250     Brain Malformations     50

Thank you :).

RudiC · December 11, 2014, 2:09pm

Try

sed 'N; s/\n/\t/; s/href="/>/; s/<[^>]*>//g; s/">/\t/g; s/[ -]*&#[0-9]*;[ -]*//g; /^[\t]*$/d' /tmp/source.txt
http://geneticslab.emory.edu/tests/MM021    Autism Spectrum Disorders    61
http://geneticslab.emory.edu/tests/MM250    Brain Malformations    50
http://geneticslab.emory.edu/tests/MCAR1    Comprehensive Cardiovascular    106
.
.
.
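
For anyone reading along, here is the same command split into one expression per step. Treat it as a sketch only: it assumes GNU sed (which understands \t), and it assumes each link line in the source is immediately followed by its count line, roughly like the two sample lines in the comment. The attached file is not reproduced in this thread, so that layout and the function name strip_tags are assumptions.

```shell
# Assumed (hypothetical) input layout, two lines per record:
#   <a href="http://geneticslab.emory.edu/tests/MM021">Autism Spectrum Disorders</a>
#   <top>61</top>
#
# Steps, in order:
#   1. N                 append the next line to the current one
#   2. s/\n/\t/          turn the embedded newline into a tab
#   3. s/href="/>/       close the opening tag just before the URL
#   4. s/<[^>]*>//g      delete every remaining <...> tag
#   5. s/">/\t/g         the "> left after the URL becomes a tab
#   6. s/...&#NNNN;.../  drop numeric HTML entities and nearby spaces/dashes
#   7. /^[\t]*$/d        delete lines that are empty or only tabs
strip_tags() {
  sed -e 'N' \
      -e 's/\n/\t/' \
      -e 's/href="/>/' \
      -e 's/<[^>]*>//g' \
      -e 's/">/\t/g' \
      -e 's/[ -]*&#[0-9]*;[ -]*//g' \
      -e '/^[\t]*$/d' "$@"
}
```

Usage would be something like strip_tags /tmp/source.txt > output.txt.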

I'm not sure how to keep the last five entries from coming out garbled; they carry lengthy font/line-height info.
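
An awk approach closer to the original attempt might look like the sketch below. It assumes each record is a line containing href="..." followed, on a later line, by a line containing <top>NN; since the attached file is not shown in the thread, that layout and the function name extract_rows are hypothetical. As far as I can tell, the original attempt printed nothing because the unquoted \t in OFS=\t is not valid awk, so awk most likely stopped with a syntax error after the shell had already created output.txt.

```shell
# Assumed (hypothetical) input layout:
#   <a href="http://geneticslab.emory.edu/tests/MM021">Autism Spectrum Disorders</a>
#   <top>61</top>
extract_rows() {
  awk '
    # Link line: remember the URL and the text that follows the opening tag.
    match($0, /href="[^"]*"/) {
      url  = substr($0, RSTART + 6, RLENGTH - 7)  # text between the quotes
      rest = substr($0, RSTART + RLENGTH)         # everything after the URL
      sub(/^[^>]*>/, "", rest)                    # drop the rest of the opening tag
      sub(/<.*/, "", rest)                        # drop everything from the next tag on
      name = rest
      next
    }
    # Count line: print the finished record as three tab-separated columns.
    match($0, /<top>[0-9]+/) {
      if (url != "")
        print url "\t" name "\t" substr($0, RSTART + 5, RLENGTH - 5)
      url = name = ""
    }
  ' "$@"
}
```

Usage would be something like extract_rows source.txt > output.txt. Lines that match neither pattern are ignored, so blank lines or other markup between records should not matter under the assumptions above.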

cmccabe · December 11, 2014, 4:38pm

Thank you :).