I have been searching and trying to come up with an awk script that will perform the following on a converted text file (the original is a PDF).
1. The first two lines begin with text, so they are removed.
2. If $1 is a number, all of the text that follows is merged (combined) into one line until the next line where $1 is a number. There may be zero, one, two, or more continuation lines before the next number. The number of lines varies, but each new record always starts with a number in $1.
3. The last 3 lines begin with text, so they are removed.
I have also included my awk script attempt, with comments describing what each part is meant to do. Thank you :).
file
TIER 1 MOLECULAR PATHOLOGY PROCEDURES
The following codes represent gene-specific and genomic procedures.
81161 This code is out of order. See page 714.
81162 This code is out of order. See page 712.
81170 ABL1 (ABL proto-oncogene 1, non-receptor tyrosine kinase) (eg, acquired imatinib tyrosine kinase inhibitor
resistance), gene analysis, variants in the kinase domain
81200
ASPA (aspartoacylase) (eg, Canavan disease) gene analysis, common variants (eg, E285A, Y231X)
81201 APC (adenomatous polyposis coli) (eg, familial adenomatosis polyposis [FAP], attenuated FAP) gene
analysis; full gene sequence
81202 known familiar variants
81203 duplication/deletion variants
81205 BCKDHB (branched-chain keto acid dehydrogenase E1, beta polypeptide) (eg, Maple syrup urine disease)
gene analysis, common variants (eg, R183P, G278S, E422X)
81206 BCR/ABL1 (t(9;22)) (eg, chronic myelogenous leukemia) translocation analysis; major breakpoint,
qualitative or quantitative
CPT codes and descriptions only ©2016 American Medical Association. All rights reserved.
CCI Comp. Code
Non-specific Procedure
desired output
81161 This code is out of order. See page 714.
81162 This code is out of order. See page 712.
81170 ABL1 (ABL proto-oncogene 1, non-receptor tyrosine kinase) (eg, acquired imatinib tyrosine kinase inhibitor resistance), gene analysis, variants in the kinase domain
81200 ASPA (aspartoacylase) (eg, Canavan disease) gene analysis, common variants (eg, E285A, Y231X)
81201 APC (adenomatous polyposis coli) (eg, familial adenomatosis polyposis [FAP], attenuated FAP) gene analysis; full gene sequence
81202 known familiar variants
81203 duplication/deletion variants
81205 BCKDHB (branched-chain keto acid dehydrogenase E1, beta polypeptide) (eg, Maple syrup urine disease) gene analysis, common variants (eg, R183P, G278S, E422X)
81206 BCR/ABL1 (t(9;22)) (eg, chronic myelogenous leukemia) translocation analysis; major breakpoint, qualitative or quantitative
awk
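awk '{ line[NR] = $0 }                   # buffer every line so the 3 trailing lines can be dropped
END {
    for (i = 3; i <= NR - 3; i++) {      # skip the first 2 and the last 3 lines
        split(line[i], f, " ")           # f[1] plays the role of $1
        if (f[1] ~ /^[0-9]+$/) {         # $1 is a number: a new record starts
            if (l != "") print l         # print the previous merged record
            l = line[i]
        } else if (l != "")
            l = l " " line[i]            # $1 is not a number: append to the current record
    }
    if (l != "") print l                 # print the last record
}' file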
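For comparison, here is a minimal sketch of the three steps that reads the file twice instead of buffering it in memory. It assumes the header is always exactly 2 lines and the trailer exactly 3 lines, as in the sample above. The file name sample.txt, the function name merge_codes, and the shortened sample data are made up for illustration only.

```shell
#!/bin/sh
# Hypothetical sample input (sample.txt is an assumed name), shortened from the post.
cat > sample.txt <<'EOF'
TIER 1 MOLECULAR PATHOLOGY PROCEDURES
The following codes represent gene-specific and genomic procedures.
81161 This code is out of order. See page 714.
81170 ABL1 (ABL proto-oncogene 1, non-receptor tyrosine kinase) (eg, acquired
resistance), gene analysis, variants in the kinase domain
81200
ASPA (aspartoacylase) (eg, Canavan disease) gene analysis
CPT codes and descriptions only
CCI Comp. Code
Non-specific Procedure
EOF

# merge_codes is a hypothetical helper: the file is passed to awk twice,
# so pass 1 can count the lines and pass 2 can skip the trailer.
merge_codes() {
    awk '
    NR == FNR { total = NR; next }          # pass 1: remember the line count
    FNR <= 2 || FNR > total - 3 { next }    # pass 2: skip 2 header and 3 trailer lines
    $1 ~ /^[0-9]+$/ {                       # numeric first field starts a new record
        if (rec != "") print rec            # print the previous merged record
        rec = $0
        next
    }
    rec != "" { rec = rec " " $0 }          # continuation line: append with a space
    END { if (rec != "") print rec }        # flush the last record
    ' "$1" "$1"
}

merge_codes sample.txt
```

Reading the file twice trades a second pass for constant memory use. If the header or trailer length can vary, the two fixed counts would need to be replaced by some other rule, which this sketch does not attempt.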