match string exactly with awk/sed

euval · May 12, 2011, 5:02am

Hi all,

I have a list that I would like to parse with awk/sed. The list contains entries such as:

JournalTitle: Biochemistry
JournalTitle: Biochemistry and cell biology = Biochimie et biologie cellulaire
JournalTitle: Biochemistry and experimental biology
JournalTitle: Biochemistry and molecular biology education : a bimonthly publication of the International Union of Biochemistry and Molecular Biology
JournalTitle: Biochemistry and molecular biology international
JournalTitle: Biochemistry. Biokhimiia
JournalTitle: Biochemistry international
JournalTitle: Biochemistry research international
JournalTitle: Comparative biochemistry and physiology. Biochemistry and molecular biology
JournalTitle: Comparative biochemistry and physiology. Part B, Biochemistry & molecular biology
JournalTitle: Doklady. Biochemistry and biophysics
JournalTitle: Doklady biochemistry : proceedings of the Academy of Sciences of the USSR, Biochemistry section / translated from Russian
JournalTitle: Life sciences. Pt. 2: Biochemistry, general and molecular biology
JournalTitle: The Journal of experimental zoology. Supplement : published under auspices of the American Society of Zoologists and the Division of Comparative Physiology and Biochemistry / the Wistar Institute of Anatomy and Biology

If I want to search for "Biochemistry", I would like it to return this entry only and not any other combinations:

JournalTitle: Biochemistry

At present what I have is:

awk '/JournalTitle:/&&/Biochemistry/' J_Medline.txt | awk -F ":" '{print $0}'

but that does not give the desired result (due to my ignorance of awk syntax). Suggestions much appreciated!

Franklin52 · May 12, 2011, 5:09am

Something like this?

awk '/JournalTitle: Biochemistry/ && NF==2' J_Medline.txt

euval · May 12, 2011, 5:19am

Thanks for replying. Yes, that works, but my issue is a bit deeper: the regular expression in awk will be supplied from a variable within a script, so it may be "Biochemistry" or "Biochemistry and cell biology". I need awk to return an exact match each time, and I have no way of knowing how many fields the search string will contain when using your one-liner (thinking about this, maybe this could be calculated using echo/wc and then passed into the awk expression?). Any ideas/thoughts would be gratefully received!

Franklin52 · May 12, 2011, 5:28am

You can do something like this:

regex="Biochemistry and cell biology"
awk -v var="JournalTitle: $regex" '$0==var' J_Medline.txt
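
Note that every line in the sample list starts with "JournalTitle: ", so a whole-line string comparison can only succeed if that prefix is part of the string being compared. A minimal sketch, assuming a small temporary stand-in file instead of the real J_Medline.txt (the names sample, title and result are made up for illustration):

```shell
#!/bin/sh
# Hypothetical stand-in for J_Medline.txt with a few of the titles above
sample=$(mktemp)
cat > "$sample" <<'EOF'
JournalTitle: Biochemistry
JournalTitle: Biochemistry and cell biology = Biochimie et biologie cellulaire
JournalTitle: Biochemistry international
JournalTitle: Doklady. Biochemistry and biophysics
EOF

title="Biochemistry"

# Plain string comparison of the whole line against prefix + title;
# no regular expression is involved, so the field count does not matter
result=$(awk -v t="$title" '$0 == "JournalTitle: " t' "$sample")
printf '%s\n' "$result"

rm -f "$sample"
```

One caveat: awk interprets backslash escape sequences in values passed with -v, so a title containing a backslash would need different handling.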

sidorenko · May 12, 2011, 5:45am

Do you want to output exactly the substring matched by your regexp? That is, should the whole input line be printed only if it fully matches your regexp, with no other characters outside the match? Or may an input line contain the match along with other characters, in which case you want to print only the matched part?

euval · May 12, 2011, 6:20am

sidorenko - "Do you want to output exactly the substring matched by your regexp?"

Yes - that is exactly what I need. Any ideas?

sidorenko · May 12, 2011, 6:24am

>gawk -v re="hi baby" 'match($0,re){print substr($0,RSTART,RLENGTH)}'
ahasdf
ahsshi babyshsd
hi baby

Replace "hi baby" with the variable you want to pass.
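
Keep in mind that match() treats its second argument as a regular expression, so a title such as "Biochemistry. Biokhimiia" would also match text where the "." is some other character. If the search string should be taken literally, index() might be a better fit. A rough sketch, assuming two made-up input lines (the names title and out are illustrative only):

```shell
#!/bin/sh
title="Biochemistry. Biokhimiia"

# index() looks for a literal substring, so "." only matches a real dot.
# The second input line would match with match(), but not with index().
out=$(printf '%s\n' \
    'JournalTitle: Biochemistry. Biokhimiia' \
    'JournalTitle: BiochemistryX Biokhimiia' |
  awk -v s="$title" '(i = index($0, s)) { print substr($0, i, length(s)) }')

printf '%s\n' "$out"
# prints: Biochemistry. Biokhimiia
```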