Grepping summaries of academic citations

danbroz · November 9, 2012, 2:18am

Hello friends,
I'm trying to grep out the sentences that come just before academic citations in a PDF. The goal is to get summaries of citable work.

Here is what I tried after reading the man page.

pdftotext foo.pdf | grep -A 5 ***choose regular expression below***


pdftotext BioPsych10.pdf | grep -A 5 \([A-Z]*[a-z]\,[1-2][0-9][0-9][0-9]\)

It pauses, but doesn't produce anything. Also, it would be nice if the output could stop at the start of the desired sentence instead of always covering 5 lines.

These are the citation formats I want to match, each followed by the regular expression I plan to use.
(Daviis, 2004)

\([A-Z]*[a-z]\,[1-2][0-9][0-9][0-9]\)

(Schultz, 2000) and (White, 1989)

\([A-Z]*[a-z]\,[1-2][0-9][0-9][0-9]\) and \(, [A-Z]*[a-z]\,[1-2][0-9][0-9][0-9]\)

(Sutter, 1987; Reid and Shapley, 1992)

\([A-Z]*[a-z]\, [1-2][0-9][0-9][0-9]\; [A-Z]*[a-z] and [A-Z]*[a-z]\, [1-2][0-9][0-9][0-9]\)

(Enroth-Cugell and Robson, 1966)

\([A-Z]*[a-z]\-[A-Z]*[a-z] and [A-Z]*[a-z]\, [1-2][0-9][0-9][0-9]\

(Barlow, 1961, 1989; Atick and Redlich, 1990; Atick, 1992)

\([A-Z]*[a-z]\, [1-2][0-9][0-9][0-9]\, [1-2][0-9][0-9][0-9]\; [A-Z]*[a-z] and [A-Z]*[a-z]\, [1-2][0-9][0-9][0-9]\; [A-Z]*[a-z]\, [1-2][0-9][0-9][0-9]\)

(Dong and Atick, 1995a)

\([A-Z]*[a-z] and [A-Z]*[a-z]\, [1-2][0-9][0-9][0-9][a-z)\)

Thank you for taking the time to read this. Please let me know if you have any ideas.

itkamaraj · November 9, 2012, 2:34am

Can you post some of the output from the command below, along with the required output?

 
pdftotext BioPsych10.pdf

Scrutinizer · November 9, 2012, 3:38am

You need to use single quotes around your regular expression to protect it from the shell.
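
A minimal sketch of what the quoting changes, using the pattern from the first post. The variable names are made up for illustration, and the unquoted case assumes no file in the current directory happens to match the pattern as a glob. Two related points, as far as I know: once the pattern is quoted, GNU grep in its default (basic) syntax reads backslash-parenthesis as grouping, so literal parentheses are written without backslashes; and pdftotext writes to BioPsych10.txt rather than standard output unless - is given as the output file, which would explain a pipe that produces nothing.

```shell
# Unquoted: the shell strips the backslashes before grep ever sees the pattern.
set -- \([A-Z]*[a-z]\,[1-2][0-9][0-9][0-9]\)
unquoted=$1

# Single-quoted: the pattern reaches grep byte for byte.
quoted='\([A-Z]*[a-z]\,[1-2][0-9][0-9][0-9]\)'

printf 'unquoted: %s\n' "$unquoted"
printf 'quoted:   %s\n' "$quoted"
```

Comparing the two printed lines shows exactly which characters the shell consumed.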

RudiC · November 9, 2012, 3:57am

And you're searching for zero or more uppercase letters followed by a single lowercase letter: [A-Z]*[a-z]
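
A small sketch of the difference, assuming the surname Schultz from the samples above as test input. It compares the original name pattern with a revised one (one uppercase letter, then any number of lowercase letters), both forced to match the whole line.

```shell
# LC_ALL=C keeps [A-Z] and [a-z] as plain ASCII ranges.
# -x demands that the whole line match, -c prints the number of matching lines.
# "|| true" only hides grep's non-zero exit status when the count is 0.
old=$(printf '%s\n' 'Schultz' | LC_ALL=C grep -c -x '[A-Z]*[a-z]' || true)
new=$(printf '%s\n' 'Schultz' | LC_ALL=C grep -c -x '[A-Z][a-z]*' || true)

printf 'original pattern: %s matching line(s)\n' "$old"
printf 'revised pattern:  %s matching line(s)\n' "$new"
```

The original pattern cannot cover a whole surname because only its last character is allowed to be lowercase.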

Scrutinizer · November 9, 2012, 4:05am

And if you are using pdftotext to produce Unicode output and preserve accented characters, it is best to use [[:upper:]] instead of [A-Z] and [[:lower:]] instead of [a-z], and likewise [[:alpha:]] etc.
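
For illustration, a sketch of a single-author citation pattern written with the character classes. The sample sentence is invented, and matching accented names additionally assumes a UTF-8 locale is active. The pattern uses grep's default (basic) syntax, where unescaped parentheses are literal.

```shell
# One capitalised surname, a comma, a space and a four-digit year
# inside literal parentheses.
pattern='([[:upper:]][[:lower:]]*, [12][0-9][0-9][0-9])'

# -o prints only the part of the line that matched.
match=$(printf '%s\n' 'earlier work (White, 1989) showed this' | grep -o "$pattern")
printf '%s\n' "$match"
```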

RudiC · November 9, 2012, 4:08am

The command below would yield all lines from your sample above but suppress many other text lines. If it matches too much, try narrowing it down by being more specific, e.g. about the year numbers:

$ grep -E "([A-Za-z]+, [0-9]{4})" file
(Daviis, 2004)
(Schultz, 2000) and (White, 1989)
(Sutter, 1987; Reid and Shapley, 1992)
(Enroth-Cugell and Robson, 1966)
(Barlow, 1961, 1989; Atick and Redlich, 1990; Atick, 1992)
(Dong and Atick, 1995a)

And, yes, as Scrutinizer proposes, you may want to use the [[:upper:]] and [[:lower:]] classes.

danbroz · November 9, 2012, 11:25pm

pdftotext BioPsych10.pdf

dl.dropbox. C O M /u/4235339/BioPsych10.txt

It won't let me post URLs until I have 5 posts. It's a 2.4 MB file. Join up the .com to see it.

danbroz · November 11, 2012, 1:05am

Is there a way to include the sentence before the citation?

RudiC · November 11, 2012, 2:33am

Yes, instead of the -A5 option you used before, try -B5. And read the man page.
BTW, this would not yield the sentence but the 5 lines before the match. If you go for the sentence, this will become awkward...
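
A sketch of the leading-context option on an invented four-line file (wrapped.txt is a made-up name), using -B1 rather than -B5 to keep the output short. It also shows the limitation: the context is counted in lines, not sentences.

```shell
# Hypothetical sample with one sentence wrapped over two lines.
printf '%s\n' \
  'Unrelated opening line.' \
  'Responses adapt quickly to' \
  'stimulus contrast (Sutter, 1987).' \
  'Unrelated closing line.' > wrapped.txt

# -B1 prints one line of leading context before each matching line.
context=$(grep -B1 '([A-Za-z]*, [12][0-9][0-9][0-9])' wrapped.txt)
printf '%s\n' "$context"
```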

ripat · November 11, 2012, 3:08am

This regex seems to also return the required output:

grep -o "([^)]\+ [0-9]\{4\})" file

But it will fail on citations spread over multiple lines, as in:

Suggestion: first get rid of all newlines before running grep.

Edit:
Try this:

tr -d '\n' < file | grep -o "([^)(]\+ [0-9]\{4\})"
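
A sketch of the join-then-grep idea on an invented two-line sample (split.txt is a made-up name). It also compares deleting newlines with replacing them by spaces, since deleting glues the last word of one line to the first word of the next. The repetition is written as * instead of \+ purely as a portability assumption.

```shell
# A citation that is split across a line break.
printf '%s\n' 'as reported earlier (Atick and' 'Redlich, 1990) for natural scenes' > split.txt

# tr -d '\n' joins the lines with nothing in between.
glued=$(tr -d '\n' < split.txt | grep -o '([^)(]* [0-9]\{4\})')

# tr '\n' ' ' joins them with a space, keeping the words apart.
spaced=$(tr '\n' ' ' < split.txt | grep -o '([^)(]* [0-9]\{4\})')

printf 'deleted newlines:  %s\n' "$glued"
printf 'replaced newlines: %s\n' "$spaced"
```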

danbroz · November 12, 2012, 5:29am

I saw on another board:
"awk '/word1/', will print out the whole sentence, when I need just a word1."

I would love to have that kind of problem. Plugging in ripat's awesome regular expression:

tr -d '\n' < BioPsych10.txt | awk /'([^)(]\+ [0-9]\{4\})/'

returns nothing.

I think awk is interpreting it as literal text and not as a regular expression.
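
A likely explanation, offered as a sketch rather than a diagnosis: awk uses extended regular expressions, where unescaped parentheses group and a backslash in front of + or { makes them literal characters, the reverse of the grep syntax the pattern was written for. The example below rewrites the pattern for awk on an invented two-line file (lines.txt is a made-up name). Note also that awk prints whole records, so after tr -d '\n' the single record is the entire file.

```shell
printf '%s\n' 'Cells adapt to contrast (Sutter, 1987).' 'No citation on this line.' > lines.txt

# Extended syntax: \( and \) are literal parentheses, + needs no backslash.
# The year is spelled out digit by digit because not every awk supports {4}.
hits=$(awk '/\([^()]+ [0-9][0-9][0-9][0-9]\)/' lines.txt)
printf '%s\n' "$hits"
```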

RudiC · November 12, 2012, 8:07am

How far do you get with the -B5 option?

danbroz · November 12, 2012, 9:03am

-B5 cuts some sentences short or leaves a lot of garbage.

It was implied that I can get just the sentence and the citation using awk. Digging through the man page and Google searches implied that

awk '/word1/'

spits out the whole sentence. Does anyone know if I can do that with a regular expression? I tried in my previous post.
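
One possible approach, sketched under stated assumptions rather than tested on BioPsych10.txt: join the wrapped lines with spaces, then let grep -o print everything from the start of a sentence up to and including the citation. The sample text and the file name para.txt are invented. The pattern treats every full stop as a sentence boundary, so abbreviations such as "et al." or decimal numbers would cut a sentence short.

```shell
printf '%s\n' \
  'Neurons fire in bursts. Reward signals are coded' \
  'by dopamine cells (Schultz, 2000). More text follows.' > para.txt

# Join wrapped lines, then match from the first non-space character after
# the previous full stop through the closing parenthesis of the citation.
# [a-z]? allows years such as 1995a.
sentence=$(tr '\n' ' ' < para.txt |
  grep -oE '[^. ][^.]*\([^()]+ [0-9]{4}[a-z]?\)')
printf '%s\n' "$sentence"
```

When a sentence holds several citations, the match should run through the last one, because grep picks the longest match from a given starting point.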