Grepping summaries of academic citations

Hello friends,
I'm trying to grep out sentences from a PDF: specifically, the sentence that comes just before each academic citation. The goal is to get summaries of citable work.

Here is what I tried after reading the man page.

pdftotext foo.pdf | grep -A 5 ***choose regular expression below***


pdftotext BioPsych10.pdf | grep -A 5 \([A-Z]*[a-z]\,[1-2][0-9][0-9][0-9]\)

It pauses, but doesn't produce anything. Also it would be nice if I could stop printing at the start of the desired sentence, instead of 5 lines.

These are the citation formats I want to match, each followed by the regular expression I plan to use.
(Daviis, 2004)

\([A-Z]*[a-z]\,[1-2][0-9][0-9][0-9]\)

(Schultz, 2000) and (White, 1989)

\([A-Z]*[a-z]\,[1-2][0-9][0-9][0-9]\) and \(, [A-Z]*[a-z]\,[1-2][0-9][0-9][0-9]\)

(Sutter, 1987; Reid and Shapley, 1992)

\([A-Z]*[a-z]\, [1-2][0-9][0-9][0-9]\; [A-Z]*[a-z] and [A-Z]*[a-z]\, [1-2][0-9][0-9][0-9]\)

(Enroth-Cugell and Robson, 1966)

\([A-Z]*[a-z]\-[A-Z]*[a-z] and [A-Z]*[a-z]\, [1-2][0-9][0-9][0-9]\)

(Barlow, 1961, 1989; Atick and Redlich, 1990; Atick, 1992)

\([A-Z]*[a-z]\, [1-2][0-9][0-9][0-9]\, [1-2][0-9][0-9][0-9]\; [A-Z]*[a-z] and [A-Z]*[a-z]\, [1-2][0-9][0-9][0-9]\; [A-Z]*[a-z]\, [1-2][0-9][0-9][0-9]\)

(Dong and Atick, 1995a)

\([A-Z]*[a-z] and [A-Z]*[a-z]\, [1-2][0-9][0-9][0-9][a-z]\)

Thank you for taking the time to read this. Please let me know if you have any ideas.

Can you post some of the output from the command below, along with the required output?

 
pdftotext BioPsych10.pdf 

You need to use single quotes around your regular expression to protect it from the shell.

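Also, [A-Z]*[a-z] matches any number of uppercase letters followed by a single lowercase letter, which is not what a surname looks like. You probably want one uppercase letter followed by several lowercase ones: [A-Z][a-z]*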

And if you are using pdftotext to produce Unicode output and preserve accented characters, it is best to use [[:upper:]] instead of [A-Z], [[:lower:]] instead of [a-z], [[:alpha:]], etc.
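To make the two points above concrete, here is a minimal sketch. The sample text is made up for illustration, and the exact pattern (one capital, then lowercase letters, a comma, a four-digit year, an optional letter suffix) is only one reasonable guess at what a citation looks like in your file.

```shell
# Sample lines standing in for pdftotext output (made up for illustration).
sample='Receptive fields adapt to contrast (Daviis, 2004).
This line has no citation (see Figure 2).
Coding is efficient (Dong and Atick, 1995a).'

# Single quotes stop the shell from touching \ ( ) * [ ] in the pattern.
# [[:upper:]][[:lower:]]+ means one capital followed by one or more
# lowercase letters; \) is a literal closing parenthesis under -E.
matches=$(printf '%s\n' "$sample" |
  grep -E '[[:upper:]][[:lower:]]+, [12][0-9]{3}[a-z]?\)')
printf '%s\n' "$matches"
```

This should print the first and third sample lines only. One more thing worth checking: as far as I know, pdftotext writes its result to BioPsych10.txt rather than to the pipe unless the output name is given as -, i.e. pdftotext BioPsych10.pdf - | grep ..., which could explain why grep printed nothing.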

The command below yields all lines from your sample above while suppressing many other text lines. If it is too permissive, narrow it down by being more specific, e.g. about the year numbers:

$ grep -E "([A-Za-z]+, [0-9]{4})" file
(Daviis, 2004)
(Schultz, 2000) and (White, 1989)
(Sutter, 1987; Reid and Shapley, 1992)
(Enroth-Cugell and Robson, 1966)
(Barlow, 1961, 1989; Atick and Redlich, 1990; Atick, 1992)
(Dong and Atick, 1995a)

And, yes, as Scrutinizer proposes, you may want to use the [[:upper:]] and [[:lower:]] classes.

pdftotext BioPsych10.pdf

dl.dropbox. C O M /u/4235339/BioPsych10.txt

It won't let me post urls until I do 5 posts. It's a 2.4 MB file. Connect the .com to see it.

Is there a way to include the sentence before the citation?

Yes, instead of the -A5 option you used before, try -B5. And read the man page.
BTW, this would not yield the sentence, only the 5 lines before the match. If you want exactly the sentence, this will become awkward...


This regex seems to also return the required output:

grep -o "([^)]\+ [0-9]\{4\})" file

But it will fail on citations that are spread over multiple lines.

Suggestion: first get rid of all newlines before running grep.

Edit:
Try this:

tr -d '\n' < file | grep -o "([^)(]\+ [0-9]\{4\})"
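A small variation that may be worth trying: deleting the newlines glues the last word of one line to the first word of the next, so replacing each newline with a space keeps the words apart. The sketch below uses made-up text, swaps \+ for * and adds an optional letter after the year (for cases like 1995a); those changes are my assumptions, not part of the command above.

```shell
# Made-up text where one citation is split across two lines.
text='Responses are sparse (Sutter, 1987; Reid and
Shapley, 1992) in cortex.
Other work agrees (White, 1989).'

# Turn each newline into a space, then print only the citations.
# \{0,1\} makes the trailing letter (as in 1995a) optional.
cites=$(printf '%s\n' "$text" | tr '\n' ' ' |
  grep -o '([^)(]* [0-9]\{4\}[a-z]\{0,1\})')
printf '%s\n' "$cites"
```

With this input, both citations should come out, each on its own line, including the one that was originally broken across lines.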

I saw on another board:
"awk '/word1/', will print out the whole sentence, when I need just a word1."

I would love to have that kind of problem. Putting in user ripat's awesome regular expression

tr -d '\n' < BioPsych10.txt | awk /'([^)(]\+ [0-9]\{4\})/'

returns nothing.

I think it is interpreting it as text and not as a regular expression.
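A likely cause, as far as I can tell: awk uses extended regular expressions, where \+ and \{4\} are not the repetition operators they are in grep's basic syntax, and an unescaped ( starts a group instead of matching a parenthesis. Also, a bare /regex/ pattern prints the whole record, which after tr -d '\n' is the entire file. The sketch below, on made-up input, uses awk's match() to print only the matched citations; the optional letter after the year is my addition.

```shell
# Made-up one-line input, as you would get after flattening the text.
line='Coding is efficient (Dong and Atick, 1995a). See also (White, 1989).'

# match() sets RSTART and RLENGTH; the loop prints every citation
# instead of the whole line. [0-9][0-9][0-9][0-9] is spelled out
# because some older awks do not support {4}.
found=$(printf '%s\n' "$line" | awk '{
  s = $0
  while (match(s, /\([^)(]+ [0-9][0-9][0-9][0-9][a-z]?\)/)) {
    print substr(s, RSTART, RLENGTH)
    s = substr(s, RSTART + RLENGTH)
  }
}')
printf '%s\n' "$found"
```

On the real file you would feed it the flattened text, e.g. tr '\n' ' ' < BioPsych10.txt | awk '...'.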

How far do you get with the -B5 option?

-B5 cuts some sentences short or leaves a lot of garbage.

It was implied that I can get just the sentence and the citation using awk. From digging through the man page and Google searches, it seems that

awk '/word1/'

spits out the whole sentence. Does anyone know if I can do that with regular expressions? I tried in my previous post.
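For what it's worth, awk '/word1/' prints the whole record that contains a match, and a record is a line by default, not a sentence. One rough way to get "sentence plus citation" is to flatten the text and let the pattern reach back to the previous period. This is only a sketch on made-up text, and it assumes sentences end in "." and contain no other periods, so abbreviations or decimals before the citation will cut the sentence short.

```shell
# Made-up text; assumes sentences end in "." and contain no other periods.
text='Cells differ in size. Filtering removes
redundancy (Barlow, 1961, 1989; Atick, 1992). More text follows.'

# [^.]* reaches back to the previous period, so each match is the
# sentence so far plus its citation. sed trims the leading space.
sents=$(printf '%s\n' "$text" | tr '\n' ' ' |
  grep -o '[^.]*([^)(]* [0-9]\{4\}[a-z]\{0,1\})' |
  sed 's/^ *//')
printf '%s\n' "$sents"
```

Here the output should be the single sentence "Filtering removes redundancy (Barlow, 1961, 1989; Atick, 1992)", without the unrelated sentences around it.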