Here we have a script that extracts all PDF links from a single page. Any ideas on how to make it read a list of pages instead of a single page, and extract all the PDF links from them?
#!/bin/bash
# NAME: pdflinkextractor
# AUTHOR: Glutanimate (http://askubuntu.com/users/81372/), 2013
# LICENSE: GNU GPL v2
# DEPENDENCIES: wget lynx
# DESCRIPTION: extracts PDF links from a website and dumps them to stdout and to a text file
# only works for links pointing to files with the ".pdf" extension
#
# USAGE: pdflinkextractor "www.website.com"
WEBSITE="$1"
echo "Getting link list..."
lynx -cache=0 -dump -listonly "$WEBSITE" | grep ".*\.pdf$" | awk '{print $2}' | tee pdflinks.txt
# OPTIONAL
#
# DOWNLOAD PDF FILES
#
#echo "Downloading..."
#wget -P pdflinkextractor_files/ -i pdflinks.txt
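One possible way to handle a list of pages (a minimal sketch, not part of the original script): assume the page URLs are stored one per line in a plain-text file whose name is passed as the first argument (e.g. urls.txt, name chosen here just for illustration), loop over them, and reuse the same lynx/grep/awk pipeline, collecting everything into a single pdflinks.txt:

#!/bin/bash
# pdflinkextractor-list (sketch): same idea as above, but reads page URLs from a file
# USAGE: pdflinkextractor-list urls.txt
URLLIST="$1"

echo "Getting link list..."
while IFS= read -r PAGE; do
    # skip empty lines in the URL list
    [ -z "$PAGE" ] && continue
    # extract the PDF links from this page, same pipeline as the single-page script
    lynx -cache=0 -dump -listonly "$PAGE" | grep ".*\.pdf$" | awk '{print $2}'
done < "$URLLIST" | tee pdflinks.txt

The combined output of the loop is piped through tee, so pdflinks.txt ends up with the PDF links from every page in the list, and the optional wget step above could then download all of them with -i pdflinks.txt as before.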