The above downloads and creates unique entries for the 97 links in the text file. However, each new file is saved under a name such as CM080 with a FILE extension. Is there a way to convert each file in that directory to a .txt? The 97 files are in C:\Users\cmccabe\Desktop\list\geneticslab.emory.edu\tests.
Afterwards, you could run a rename in a loop. Assuming that the directory only contains the files you want, you can:-
cd target_directory
for file in *
do
mv -- "$file" "$file.txt"
done
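If the file names get longer and/or the number of files increases, bear in mind that there is a limit on the length of a command line when * is expanded as arguments to an external command (for example ls *). A for loop like the one above runs inside the shell itself and is not subject to that limit.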
All files will be renamed, so if you have a1.file and a2.file and already have a1.file.txt and a2.file.txt, the results might be a little unpredictable. The loop may well rename a1.file to a1.file.txt (replacing the existing file of that name) and then rename that same file to a1.file.txt.txt, which would be very confusing. So make sure you start with an empty directory before you download the files and rename them.
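A slightly more defensive variant of that loop, as a sketch only: it quotes the names, skips anything that already ends in .txt, and refuses to overwrite an existing target. The function name add_txt_suffix is made up for this illustration.

```shell
# add_txt_suffix DIR: append ".txt" to every regular file in DIR that lacks it.
# (add_txt_suffix is a hypothetical helper name, not part of any tool.)
add_txt_suffix() {
    for file in "$1"/*; do
        [ -f "$file" ] || continue        # skip directories and an unmatched glob
        case $file in
            *.txt) continue ;;            # already renamed on an earlier run
        esac
        if [ -e "$file.txt" ]; then
            printf 'not overwriting %s\n' "$file.txt" >&2
            continue
        fi
        mv -- "$file" "$file.txt"
    done
}
```

From inside the target directory it would be called as add_txt_suffix . and it can be run a second time without producing names like a1.file.txt.txt.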
You know you have just renamed an HTML file to a .txt file, don't you?
One can hardly convert an HTML file to a text file, I mean in such a way that the HTML tags disappear (yes, you can parse it with sed, but it's not recommended).
What do you think about this (yes, it looks complicated, but it might be a way better solution)...
In the download folder;
(Make sure there are only files downloaded from link.txt, just in case...)
extracts the "path" to the download link for the appropriate pdf file.
creates a file tcode-pdf.txt with testcode-pdfname pairs (later, this is used in the renaming process)
generates a download list
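As a rough sketch of the last step only (this is not the command that was originally posted): assuming tcode-pdf.txt holds "testcode filename" pairs, one per line, and that every PDF link is relative to one base URL, a download list could be produced like this. The function name make_download_list and the base URL in the usage line are placeholders, not the real address.

```shell
# make_download_list MAPFILE BASE_URL
# Print BASE_URL followed by the second column of each line of MAPFILE.
# (Hypothetical helper; the two-column layout of MAPFILE is an assumption.)
make_download_list() {
    awk -v base="$2" 'NF >= 2 { print base $2 }' "$1"
}
```

Usage would be something like make_download_list /tmp/tcode-pdf.txt 'http://example.invalid/tests/' > /tmp/list2.txt, with the placeholder replaced by the actual site address.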
wget -x -i /tmp/list2.txt
This time, wget will download PDFs
awk '{ A[$1]=$2; next} END { for (i in A) print "mv \x27"A[i]"\x27",i".pdf" }' /tmp/tcode-pdf.txt | sh
This awk command will generate commands (and execute them) to rename the cryptic filename of the pdf to testcode.pdf
E.g. test-pdf.php?testid=4125 to MM123.pdf
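Since the generated commands go straight into sh, it can help to look at them first: leaving off the trailing | sh prints the mv commands without running them. The same renaming can also be written as a plain shell loop that reports a missing file together with the directory it looked in. This is a sketch, assuming each line of tcode-pdf.txt is "testcode filename" with no spaces in either field; rename_from_map is a made-up name.

```shell
# rename_from_map MAPFILE: for each "testcode filename" line, rename
# filename to testcode.pdf in the current directory.
# (Hypothetical helper; it does the same job as the awk | sh pipeline.)
rename_from_map() {
    while read -r code name; do
        [ -n "$code" ] && [ -n "$name" ] || continue   # skip blank or short lines
        if [ -e "$name" ]; then
            mv -- "$name" "$code.pdf"
        else
            printf 'skipping %s: %s not found in %s\n' "$code" "$name" "$PWD" >&2
        fi
    done < "$1"
}
```

It has to be run from the directory that actually holds the downloaded files, for example rename_from_map /tmp/tcode-pdf.txt.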
for i in *.pdf; do
pdftotext "$i"
done
This converts the PDFs to text files.
I've experimented with one test code and the output looks very usable.
I am trying out your code junior-helper and have gotten to the:
awk '{ A[$1]=$2; next} END { for (i in A) print "mv \x27"A[i]"\x27",i".pdf" }' tcode-pdf.txt | sh
I am getting this error:
mv: cannot stat `test-pdf.php?testid=4405': No such file or directory
mv: cannot stat `test-pdf.php?testid=4143': No such file or directory
mv: cannot stat `test-pdf.php?testid=4432': No such file or directory
mv: cannot stat `test-pdf.php?testid=4421': No such file or directory
mv: cannot stat `test-pdf.php?testid=4415': No such file or directory
mv: cannot stat `test-pdf.php?testid=4434': No such file or directory
mv: cannot stat `test-pdf.php?testid=4391': No such file or directory
all the newly created files are in a new file path:
OK, I think I know what might be the issue. In your posting #1 you said
So I suspect that the new files (PDFs in this case) might have such an extension too.
(Note: when I downloaded the files, neither the html files (e.g. CM080) nor the pdfs (e.g. test-pdf.php?testid=4405) had any extensions.)
If you provide the output of head -3 tcode-pdf.txt and ls C:\Users\cmccabe\Desktop\list\geneticslab.emory.edu.txt\tests2\geneticslab.emory.edu\tests | head -3 I'm sure I can tweak that awk command to behave like it was intended.
head -3 tcode-pdf.txt
CM080 test-pdf.php?testid=4318
CM081 test-pdf.php?testid=4401
CM082 test-pdf.php?testid=4400
ls C:\Users\cmccabe\Desktop\list\geneticslab.emory.edu.txt\tests2\geneticslab.emory.edu\tests | head -3
ls: cannot access C:UserscmccabeDesktoplistgeneticslab.emory.edu.txttests2geneticslab.emory.edutests: No such file or directory
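The error message shows what went wrong with that ls: the path was not quoted, so the shell treated every backslash as an escape character and removed it before ls ever saw the path. A small demonstration with a shortened path:

```shell
# Unquoted, the shell treats each backslash as an escape and drops it.
unquoted=$(printf '%s\n' C:\Users\cmccabe\Desktop)
# Inside single quotes the backslashes survive.
quoted=$(printf '%s\n' 'C:\Users\cmccabe\Desktop')
printf '%s\n%s\n' "$unquoted" "$quoted"

# On Cygwin, either quote the Windows path or use the POSIX form, e.g.:
#   ls 'C:\Users\cmccabe\Desktop\list' | head -3
#   ls /cygdrive/c/Users/cmccabe/Desktop/list | head -3
# (/cygdrive is the default prefix; cygpath -u 'C:\...' prints the POSIX form.)
```

The first line printed is C:UserscmccabeDesktop, the second is C:\Users\cmccabe\Desktop.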
If you look at those PDF files, you'll notice that the question mark has turned into an at sign (@). It must have happened during the download, most likely because wget replaces ? with @ when it restricts file names for Windows. THAT is the reason why the mv commands generated by awk could not find the required/expected files...
The tcode-pdf.txt file is almost perfect.
There are 98 records, one is defective/superfluous, fix it with this command:
sed -i '/AACE/d' tcode-pdf.txt
This command will turn the question-marks to at-signs:
sed -i 's/?/@/' tcode-pdf.txt
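Because sed -i changes the file in place, it may be worth previewing both edits first. A sketch that applies the same two expressions to standard input and leaves tcode-pdf.txt untouched (fix_map is a made-up name). Note that in sed's basic regular expressions a bare ? is an ordinary character, so it needs no backslash.

```shell
# fix_map: drop records mentioning AACE and turn the first "?" on each
# remaining line into "@". Reads stdin, writes stdout.
# (Hypothetical helper for previewing; nothing is modified on disk.)
fix_map() {
    sed -e '/AACE/d' -e 's/?/@/'
}
```

For example, fix_map < tcode-pdf.txt | head -3 shows what the first three records would look like afterwards.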
Now, finally, it's time to try it again:
awk '{ A[$1]=$2; next} END { for (i in A) print "mv \x27"A[i]"\x27",i".pdf" }' tcode-pdf.txt | sh
I ran the commands and attached tcode-pdf and list2, and here is the output of
$ ls | head -3
CM080
CM081
CM082
The actual names appear to be different from those of the created files.
I get
mv: cannot stat `test-pdf.php@testid=4421': No such file or directory
mv: cannot stat `test-pdf.php@testid=4415': No such file or directory
mv: cannot stat `test-pdf.php@testid=4434': No such file or directory
mv: cannot stat `test-pdf.php@testid=4391': No such file or directory
I tried running the awk commands in both directories (the original and the newly created one). Which directory should I be in? Thank you for your help :).
Is there a way to automatically copy tcode-pdf.txt to the newly created directory? Does this command need to be modified:
wget -x -i /tmp/list2.txt
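On the directory question: wget -x (--force-directories) recreates the host and path as a directory hierarchy, so the downloads end up in a subdirectory such as geneticslab.emory.edu/tests below wherever wget was started. The rename has to run in that subdirectory, and tcode-pdf.txt does not need to be copied there if it is referred to by its full path. A sketch for locating the directory, with find_download_dir as a made-up helper name:

```shell
# find_download_dir PREFIX [START]: print the directory that holds the first
# regular file under START (default: .) whose name begins with PREFIX.
# (Hypothetical helper; it runs in a subshell so it sets no variables.)
find_download_dir() (
    first=$(find "${2:-.}" -type f -name "$1*" | head -n 1)
    [ -n "$first" ] || exit 1
    dirname "$first"
)

# Possible use, starting in the directory that contains tcode-pdf.txt:
#   map=$PWD/tcode-pdf.txt
#   cd "$(find_download_dir test-pdf.php)" && ls | head -3
#   ...then run the rename command with "$map" as its input file.
```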
When I run
for i in *.pdf; do
> pdftotext "$i"
> done
-bash: pdftotext: command not found
-bash: pdftotext: command not found
-bash: pdftotext: command not found
-bash: pdftotext: command not found
-bash: pdftotext: command not found
I am using Cygwin on Windows (I know it's not ideal, but it's what I have to use). Is there a command to install a package (poppler, I believe) in the Cygwin bin directory located here:
C:\cygwin\bin
Thank you :).
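The "command not found" lines simply mean that no pdftotext program is on the PATH. On Cygwin, packages are normally added through the Cygwin setup program rather than by copying files into C:\cygwin\bin; which package provides pdftotext there is not confirmed here (poppler is the poster's guess). As a sketch, the loop can check for the tool once instead of failing for every file; convert_pdfs is a made-up name.

```shell
# convert_pdfs DIR: run pdftotext on every *.pdf in DIR, or say why it cannot.
# (Hypothetical helper wrapping the same pdftotext loop.)
convert_pdfs() {
    if ! command -v pdftotext >/dev/null 2>&1; then
        echo 'pdftotext not found in PATH; install it first' >&2
        return 127
    fi
    for pdf in "$1"/*.pdf; do
        [ -f "$pdf" ] || continue     # no PDFs: the glob stays unexpanded
        pdftotext "$pdf"              # writes the text next to the PDF as NAME.txt
    done
}
```

Called as convert_pdfs . from the directory with the renamed PDFs, it prints a single message when the tool is missing.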
I got it to work on a Linux machine... it definitely makes it easier to parse and I like the command. I appreciate all your help and will use that command as well as another I need help on and will post on Monday. Thank you :).