Save files in directory as txt

cmccabe · December 12, 2014, 10:20am

 wget -x -i link.txt

The above downloads and creates unique entries for the 97 links in the text file. However, each new file is saved as CM080 with a FILE extension. Is there a way to convert each file in that directory to a .txt? The 97 files are in C:\Users\cmccabe\Desktop\list\geneticslab.emory.edu\tests.

Thank you :).

rbatte1 · December 12, 2014, 10:36am

Afterwards, you could run a rename in a loop. Assuming that the directory only contains the files you want, you can:-

cd target_directory
for file in *
do
   mv -- "$file" "$file.txt"
done

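If the file names get longer and/or the number of files increases, bear in mind that there is a limit on the length of a command line. It applies when the expanded * is passed to an external command (e.g. mv * somewhere), though not to a shell for loop like the one above.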

All files will be renamed, so if you have a1.file and a2.file and already have a1.file.txt and a2.file.txt, the results might be a little unpredictable. It may well rename a1.file to a1.file.txt (overwriting the existing one) and then rename that same file to a1.file.txt.txt, which might be very confusing. So make sure you start with an empty directory before you download the files and rename them.
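
A minimal sketch of a more defensive version of the loop above, assuming a POSIX shell such as bash. The function name add_txt_ext is made up for illustration and is not from the thread; it skips anything that already ends in .txt and refuses to overwrite an existing target.

```shell
# add_txt_ext DIR: append .txt to every regular file in DIR that lacks it.
# (Hypothetical helper name, used only for this sketch.)
add_txt_ext() {
    dir=${1:-.}
    for file in "$dir"/*; do
        [ -f "$file" ] || continue            # skip directories and an unmatched glob
        case $file in
            *.txt) continue ;;                # already has the extension
        esac
        if [ -e "$file.txt" ]; then
            echo "skipping $file: $file.txt already exists" >&2
            continue
        fi
        mv -- "$file" "$file.txt"
    done
}
```

Calling add_txt_ext with the download directory as its argument should then be safe to repeat, since a second run finds nothing left to rename.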

I hope that this helps.

Robin

cmccabe · December 12, 2014, 10:58am

That worked great... Thank you :).

RavinderSingh13 · December 12, 2014, 11:16am

Hi cmccabe,

The following command may also help.

find . -maxdepth 1 -type f -exec bash -c 'echo mv "$0" "$0.txt"' {} \;

You can remove echo if happy with the results.
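
For comparison, here is a sketch of the same find approach that starts one shell per batch of files rather than one per file, and leaves files that already end in .txt alone. It assumes a find that supports -maxdepth (GNU and BSD find do); the function name rename_with_find is hypothetical. Unlike a loop with an explicit check, it does not guard against overwriting an existing target.

```shell
# rename_with_find DIR: append .txt to regular files directly inside DIR,
# ignoring names that already end in .txt. (Illustrative helper name.)
rename_with_find() {
    find "${1:-.}" -maxdepth 1 -type f ! -name '*.txt' \
        -exec sh -c 'for f do mv -- "$f" "$f.txt"; done' sh {} +
}
```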

Thanks,
R. Singh

junior-helper · December 12, 2014, 12:14pm

You know you have just renamed an HTML file to a txt file, don't you?
One can hardly convert an HTML file to a txt file, I mean in a way that the HTML tags disappear (yes, you can parse it with sed, but it's not recommended).

What do you think about this (yes, it looks complicated, but it might be a way better solution)...

In the download folder;
(Make sure there are only files downloaded from link.txt, just in case...)

awk '/pdf/ {
    gsub(/^.*href = "|".*/,"",$0)
    print FILENAME,$0 >> "/tmp/tcode-pdf.txt"
    print "http://geneticslab.emory.edu/tests/"$0 >> "/tmp/list2.txt"
}' *

The above awk:

- extracts the "path" to the download link for the appropriate pdf file,
- creates a file tcode-pdf.txt with testcode-pdfname pairs (later, this is used in the renaming process),
- generates a download list.
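
To see what the gsub call does in isolation, here is a small sketch that runs it on a single line. The sample HTML line is an assumption about what the downloaded pages look like (only the href = "..." shape is taken from the regex above), and extract_pdf_link is a made-up name.

```shell
# extract_pdf_link: print the href target from lines that mention "pdf".
# Reads stdin or the files given as arguments. (Illustrative helper name.)
extract_pdf_link() {
    awk '/pdf/ {
        # drop everything up to the opening quote, and from the closing quote on
        gsub(/^.*href = "|".*/, "", $0)
        print
    }' "$@"
}

# Assumed sample line, not copied from the real pages:
printf '%s\n' '<a href = "test-pdf.php?testid=4318">Download pdf</a>' | extract_pdf_link
```

On that sample line the result is test-pdf.php?testid=4318, which is the part the big awk command appends to the base URL.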

wget -x -i /tmp/list2.txt

This time, wget will download PDFs

awk '{ A[$1]=$2; next} END { for (i in A) print "mv \x27"A[i]"\x27",i".pdf" }' /tmp/tcode-pdf.txt | sh

This awk command will generate commands (and execute them) to rename the cryptic filename of the pdf to testcode.pdf
E.g. test-pdf.php?testid=4125 to MM123.pdf
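
An alternative sketch of the renaming step that reads the pairs directly in the shell instead of generating mv commands and piping them to sh. It assumes each line of the map file is "testcode downloaded-name" with no spaces in either field, as in tcode-pdf.txt, and that it is run in the directory holding the downloaded files. The function name rename_pdfs is hypothetical. Unlike the awk version, it processes every line rather than keeping one entry per test code.

```shell
# rename_pdfs MAPFILE: each line of MAPFILE is "<testcode> <downloaded-name>".
# (Illustrative helper name, not from the thread.)
rename_pdfs() {
    while read -r code name; do
        [ -n "$code" ] && [ -n "$name" ] || continue   # skip blank or incomplete lines
        if [ -e "$name" ]; then
            mv -- "$name" "$code.pdf"
        else
            echo "missing: $name" >&2                  # report instead of failing silently
        fi
    done < "$1"
}
```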

for i in *.pdf; do
 pdftotext "$i"
done

This converts the PDFs to txt files.
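
A slightly more defensive sketch of the same loop, assuming pdftotext comes from the poppler utilities (or xpdf) and may not be installed everywhere. The function name convert_pdfs is made up for illustration.

```shell
# convert_pdfs: run pdftotext on every .pdf in the current directory,
# after checking that the tool is available. (Illustrative helper name.)
convert_pdfs() {
    if ! command -v pdftotext >/dev/null 2>&1; then
        echo "pdftotext not found; install the poppler utilities first" >&2
        return 127
    fi
    for i in *.pdf; do
        [ -e "$i" ] || continue     # no PDFs present: the glob stays unexpanded
        pdftotext "$i"              # writes e.g. CM080.txt next to CM080.pdf
    done
}
```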

I've experimented with one test-code and the output looks very viable

cmccabe · December 12, 2014, 1:23pm

I am trying out your code, junior-helper, and have gotten as far as:

 awk '{ A[$1]=$2; next} END { for (i in A) print "mv \x27"A[i]"\x27",i".pdf" }' tcode-pdf.txt | sh

I am getting this error:

mv: cannot stat `test-pdf.php?testid=4405': No such file or directory
mv: cannot stat `test-pdf.php?testid=4143': No such file or directory
mv: cannot stat `test-pdf.php?testid=4432': No such file or directory
mv: cannot stat `test-pdf.php?testid=4421': No such file or directory
mv: cannot stat `test-pdf.php?testid=4415': No such file or directory
mv: cannot stat `test-pdf.php?testid=4434': No such file or directory
mv: cannot stat `test-pdf.php?testid=4391': No such file or directory

All the newly created files are in a new path:

 C:\Users\cmccabe\Desktop\list\geneticslab.emory.edu.txt\tests2\geneticslab.emory.edu\tests

but even if I do a cd to that directory I get the same error. The code seems very useful and helpful. Thank you :).

junior-helper · December 12, 2014, 3:18pm

OK, I think I know what might be the issue. In your posting #1 you said that each new file is saved with a FILE extension.

So I'm suspecting that the new files (pdfs in this case) might have such an extension too.
(Note: when I downloaded the files, neither the html files (e.g. CM080) nor the pdfs (e.g. test-pdf.php?testid=4405) had any extensions.)

If you provide the output of head -3 tcode-pdf.txt and
ls C:\Users\cmccabe\Desktop\list\geneticslab.emory.edu.txt\tests2\geneticslab.emory.edu\tests | head -3 I'm sure I can tweak that awk command to behave like it was intended.

cmccabe · December 12, 2014, 3:31pm

 
head -3 tcode-pdf.txt
CM080 test-pdf.php?testid=4318
CM081 test-pdf.php?testid=4401
CM082 test-pdf.php?testid=4400

 
ls C:\Users\cmccabe\Desktop\list\geneticslab.emory.edu.txt\tests2\geneticslab.emory.edu\tests | head -3
ls: cannot access C:UserscmccabeDesktoplistgeneticslab.emory.edu.txttests2geneticslab.emory.edutests: No such file or directory

Does this help? Thank you :).

junior-helper · December 12, 2014, 4:08pm

Yes and no. (The output of head -3 tcode-pdf.txt helped quite a lot; it indicates that the big awk command worked as expected.)

You said that even after a cd to that directory you get the same error.

Can you please cd to that directory again and run ls | head -3 and post the output?

cmccabe · December 12, 2014, 4:15pm

 
ls | head -3
tcode-pdf.txt
test-pdf.php@testid=4125
test-pdf.php@testid=4143

I manually copied over the tcode-pdf.txt as it was not there. Thanks for your help :).

I also attached tcode-pdf.txt if that helps :).

junior-helper · December 12, 2014, 4:55pm

Bullseye!!

tcode-pdf.txt
test-pdf.php@testid=4125
test-pdf.php@testid=4143

If you look at those pdf files, you'll notice how the question mark turned into an at sign (@). It must have happened during the download, for whatever reason. THAT is the reason why mv could not find the required/expected files...

The tcode-pdf.txt file is almost perfect.
There are 98 records, one is defective/superfluous, fix it with this command:

sed -i '/AACE/d' tcode-pdf.txt

This command will turn the question-marks to at-signs:

sed -i 's/\?/@/' tcode-pdf.txt
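
A quick way to preview the substitution before editing the file in place, sketched here on a line taken from the head -3 output earlier in the thread. This version writes the pattern as a plain ?, which is a literal character in a basic regular expression; the function name preview_fix is made up for illustration.

```shell
# preview_fix: show what the ?-to-@ substitution would produce without
# modifying anything. Reads stdin or the files given as arguments.
preview_fix() {
    sed 's/?/@/' "$@"
}

printf '%s\n' 'CM080 test-pdf.php?testid=4318' | preview_fix
```

For that line the output is CM080 test-pdf.php@testid=4318, matching the file names that ls showed.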

Now, finally, it's time to try it again:

awk '{ A[$1]=$2; next} END { for (i in A) print "mv \x27"A[i]"\x27",i".pdf" }' tcode-pdf.txt | sh

Knock on wood!

cmccabe · December 13, 2014, 10:06am

I ran the two sed commands and attached tcode-pdf and list2; here is the output:

 $ ls | head -3
CM080
CM081
CM082

The actual names appear to be different from those of the created files.

I get

mv: cannot stat `test-pdf.php@testid=4421': No such file or directory
mv: cannot stat `test-pdf.php@testid=4415': No such file or directory
mv: cannot stat `test-pdf.php@testid=4434': No such file or directory
mv: cannot stat `test-pdf.php@testid=4391': No such file or directory

I tried running the awk commands in both directories (the original and the newly created one). Which directory should I be in? Thank you for your help :).

cmccabe · December 13, 2014, 12:40pm

It looks like list2.txt also needed the sed command to change the ? to @. So could

 sed -i 's/\?/@/' tcode-pdf.txt list2.txt

be used to convert them at the same time?

Is there a way to automatically copy tcode-pdf.txt to the newly created directory? Does this command need to be modified:

  wget -x -i /tmp/list2.txt

When I run the conversion loop, pdftotext is not found:

 for i in *.pdf; do
>  pdftotext "$i"
> done
-bash: pdftotext: command not found
-bash: pdftotext: command not found
-bash: pdftotext: command not found
-bash: pdftotext: command not found
-bash: pdftotext: command not found

I am using Cygwin on Windows (I know it's not ideal, but it's what I have to use). Is there a command to install a package (poppler, I believe) into the Cygwin bin directory located here:

 C:\cygwin\bin

Thank you :).

---------- Post updated at 11:40 AM ---------- Previous update was at 09:53 AM ----------

I got it to work on a Linux machine... it definitely makes it easier to parse and I like the command. I appreciate all your help and will use that command as well as another I need help on, which I will post on Monday. Thank you :).