Hi all,
I am working on an extremely large collection of text data (about 2 million XML files) in a single directory. I have changed the extension from .xml to .dat. Right now I am using the code below to strip the XML tags, but it is far too slow and seems to take forever:
#ls -1 *.dat | while read page
# IFS= and -r keep leading spaces and backslashes in filenames intact
find . -name "*.dat" -print | while IFS= read -r page
do
links -dump "$page" > "$page.txt"
done
Note that the commented-out line with ls does not even work: it fails with an "Argument list too long" message.
Then I modified the code, and came up with this:
#ls -1 *.dat | while read page
#find . -name "*.dat" -print | while read page
# num counts the processed files; it is not yet used to build the filename
num=1
for page in *.dat;
do
links -dump "$page" > "$page.txt"
let num=num+1
done
Will this speed up my task? What I want is to stop using ls or find altogether: my code should generate each filename itself, and the program should then process the file with that generated name. The trick is that I have renamed all 2 million files with contiguous numbers (1.dat, 2.dat, 3.dat, 4.dat and so on, with no gaps), so that a counter can generate those numbers and the files can be read by name.
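To make the idea concrete, here is a rough sketch of the counter-driven loop I have in mind. It assumes the files really are named 1.dat through N.dat with no gaps; convert_numbered is just a name I made up for illustration, and the converter command is passed in as arguments so that something other than links can be substituted when trying it out.

```bash
# Sketch only: build each filename from a counter instead of listing the directory.
# Assumes files are named 1.dat, 2.dat, ..., TOTAL.dat with no gaps.
# convert_numbered is a hypothetical helper name, not part of links.
convert_numbered() {
    dir=$1
    total=$2
    shift 2                      # remaining arguments are the converter command
    num=1
    while [ "$num" -le "$total" ]; do
        # The name comes from the counter; no ls, find or glob is involved.
        "$@" "$dir/$num.dat" > "$dir/$num.dat.txt"
        num=$((num + 1))
    done
}

# Example call (not run here): convert_numbered . 2000000 links -dump
```

Whether this is actually faster than the loops above is exactly what I am unsure about, since links is still started once per file either way.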
Or is there a better way to speed up this task? I am using Linux with Bash.