Using find in a directory containing a large number of files

shoaibjameel123 · August 8, 2011, 1:05am

Hi All,

I have searched this forum for related posts but could not find one that fits my problem. I have a shell script that removes all the XML tags, including the text inside the tags, from some 4 million XML files.

The shell script looks like this (MODIFIED):

find . "*.xml" -print | while read page
do
cat $page | sed -e 's/<.*>//g' $page>$page.txt
done

Previously, the shell script looked like this (ORIGINAL):

ls -1 *.xml | while read page
do
cat $page | sed -e 's/<.*>//g' $page>$page.txt
done

Since ls gives an "Argument list too long" message, I searched through this forum and modified my ORIGINAL script to come up with the MODIFIED version (above). But the MODIFIED version does not seem to work.

guruprasadpr · August 8, 2011, 1:09am

Hi

"ls -1 " and "find . " are not the same. find will get files from folders within the current directory as well, if it finds any.

Guru.
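
One detail worth spelling out: in the MODIFIED script, "*.xml" is passed to find without -name, so find treats it as another starting path rather than as a file name pattern. A minimal sketch of a find call that matches by name and stays in the current directory might look like the following. It assumes a find that supports -maxdepth (GNU and BSD find do; it is not in POSIX), and demo_dir and the file names in it are made up purely for illustration.

```shell
# Hypothetical scratch directory with one nested XML file, for illustration only.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/sub"
touch "$demo_dir/a.xml" "$demo_dir/b.xml" "$demo_dir/notes.txt" "$demo_dir/sub/c.xml"

# -name applies the pattern to file names; -maxdepth 1 keeps find out of sub/.
top_level=$(cd "$demo_dir" && find . -maxdepth 1 -type f -name '*.xml' | sort)
printf '%s\n' "$top_level"

rm -rf "$demo_dir"
```

Only a.xml and b.xml should be listed here; without -maxdepth 1, sub/c.xml would be picked up as well, which is the difference from ls described above.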

shoaibjameel123 · August 8, 2011, 1:47am

Thanks. Is there any workaround to handle 4 million files in a directory in Linux? Many posts here point to using xargs. Let me try that; if it works, I'll post my code here.

---------- Post updated at 01:47 PM ---------- Previous update was at 01:12 PM ----------

ok, so as of now this is what I have done:

echo *.xml | xargs ls -1 | while read page
do
cat $page | sed -e 's/<.*>//g' $page>$page.txt
done

When I run

echo *.xml | xargs ls -1

I can see the list of files. But the .txt files that I am getting are all empty.
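
A note on why the listing itself works: echo is a shell builtin, so the expanded *.xml list is never handed to an external command in one go, and xargs then splits it into batches. A slightly more defensive sketch of the same idea, assuming an xargs that supports -0 (a GNU/BSD extension, not POSIX), separates the names with NUL bytes so that names containing spaces stay intact. The directory and file names below are hypothetical.

```shell
# Hypothetical scratch directory; one file name contains a space on purpose.
demo_dir=$(mktemp -d)
touch "$demo_dir/plain.xml" "$demo_dir/with space.xml"

# printf is a builtin in common shells, so the expanded glob is not subject
# to the "Argument list too long" limit. NUL separators plus xargs -0 keep
# "with space.xml" as a single argument.
listed=$(cd "$demo_dir" && printf '%s\0' ./*.xml | xargs -0 ls -1 | wc -l | tr -d ' ')
echo "$listed"

rm -rf "$demo_dir"
```

With echo and a plain xargs, "with space.xml" would be split into two arguments, and ls would report two files that do not exist.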

ravi_san · August 8, 2011, 5:02am

Hi

You can try this code for handling a large number of files in a directory.

find scripting -name '*.xml' | while IFS= read -r page
do
  sed -e 's/<.*>//g' "$page" > "$page.txt_3"
done

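In the above code, "scripting" is the location of the directory.

As for this code:

cat $page | sed -e 's/<.*>//g' $page>$page.txt

As you said, it removes all the XML tags including the text inside the tags in all the XML files. The pattern <.*> is greedy: on each line it matches from the first < to the last >, so any paragraph text sitting between two tags on the same line is deleted too. So obviously the output text files will be empty.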
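
To make the difference concrete, here is a small sketch comparing the original pattern with a bounded one. The sample line is hypothetical (the real files may be laid out differently), and the bounded pattern is only a suggestion, not something taken from the original script. Since sed works line by line, neither pattern handles a tag that spans several lines.

```shell
# Hypothetical one-line sample; the real files may look different.
sample='<?xml version="1.0" encoding="iso-8859-1" ?><text>Main paragraph.</text>'

# Greedy: .* runs from the first '<' to the last '>' on the line.
greedy=$(printf '%s\n' "$sample" | sed -e 's/<.*>//g')

# Bounded: [^>]* stops at the first '>', so each tag is removed on its own.
bounded=$(printf '%s\n' "$sample" | sed -e 's/<[^>]*>//g')

printf 'greedy:  [%s]\nbounded: [%s]\n' "$greedy" "$bounded"
```

The greedy version leaves nothing of the sample line, while the bounded version keeps the paragraph text.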

shoaibjameel123 · August 8, 2011, 5:12am

Thanks. Sorry, I guess I was a bit vague here. When I wrote "removes all the XML tags including the text inside the tags", I meant the script deletes contents inside the tags like

<text>

 <?xml version="1.0" encoding="iso-8859-1" ?>

This means my script removes only the above tags including all the text inside the tags (like "text" and "?xml version="1.0" encoding="iso-8859-1" ?") and keeps the main paragraphs of the files.

---------- Post updated at 05:12 PM ---------- Previous update was at 05:09 PM ----------

Oh great!

You've pointed out one more fault. It is indeed deleting everything. This I can fix myself.

panyam · August 8, 2011, 5:37am

 
Is this correct?

cat $page | sed -e 's/<.*>//g' $page>$page.txt

It should be:

sed -e 's/<.*>//g' $page>$page.txt

shoaibjameel123 · August 8, 2011, 5:39am

Yes, it can also be done your way. But my way works too.
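
Both forms give the same output because sed ignores standard input once it is given a file operand, so the cat only adds an extra process per file. Pulling the points from this thread together, one possible end-to-end sketch is shown below. It is an illustration under assumptions, not the original poster's final script: it assumes find supports -maxdepth, uses a bounded tag pattern in place of the original one, and runs against a made-up demo_dir with a single sample file.

```shell
# Hypothetical scratch directory with one sample file (name contains a space).
demo_dir=$(mktemp -d)
printf '%s\n' '<?xml version="1.0" encoding="iso-8859-1" ?><text>First paragraph.</text>' \
  > "$demo_dir/page one.xml"

# find hands file names to the inline script in batches (-exec ... {} +),
# so no single command line ever has to hold all the names.
find "$demo_dir" -maxdepth 1 -type f -name '*.xml' -exec sh -c '
  for page in "$@"; do
    # sed reads the file operand directly; no cat is needed.
    sed -e "s/<[^>]*>//g" "$page" > "$page.txt"
  done
' sh {} +

result=$(cat "$demo_dir/page one.xml.txt")
printf '[%s]\n' "$result"

rm -rf "$demo_dir"
```

Quoting "$page" matters here: without the quotes, a name such as "page one.xml" would be split into two words.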