I have searched this forum for related posts but could not find one that fits my problem. I have a shell script which removes all the XML tags, including the text inside the tags, from some 4 million XML files.
The shell script looks like this (MODIFIED):
find . "*.xml" -print | while read page
do
cat $page | sed -e 's/<.*>//g' $page>$page.txt
done
Previously, the shell script looked like this (ORIGINAL):
ls -1 *.xml | while read page
do
cat $page | sed -e 's/<.*>//g' $page>$page.txt
done
Because ls *.xml gives an "Argument list too long" message, I searched this forum and modified my ORIGINAL script to come up with the MODIFIED version (above). But the MODIFIED version does not seem to work.
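One likely reason the MODIFIED version misbehaves: without -name, find treats "*.xml" as a second starting path rather than a filename pattern. A minimal sketch of the same loop with that changed is below. The scratch directory and sample file are made up so the sketch can run anywhere, and the sed expression is left exactly as posted.

```shell
# Hypothetical scratch directory and sample file, for illustration only.
workdir=$(mktemp -d)
printf '%s\n' '<?xml version="1.0"?>' '<text>' 'Main paragraph.' '</text>' > "$workdir/sample.xml"

# -name makes '*.xml' a filename pattern; the quotes stop the shell from
# expanding it, so no huge argument list is ever built.
find "$workdir" -type f -name '*.xml' |
while IFS= read -r page
do
    # sed reads the file itself, so the leading cat is unnecessary.
    # Note: this line-based loop would still break on filenames containing newlines.
    sed -e 's/<.*>//g' "$page" > "$page.txt"
done
```

In a real run, "$workdir" would be replaced by the directory that holds the XML files.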
Thanks. Is there any workaround for handling 4 million files in a directory on Linux? Many posts here suggest using xargs. Let me try that; if it works, I'll post my code here.
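A rough sketch of what the xargs route could look like, assuming a find and xargs that support -print0 and -0 (the GNU versions on Linux do). The scratch directory and the two sample files are invented for illustration, and the sed expression is the one from the script above.

```shell
# Hypothetical scratch directory and sample files, for illustration only.
workdir=$(mktemp -d)
printf '%s\n' '<text>' 'First file.' '</text>' > "$workdir/a.xml"
printf '%s\n' '<text>' 'Second file.' '</text>' > "$workdir/b.xml"

# find emits NUL-terminated names; xargs -0 groups them into batches that stay
# under the argument-length limit and passes each batch to a small sh loop.
find "$workdir" -type f -name '*.xml' -print0 |
xargs -0 sh -c '
    for page in "$@"
    do
        sed -e "s/<.*>//g" "$page" > "$page.txt"
    done
' sh
```

The inner sh loop is there because each file needs its own output redirection, which xargs cannot do by itself.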
---------- Post updated at 01:47 PM ---------- Previous update was at 01:12 PM ----------
OK, so as of now this is what I have done:
echo *.xml | xargs ls -1 | while read page
do
cat $page | sed -e 's/<.*>//g' $page>$page.txt
done
When I run
echo *.xml | xargs ls -1
I can see the list of files. But the .txt files that I am getting are all empty.
You can try this code for finding a large number of files in a directory.
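# Let find do the matching and read one name per line, so names are not word-split
find scripting -name '*.xml' | while IFS= read -r page
do
# sed reads the file directly; no cat needed
sed -e 's/<.*>//g' "$page" > "$page.txt_3"
done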
In the above code, "scripting" is the directory to search.
As for this code: cat $page | sed -e 's/<.*>//g' $page>$page.txt
As you said, it removes all the XML tags including the text inside the tags in all the XML files. So obviously the output text files will be empty.
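To see why, note that .* in sed is greedy: on a line with several tags, <.*> matches from the first < to the last >, so any text sitting between the tags is deleted as well. A small sketch with a made-up one-line sample:

```shell
# Made-up sample line; real files may look different.
line='<text>Main paragraph.</text>'

# Greedy match: everything from the first '<' to the last '>' is deleted.
greedy=$(printf '%s\n' "$line" | sed -e 's/<.*>//g')
printf 'greedy result: [%s]\n' "$greedy"
```

Here nothing is left between the brackets, which matches the empty output files.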
Thanks. Sorry, I guess I was a bit vague here. When I wrote that the script "removes all the XML tags including the text inside the tags", I meant that it deletes the contents inside tags like
<text>
<?xml version="1.0" encoding="iso-8859-1" ?>
This means my script removes only the above tags including all the text inside the tags (like "text" and "?xml version="1.0" encoding="iso-8859-1" ?") and keeps the main paragraphs of the files.
---------- Post updated at 05:12 PM ---------- Previous update was at 05:09 PM ----------
Oh great!
You've pointed out one more fault. It is indeed deleting everything. This I can fix myself.
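For reference, one common way to stop the expression from swallowing whole lines is to match anything except > inside the tag instead of using .* there. A minimal sketch, assuming no tag spans several lines and no > appears inside an attribute value; the sample line is made up:

```shell
# Made-up sample line; real files may look different.
line='<text>Main paragraph.</text>'

# [^>]* stops at the first '>', so each tag is removed separately
# and the text between the tags survives.
stripped=$(printf '%s\n' "$line" | sed -e 's/<[^>]*>//g')
printf '%s\n' "$stripped"
```

This prints "Main paragraph." for the sample line. Tags that are split across lines would need a different approach.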