.PDF and .TXT to .XML. Is it possible?

Hi!

I need to accomplish the following task.
In a folder I have files like these:
name1.txt
name1.pdf
name2.txt
name2.pdf
etc...

I want to scan this folder, match files that share a base name (name1.txt with name1.pdf, name2.txt with name2.pdf), and create name1.xml, name2.xml, and so on from each pair.
Each .xml file should have this structure:
<file>
<text>...</text>
<pdfcontent>...</pdfcontent>
</file>

where the <text> element must hold the content of the nameX.txt file,
and the <pdfcontent> element must hold the binary content of nameX.pdf (encoded as base64 or something like it).

Thanks.

Okay: find lists all the pdf and txt files in the current directory, awk splits each path on "." and keeps the second field, turning ./filename.txt into /filename, then sort -u gets rid of any duplicates in that list. (This assumes the base names contain no other dots.) From there you can feed the names into the shell one by one and match "./${BASE}".* which will turn into ./filename.* and match only that group of files. You loop through that group in turn and do what you want with each file, redirecting the output of each pass into the new xml file.

find . -maxdepth 1 -type f -iname '*.txt' -o -iname '*.pdf' |
        awk -v FS="." '{ print $2 }' | sort -u |
while read BASE
do
        ( echo "<file>"

        for FILE in "./${BASE}".*
        do
        case "$FILE" in
        *.pdf)
            printf "<pdfcontent>"
            openssl base64 < "$FILE"
            printf "</pdfcontent>\n"
            ;;
        *.txt)
            printf "<text>"
            cat "$FILE"
            printf "</text>\n"
            ;;
        *) # Do nothing for files of the wrong type, i.e. .xml
            ;;
        esac
        echo "</file>" ) > "./${BASE}".xml
done

#!/bin/bash

dir=<some directory>

cd "$dir" || exit 1
for ea_file in *.txt
do
    fname=${ea_file%.txt}    # base name without the .txt extension
    if [ -f "${fname}.pdf" ]; then
       echo "<file>" > "${fname}.xml"
       echo "<text>${fname}.txt</text>" >> "${fname}.xml"
       echo "<pdf>${fname}.pdf</pdf>" >> "${fname}.xml"
       echo "</file>" >> "${fname}.xml"
     else
       echo "WARNING: The file ${fname}.txt has no corresponding ${fname}.pdf"
     fi
done

Corona688,
Sorry, but I get the following error:

alex@alex:~/123$ . test.sh
bash: test.sh: string 23: Syntax error: word unexpected (expecting ")")
bash: test.sh: string 23: `        echo "</file>" ) > "./${BASE}".xml'

dajon,
Thank you, but your script only puts the names of the files into the .xml file. I need the content of the .txt and .pdf files in the .xml.

Not sure where that went wrong. This one runs:

find . -maxdepth 1 -type f \( -iname '*.txt' -o -iname '*.pdf' \) |
        awk -v FS="." '{ print $2 }' | sort -u |
while read BASE
do
        ( echo "<file>"
        for FILE in "./${BASE}".*
        do
                case "$FILE" in
                *.txt)
                        printf "<text>"
                        cat "$FILE"
                        echo "</text>"
                        ;;
                *.pdf)
                        printf "<pdfcontent>"
                        openssl base64 < "$FILE"
                        echo "</pdfcontent>"
                        ;;
                *)
                        ;;
                esac
        done
        echo "</file>"

        ) > "./${BASE}.xml"
done

how about this?

ls -l |awk -F'[. ]' '/\.txt/||/\.pdf/{++a[$(NF-1)]}END{for(i in a) if(a[i]==2) print "<file>\n<text>"i".txt</text>\n<pdfcontent>"i".pdf</pdfcontent>\n</file>" >(i".xml")}'

dajon and yinyuemi, the actual file contents are required between <text> and </text>, and the base64 encoding of the PDF between <pdfcontent> and </pdfcontent>.

Corona688, the text will need CDATA escaping, because a text file may contain "<" and "&", and these are not allowed unescaped in XML character data.

optik77 - wonder if it would be better to base64 encode the text file too?

for pdffile in *.pdf
do
   txtfile=${pdffile%.pdf}.txt
   xmlfile=${pdffile%.pdf}.xml
   if [ -f "$pdffile" ] && [ -f "$txtfile" ]
   then
        {
            printf '<file>\n<text><![CDATA['
            sed 's/]]>/] ]>/g' "$txtfile"    # break up any "]]>" that would end the CDATA section early
            printf ']]></text>\n<pdfcontent>'
            openssl base64 < "$pdffile"
            echo "</pdfcontent>"
            echo "</file>"
        } > "$xmlfile"    # redirect inside the test so an unmatched PDF leaves no empty .xml
    fi
done
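
On the idea of base64-encoding the text file as well, here is a minimal sketch, assuming whatever reads the XML is prepared to decode both elements. It uses the coreutils base64 command instead of openssl base64 (an assumption about what is installed; either should do for this purpose), and the function name xml_b64_pair is made up for illustration.

```shell
# Hypothetical helper (name not from the thread): print one <file> element
# with both payloads base64-encoded.
#   $1 = the .txt file, $2 = the .pdf file
# Assumes a base64 command is on PATH; openssl base64 could be swapped in.
xml_b64_pair() {
    printf '<file>\n<text>\n'
    base64 < "$1"
    printf '</text>\n<pdfcontent>\n'
    base64 < "$2"
    printf '</pdfcontent>\n</file>\n'
}
```

It would be called as xml_b64_pair name1.txt name1.pdf > name1.xml. Base64 output contains only letters, digits, "+", "/" and "=", so no CDATA escaping is needed. The trade-off is that the text is no longer readable in the XML itself.
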

for filename in `ls -l |awk -F'[. ]' '/\.txt/||/\.pdf/{++a[$(NF-1)]}END{for(i in a) if(a[i]==2) print i}'`
do
echo -e "<file>\n<text>" `cat $filename.txt` "</text>\n<pdfcontent>"  `openssl base64 -in $filename.pdf` "</pdfcontent>\n</file>\n"  >$filename.xml
done

@yinyuemi, a few things here:

  • a large number of files will overflow the command line of the for loop
  • the XML will fail to parse if the txt file contains "<" or "&"
  • a large txt or pdf file will overflow the command line of the echo command
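
The second and third points can be sketched as follows. This is only one possible approach, shown as an alternative to CDATA: it escapes the markup characters with sed and streams the file, so its contents never end up on a command line. The function names xml_escape and xml_text_element are made up for illustration and do not come from any post here.

```shell
# Hypothetical helper (name not from the thread): escape XML markup on stdin.
# "&" is replaced first so the entities added afterwards are not re-escaped.
xml_escape() {
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Hypothetical helper: print a <text> element for the file named in $1.
# The file is streamed through sed, so its size is not limited by the
# maximum length of a command line.
xml_text_element() {
    printf '<text>'
    xml_escape < "$1"
    printf '</text>\n'
}
```

It would be used as xml_text_element name1.txt >> name1.xml, in place of the echo with backticks.
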

Good points, thanks Chubler_XL.
I have no idea how XML parsing works, so I followed your code.
How about this? Please let me know, as usual, if there is any problem :)

for filename in `ls -l |awk -F'[. ]' '/\.txt/||/\.pdf/{++a[$(NF-1)]}END{for(i in a) if(a[i]==2) print i}'`
do
echo -e "<file>\n<text><![CDATA[" `sed 's/]]>/] ]>/g' $filename.txt ` "]]></text>\n<pdfcontent>"  `openssl base64 -in $filename.pdf` "</pdfcontent>\n</file>\n"  >$filename.xml
done

Ruby(1.9+)

#!/usr/bin/env ruby  

# xml template
xml=<<EOF
<file>
<text>%s</text>
<pdfcontent>%s</pdfcontent>
</file>
EOF

Dir["*.txt"].each do |file|
    filename=file.sub(/\.txt$/,"")
    pdf = filename+".pdf"
    xmlfile = filename+".xml"
    if File.exist?( pdf )    # File.exists? was removed in Ruby 3.2
        w = sprintf( xml , file, pdf )
        File.open(xmlfile,"w") { |f| f.write(w) }    # block form closes the file
    end
end

kurumi, how does that generate the base64 encode of the pdf file?

Well, I missed that out, didn't I? :)

to generate base64,

require 'base64'

# xml template
xml=<<EOF
<file>
<text>%s</text>
<pdfcontent>%s</pdfcontent>
</file>
EOF

Dir["*.txt"].each do |file|
    filename=file.sub(/\.txt$/,"")
    pdf = filename+".pdf"
    xmlfile = filename+".xml"
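    if File.exist?( pdf )    # File.exists? was removed in Ruby 3.2
        text = File.read( file )    # content of the .txt file, inserted unescaped
        b4 = Base64.encode64( File.binread(pdf) )    # read the PDF as raw bytes
        w = sprintf( xml , text, b4 )
        File.open(xmlfile,"w") { |f| f.write(w) }    # block form closes the file
    end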
end

One problem I see is the listing of files using ls -l. A simple shell expansion will do; there is no need to use ls -l.
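
A minimal sketch of what the shell expansion could look like, assuming the .txt and .pdf files sit side by side in one directory. The function name paired_basenames is made up for illustration.

```shell
# Hypothetical helper (name not from the thread): print each base name in
# directory $1 (default: current directory) that has both a .txt and a .pdf.
paired_basenames() {
    dir=${1:-.}
    for txt in "$dir"/*.txt
    do
        [ -f "$txt" ] || continue    # skips the unexpanded pattern when nothing matches
        base=${txt%.txt}             # strip only the final .txt, so dotted names survive
        if [ -f "$base.pdf" ]
        then
            printf '%s\n' "${base##*/}"
        fi
    done
}
```

The output can be read with paired_basenames . | while IFS= read -r base; do ...; done. Because the glob is expanded by the shell itself, names containing spaces or extra dots are handled, which the ls -l plus awk approach does not manage. Names containing newlines would still need different handling.
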

Thanks kurumi, It's nice to see a Ruby script that's more than a 1 liner.