iconv and xmllint

Here is my question:

Volume of records processed: 5M (approx.)

It's basically a very simple operation I'm trying to do, and I have already achieved the output I'm interested in. What I'm really looking for is to improve the performance: an optimized way of doing it.

With respect to iconv, I am checking whether each record can be converted from one encoding format 'ef1' to another encoding format 'ef2'.
For this, I take one record at a time and apply the 'iconv' command; the return value '$?' tells me whether the record could be converted or not.
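
Roughly, what I do per record is something like this (simplified; 'ef1', 'ef2', and records.txt are placeholders):

while IFS= read -r record; do
    printf '%s\n' "$record" | iconv -f ef1 -t ef2 >/dev/null 2>&1
    if [ $? -ne 0 ]; then
        echo "record failed conversion"
    fi
done < records.txt

So that means one iconv process per record.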

Similarly, with respect to xmllint, I am creating an XML file and validating it against an XSD for conformance.
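
The validation is roughly a call like this ('records.xsd' and 'record.xml' are placeholder names; a non-zero exit status means the record does not conform):

xmllint --noout --schema records.xsd record.xml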

As said earlier, these are simple operations, and I need your thoughts/input on improving their efficiency.

Is there a better way of doing these operations when the volume of records is really huge (5M)?

Invoking iconv once on one large input is far more efficient than invoking it many times on small ones. You can see that from this:

admin@np64gw:/dev/shm$ time perl -e 'while (<>) { open(ICONV, "| iconv -f big5 -t utf8 >/dev/null"); print ICONV $_; close ICONV }' <XLink.txt

real    0m4.224s
user    0m2.200s
sys     0m0.652s
admin@np64gw:/dev/shm$ time iconv -f big5 -t utf8 XLink.txt >/dev/null

real    0m0.009s
user    0m0.008s
sys     0m0.000s

So, if you have some way to concatenate the records into one single file before passing it to iconv, it will go a lot faster. iconv reports the file position of the error, so if you build an index that lets you accurately map a file position back to a record number, that would likely work. If you are just doing validation and expect all records to pass normally, this may work for you.
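
For instance, assuming one record per line in records.txt, an index of each record's starting byte offset could be built with something like this (a sketch; run under LC_ALL=C so that awk counts bytes rather than characters):

LC_ALL=C awk '{ print NR, offset; offset += length($0) + 1 }' records.txt > offsets.txt

The record containing a failing byte position reported by iconv is then the last entry in offsets.txt whose offset does not exceed that position.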

But could you reprogram that part of the script in C? I suspect that with libiconv you can control the process much better in case many alien bytes have sneaked in.

Thanks for the reply.

That's a nice idea.

So to achieve that, I should work on building a map between ranges of character positions and record numbers.

But there is a potential problem with this approach.

Say there are 'n' records.

If iconv fails at the 3rd record (3 < n), the 3rd record has to be removed so that processing can continue with the 4th; iconv will not continue from where it failed until the offending record is removed.

So each time some record 'x' fails, it has to be removed before the remaining records can be processed.

Yes, that's why this works best if you are doing validation and would normally expect everything to pass.

Otherwise, if some records really do have problems, this shortcut will get quite messy. That's why I have the other suggestion of using libiconv: as far as I know, you can instruct it to skip bytes that cannot be converted and proceed, without stopping the iconv process. This cannot be achieved with the iconv executable alone, because there are no "hooks" that let you do that from the command line.

Loading the character tables is a very expensive operation, so starting iconv many times is bound to be slow. If you really have records in that volume, you should invest in a C program with libiconv that acts on a concatenated sequence of records. I have a good feeling that it would work, based on my earlier exploration of libiconv, although I have not built anything similar myself.
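
To make that concrete, here is a minimal sketch of the kind of program I mean (untested against your data; "BIG5" and "UTF-8" stand in for your ef1/ef2, and records are assumed to arrive one per line on stdin, short enough to fit the buffer). The point is that iconv_open() is called once, and an EILSEQ from iconv() is handled by skipping a byte and carrying on instead of restarting:

/* Sketch of a one-process validator: records arrive one per line on
 * stdin; "BIG5" / "UTF-8" stand in for ef1 / ef2.  Compile with cc,
 * adding -liconv on systems where libiconv is a separate library. */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <iconv.h>

int main(void)
{
    iconv_t cd = iconv_open("UTF-8", "BIG5");   /* (to, from) -- opened once */
    if (cd == (iconv_t)-1) {
        perror("iconv_open");
        return 1;
    }

    char line[65536];           /* assumes every record fits in here */
    char outbuf[262144];
    long recno = 0;

    while (fgets(line, sizeof line, stdin) != NULL) {
        recno++;
        char *in = line;
        size_t inleft = strlen(line);
        int bad = 0;

        while (inleft > 0) {
            char *out = outbuf;
            size_t outleft = sizeof outbuf;
            if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1) {
                if (errno == EILSEQ) {          /* bad byte: note it, skip it, go on */
                    bad = 1;
                    in++;
                    inleft--;
                } else if (errno == EINVAL) {   /* record ends mid-character */
                    bad = 1;
                    break;
                } else if (errno == E2BIG) {    /* output full: we only validate,
                                                   so just reuse the buffer */
                    continue;
                } else {
                    perror("iconv");
                    break;
                }
            }
        }
        if (bad)
            printf("record %ld failed conversion\n", recno);
        iconv(cd, NULL, NULL, NULL, NULL);      /* reset shift state per record */
    }

    iconv_close(cd);
    return 0;
}

With that structure you also get the record number of every bad record in a single pass, which is exactly the "hook" the command-line tool does not give you.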