Explain iconv command

I have a requirement to remove all non-ASCII characters from a fixed-length file. I used the command below, which removes the special characters, but somehow the total record length comes out one character short. If the record contains a multi-byte string, then several characters at the end are truncated.

 cat non-ascii.txt | iconv -f utf8 -t ascii//TRANSLIT//IGNORE > ascii.txt 

Does not surprise me...
iconv's purpose is not to remove characters that fall outside a given charset, but to convert one charset to another...
How can you assume there are non-ASCII characters in UTF-8 the way you are doing?
And if the input is UTF-8, then yes, once converted to ASCII the record length will not be the same; the ASCII version is certainly shorter...
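
A quick check shows the shrinkage (assuming a UTF-8 terminal and glibc iconv, which transliterates Ñ to N):

$ printf 'Ñ' | wc -c
2
$ printf 'Ñ' | iconv -f utf8 -t ascii//TRANSLIT | wc -c
1

The two-byte Ñ becomes a one-byte N, so the record shrinks by one byte per such character.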

What is the output of the file command, i.e.

file non-ascii.txt

UTF-8 Unicode text, with very long lines

Changing UTF-8 to non-UTF-8 may change the length, as UTF-8 has variable-length characters.

For that matter, I'm not sure how much sense a 'fixed length' UTF-8 record makes. It can be N bytes but fewer than N characters.
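
For example, three euro signs are three characters but nine bytes (UTF-8 locale assumed):

$ printf '€€€' | wc -m
3
$ printf '€€€' | wc -c
9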


But with iconv, Ñ is converted to N. Since it is a multi-byte character, I need to add a space at the end of the record so the length does not change. Is that possible?

How are these records terminated?

dd can pad newline-terminated records to the appropriate length with spaces, but they must be newline-terminated. Here is a small utility that converts a stream of fixed-size records into newline-terminated ones:

// convert to an executable with cc block.c -o block
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
        long int BLOCK=0;
        char *buf=NULL;

        if(argc < 2)
        {
                fprintf(stderr, "usage:  %s blocksize [inputfile]\n", argv[0]);
                exit(1);
        }

        BLOCK=strtol(argv[1], NULL, 10);

        // Optional second argument: read from this file instead of stdin
        if(argc >= 3) {
                if(freopen(argv[2], "rb", stdin)==NULL)
                {
                        perror("Couldn't open");
                        return(1);
                }
        }

        if(BLOCK<=0)
        {
                fprintf(stderr, "Invalid block size %ld\n", BLOCK);
                return(1);
        }

        // One extra byte holds the newline appended after each record
        buf=malloc(BLOCK+1);
        if(buf == NULL)
        {
                perror("Couldn't allocate");
                return(1);
        }

        buf[BLOCK]='\n';

        // Copy full BLOCK-byte records, each followed by a newline.
        // A trailing partial record is silently dropped.
        while(fread(buf,BLOCK,1,stdin))
                fwrite(buf,BLOCK+1,1,stdout);

        free(buf);
        return(0);
}
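
For example, once compiled, a block size of 3 turns a continuous stream into newline-terminated records; note from the fread() loop that a trailing partial record is silently dropped:

$ printf 'abcdefgh' | ./block 3
abc
def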

Having done that, you can convert it with iconv, pad to the required length with dd, then remove the newlines with tr.

In my example, I delete capital letters from aAabBbcccdDdeeEfffggghhhiIijjjkkklLlmmmnnn without changing the record length.

#!/bin/sh
# dump.sh

BLOCKSIZE=3

# Add newlines using block utility
./block $BLOCKSIZE |
        # Remove characters, i.e. you'd put iconv here
        tr -d 'A-Z' |
        # Pad shrunken records with spaces
        dd ibs=$BLOCKSIZE cbs=$BLOCKSIZE conv=sync,block 2>/dev/null |
        # Remove newlines
        tr -d '\n'

$ cat fixed.txt
aAabBbcccdDdeeEfffggghhhiIijjjkkklLlmmmnnn

$ ./dump.sh < fixed.txt
aa bb cccdd ee fffggghhhii jjjkkkll mmmnnn   

$

You end up with one blank record at the end, full of spaces, which I haven't figured out how to avoid yet.
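
To adapt this to your file, the same pipeline with iconv doing the real work might look like the sketch below, where RECLEN is an assumed record length you would set to match your fixed-length file:

#!/bin/sh

# Assumed record length - adjust to your file
RECLEN=80

# Add newlines using block utility
./block $RECLEN < non-ascii.txt |
        # Convert to ASCII
        iconv -f utf8 -t ascii//TRANSLIT//IGNORE |
        # Pad shrunken records with spaces
        dd ibs=$RECLEN cbs=$RECLEN conv=sync,block 2>/dev/null |
        # Remove newlines
        tr -d '\n' > ascii.txt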

Not sure I understand correctly - you want to iconv multi-byte UTF-8 records to ASCII but retain the record length? So - if a two-byte representation (like Ñ) is converted to N, a space should be added, and for three bytes, two spaces, to keep the record length - PROVIDED the target representation is a one-byte char. This doesn't always hold true, e.g. € -> EUR in ASCII.
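
That expansion is easy to verify (glibc iconv assumed):

$ printf '€\n' | iconv -f utf8 -t ascii//TRANSLIT
EUR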
If the above assumption is true, some preconditioning before the iconv might help, like

LC_ALL=C sed 's/[\xC0-\xDF]./& /g; s/[\xE0-\xEF]../& /g' non-ascii.txt | iconv -futf8 -tASCII//TRANSLIT//IGNORE

which adds one space for a two-byte representation and two for a three-byte representation. For longer / more exotic sequences (e.g. four-byte UTF-8), it must be extended accordingly.
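
For example, a record containing the two-byte Ñ keeps its byte length (GNU sed assumed, as the \xHH escapes are a GNU extension):

$ printf 'Ñabc\n' | LC_ALL=C sed 's/[\xC0-\xDF]./& /g; s/[\xE0-\xEF]../& /g' | iconv -futf8 -tASCII//TRANSLIT//IGNORE
N abc

Both input and output are five bytes plus the newline.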
