Finding distinct characters from flat file

Krishanu_Saha · April 8, 2013, 12:10pm

Hi, I need some help.

I have a file that contains the following data:
a
b
c c
d d d
e
f

Now I need to find the distinct characters in this file, and the output should be as follows:

a
b
c
d
e
f

Can you please help me with this? I'm using a KSH script.

Thanks,
Krishanu

Yoda · April 8, 2013, 12:11pm

Is this homework? If not, can you show us what you have tried so far?

Krishanu_Saha · April 8, 2013, 12:15pm

This is not homework. I'm working on a project where I have a similar file containing a list of special characters (removed from an XML file) in exactly the same format. Now I need to find the records in the XML files that contain those special characters. Everything else is done and this is the only pending item, which is why I need help. Any help will be highly appreciated.

Thanks.

Yoda · April 8, 2013, 12:25pm

Using Associative Array in KSH:

#!/bin/ksh

typeset -A ARR

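while read -r line
do
        for char in $line       # unquoted on purpose: split the line on whitespace
        do
                ARR[$char]=1    # store each token as a key; repeats overwrite the same key
        done
done < file

for k in "${!ARR[@]}"
do
        print -r -- "$k"
done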

Using awk program:

awk '{ for(i=1;i<=NF;i++) !A[$i]++ } END { for(i in A) print i } ' file

Krishanu_Saha · April 8, 2013, 1:27pm

Thanks Yoda. I've tried both options but did not get the expected result.

When I use the first option within a function, it shows the error "typeset bad option", and when I use the second option with nawk, it runs, but I get the same list of characters in the output, not the distinct characters.

Are there any other options?

Corona688 · April 8, 2013, 1:29pm

I think awk needs the -F"" option; otherwise it will split on whitespace, not characters.

Krishanu_Saha · April 8, 2013, 1:52pm

Somehow "typeset -A ARR" is not working in the script; every time it displays the error message "typeset bad option(s)". Can anyone please help me find the distinct character list?

Thanks,
Krishanu

ahamed101 · April 8, 2013, 2:42pm

Yoda's awk solution works for me. What do you mean you are not getting the expected result?

[root@Imperfecto_1 ~]# cat infile
a
b
c c
d d d g
e
f
[root@Imperfecto_1 ~]# awk '{ for(i=1;i<=NF;i++){a[$i]} } END{for (i in a) {print i}}' infile
a
b
c
d
e
f
g

And for typeset error, try typeset -a ARR

man typeset
-a     Each name is an array variable (see Arrays above).

I used bash though, not sure about ksh

--ahamed

Yoda · April 8, 2013, 2:49pm

Please note that the -A option, used to define associative arrays, is available only in modern KSH versions (ksh93 and later). The -a option is used to define indexed arrays.

From KSH manual:

-A     Declares vname to be an associative array.
-a     Declares vname to be an indexed array.

Krishanu_Saha · April 8, 2013, 3:09pm

Hello all, nawk is now working for me too. Thank you, guys.

But I do not know why the typeset option is not working for me. I have tried all the options and even tried declare.

Thank you Guys....

Corona688 · April 8, 2013, 3:21pm

You probably have an old version of ksh that lacks this feature.
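
If upgrading ksh is not an option, the whitespace-separated case does not need arrays at all. A minimal sketch using only standard tools; the function name distinct_tokens is made up for illustration, and it assumes the tokens are separated by spaces or tabs:

```shell
# Hypothetical helper: print each distinct whitespace-separated token once.
distinct_tokens() {
    # Turn every blank or tab into a newline (squeezing runs of them),
    # drop any empty line left by leading blanks, then sort and de-duplicate.
    tr -s ' \t' '\n\n' | grep -v '^$' | sort -u
}
```

Call it as: distinct_tokens < file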

Krishanu_Saha · April 8, 2013, 10:37pm

Hello all, one quick question. If the file contains the character list in the following way, the suggested awk command works fine. But if there is no space between two characters on a single line (e.g. CC instead of C C), the command does not work. How do I get the same result when there is no space between the characters?

Character List:

a
b
c c
d d d g
e
f

Yoda · April 8, 2013, 11:14pm

awk ' {
        v = $0
        gsub (/[ \t]+/, "", v)
        for ( i = 1; i <= length(v); i++ )
        {
                d = substr ( v, i, 1 )
                A[d]
        }
} END {
        for ( k in A )
                print k
} ' file
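
Note that awk does not specify the order of a for ( k in A ) loop, so the characters can come out in any order. A minimal sketch that prints each character the first time it is seen, which keeps the input order; the function name distinct_chars and the array name seen are made up for illustration:

```shell
# Hypothetical helper: print each distinct non-blank character once,
# in the order it first appears in the input.
distinct_chars() {
    awk '{
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (c == " " || c == "\t")
                continue            # skip blanks and tabs
            if (!(c in seen)) {
                seen[c]             # remember the character
                print c             # first occurrence: print it now
            }
        }
    }' "$@"
}
```

Call it as: distinct_chars file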

hanson44 · April 8, 2013, 11:14pm

$ cat input.txt
a
b hi
cc
d d d g
 e
f

$ sed "s/\(.\)/\1\n/g" input.txt | sort | uniq | grep -v "^ *$"
a
b
c
d
e
f
g
h
i

Krishanu_Saha · April 8, 2013, 11:32pm

Thanks a lot, Yoda. The awk code is working fine and gives exactly the result I'm expecting.

alister · April 8, 2013, 11:51pm

A few thoughts regarding your solution ...

Most sed implementations do not recognize the "\n" escape sequence in replacement text. Typically, a backslash followed by a newline is required.

The \1 backreference isn't necessary, since the entirety of the matching text is all that's needed and sed already makes that available via &.

You don't need to use uniq. sort -u will do the job.

grep isn't necessary. Since you are already using sed, you can use it to delete blanks at the start.

An untested variation of your approach:

sed 's/ //g; /./!d; s/./&\
/g; s/.$//' file | sort -u

Equally untested code:

tr -d ' \b\t\r' < file | fold -w1 | sort -u

Regards,
Alister

hanson44 · April 9, 2013, 12:39am

Thank you for the thoughtful suggestions.

Corona688 · April 9, 2013, 11:50am

The backslash would prevent a literal newline from being fed into sed. Leave it off.

$ echo "a\
b"

ab

$ echo "a
b"

a
b

$

alister · April 9, 2013, 12:06pm

You are mistaken. To portably insert a newline in sed replacement text, a backslash-newline pair is required.

From POSIX sed:

What you are demonstrating is sh line continuation. This is unrelated to the requirements of a sed script.
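
A small demonstration of the difference, as a sketch (the variable names single and double are only for illustration). Inside single quotes the shell leaves the backslash-newline alone, so sed receives it as an escaped newline in the replacement. Inside double quotes the shell removes it as a line continuation before sed ever runs:

```shell
# Single quotes: sed receives the backslash-newline and inserts a newline after "a".
single=$(printf 'ab\n' | sed 's/a/a\
/')

# Double quotes: the shell strips the backslash-newline first,
# so sed only sees the script s/a/a/ and leaves the line unchanged.
double=$(printf 'ab\n' | sed "s/a/a\
/")

printf 'single-quoted script:\n%s\n' "$single"   # prints a and b on separate lines
printf 'double-quoted script:\n%s\n' "$double"   # prints ab
```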

Regards,
Alister