A beginner needing some help programming documents

nomadblue · January 31, 2013, 10:53pm

Hi all,

I'm fairly new to shell programming and Python programming. I have a Mac (Mountain Lion, OS X 10.8.2) and use the Terminal for programming. I'm trying to use Unix tools to organize some language data that I am working with. Basically, I have two word lists that I need to combine into one.

Word list 1 (Chinese):

Word List 2 (Chinese pinyin with numerical tone mark):

ni3men
hao3 
Jia1ming2
ni3
hao3

My desired outcome would combine the numbers from the second word list with the characters in the first word list to look like this:

,3,,0
,3 
,1,,2
,3
,3

It is important that the format is "character", comma, "number".

So far I have done the following with wordlist two:

tr '[:alpha:]' ',' <WordList2.txt | tr -s ',' >WordList2B.txt 
paste WordList1.txt WordList2B.txt > CombinedWordList.txt
tr -d '\t' <CombinedWordList.txt | tr -s '[:space:]' >CombinedWordList2.txt

My current output document looks like this:

,3,
,3
,1,2
,3
,3
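
To see why the pipeline gives ",3," for "ni3men", here is a rough Python re-creation of the two tr steps applied to the pinyin list. It is only a sketch: squeeze_letters is a made-up helper name, and it treats only ASCII letters as alphabetic, which fits the sample data but may differ from what [:alpha:] covers in a given locale.

```python
import re

def squeeze_letters(pinyin: str) -> str:
    """Roughly mimic: tr '[:alpha:]' ',' | tr -s ','  (ASCII letters only)."""
    # Replacing each run of letters with a single comma gives the same result
    # as turning every letter into a comma and then squeezing repeated commas,
    # as long as the input contains no commas of its own.
    return re.sub(r"[A-Za-z]+", ",", pinyin)

for word in ["ni3men", "hao3", "Jia1ming2", "ni3", "hao3"]:
    print(squeeze_letters(word))
```

The syllable "men" carries no digit, so its comma ends up with nothing after it. That is the spot where a 0 would have to be added.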

It is almost there, but the first and third lines need to be further integrated so that the format is "character", comma, "number", "character", comma, "number". Every single Chinese character should be followed by a number. One additional problem is that some words (such as the second character in the first example, ",3,") do not have a corresponding number. In that case I would like a zero '0' to be inserted automatically, so the first word would appear as ",3,,0". So specifically, I need help with:
1) Formatting the document to appear as "character", comma, "number", "character", comma, "number" instead of "character" "character" comma "number" comma "number".
2) Inserting a zero '0' after the comma when there is not already a number.
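
The two requirements above can be sketched in Python, which indexes strings by character rather than by byte. This is a minimal illustration, not a tested solution for the real files: combine is a hypothetical helper, the Chinese characters in the example are assumed stand-ins for the word list, and the sketch assumes that any syllable without a tone digit comes at the end of the word.

```python
import re

def combine(chars: str, pinyin: str) -> str:
    """Pair each character with a tone digit, using 0 where none is left."""
    chars = chars.strip()                  # drop trailing blanks
    tones = re.findall(r"[0-9]", pinyin)   # tone digits, in order
    # Assumption: a missing tone belongs to the last syllable(s),
    # so the padding zeros go at the end.
    tones += ["0"] * (len(chars) - len(tones))
    return ",".join(f"{c},{t}" for c, t in zip(chars, tones))

# Assumed sample characters for the pinyin "ni3men".
result = combine("你们", "ni3men")  # -> "你,3,们,0"
```

A word whose toneless syllable comes first would get its digits in the wrong positions, so such data would need a real syllable splitter.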

Any help or suggestions would be greatly appreciated

RudiC · February 1, 2013, 5:26am

This is the first time I have had to struggle with UTF-8 characters, so I'm feeling a bit out of my depth, and you should take my proposal as a mere pointer in the right direction. On top of that, both your input files have trailing blanks, which I removed. If they are needed, you will have to add special handling to the code. Here's my modest approach:

awk    'NR==FNR {sub(/[^0-9]$/, "&0");gsub (/[0-9]/,",&,");  Ar[NR]=$2$4; next}
     {gsub (/.../,"&,"); $1=$1","substr (Ar[FNR],1,1); if ($2) $2=$2","substr (Ar[FNR],2,1)}
     1
    ' FS="," OFS="," file2 file3
,3,,0,
,3,
,1,,2,
,3,
,3,

The trailing commas, which I didn't bother to remove, come from my crude attempt to separate the Chinese syllables. I'm sure you have better means in your locale!

Don_Cragun · February 3, 2013, 4:46pm

Nomadblue,
RudiC's code looks reasonable, but I haven't been able to test it. I have found that awk on OS X Version 10.7.5 (Lion) counts bytes instead of characters in substr() and length(), and that using a regular expression to search for a space fails if the space follows a multibyte character (not just in awk, but also at least in bash, ed, ex, grep, ksh, sed, and vi). My testing was done with LANG set to en_US.UTF-8 and no LC_* environment variables set.
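
The difference between counting bytes and counting characters can be shown with a short Python sketch. The sample word is chosen purely for illustration and is not taken from the word lists. Python is used here only because its string type counts characters regardless of the locale.

```python
word = "你好"               # illustrative sample text
raw = word.encode("utf-8")  # the bytes a byte-oriented tool would see

print(len(word))  # length in characters
print(len(raw))   # length in bytes

# A character-aware "first character" versus a byte-oriented one.
first_char = word[:1]
first_byte = raw[:1].decode("utf-8", errors="replace")
print(first_char == first_byte)  # the single byte is not a whole character

# In this sample, three bytes cover exactly one character.
first_three_bytes = raw[:3].decode("utf-8")
print(first_three_bytes == first_char)
```

If awk really is working on bytes, that would also explain why a pattern such as /.../ appears to match one Chinese character at a time, since the characters in this sample take three bytes each in UTF-8.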

I would love to hear if this has been fixed in Mountain Lion.

************************
Update: I take back what I said about REs not matching spaces after multibyte characters. The characters that I originally thought were spaces were multibyte characters consisting of the octal byte sequences: 0343 0200 0200 and 0342 0200 0206. Those two characters aren't spaces, but they are in the locale's space character class.
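
For anyone who wants to identify those two byte sequences, the following Python sketch decodes them and looks up their Unicode names. The describe helper is a made-up name for illustration. Note that isspace() here reflects Python's own Unicode tables, not the space character class of the OS X locale.

```python
import unicodedata

# The two octal byte sequences quoted above.
sequences = {
    "0343 0200 0200": bytes([0o343, 0o200, 0o200]),
    "0342 0200 0206": bytes([0o342, 0o200, 0o206]),
}

def describe(raw):
    """Return (code point, Unicode name, is-whitespace) for one UTF-8 character."""
    ch = raw.decode("utf-8")
    return f"U+{ord(ch):04X}", unicodedata.name(ch), ch.isspace()

for label, raw in sequences.items():
    print(label, *describe(raw))
```

The first sequence decodes to U+3000 (IDEOGRAPHIC SPACE) and the second to U+2006 (SIX-PER-EM SPACE).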

---------- Post updated Feb 3rd, 2013 at 13:46 ---------- Previous update was Feb 2nd, 2013 at 23:13 ----------

The following script seems to do what you want except that it does not print any trailing space character class characters at the ends of the output lines. (Note that Word list 1 had a trailing character in the space character class on lines 3 and 5, Word list 2 on lines 2 and 3, and your desired outcome on lines 2 and 3. The output produced by this script does not include any characters in the space character class.)

#!/bin/ksh
# The awk on Mac OS X Version 10.7.5 does not meet POSIX/UNIX requirements for
# handling multibyte characters (it processes bytes instead of characters) at
# least in the length() and substr() functions.  This problem should be easy to
# handle in awk, but this script is written entirely as a ksh script which does
# handle multibyte characters correctly.  (The bash on OS X Version 10.7.5 also
# handles multibyte characters correctly and, although this script uses many
# features that are not defined by the standards, this script works both with
# ksh and bash on OS X.  If using this script on another system, you will need
# to use a 1993 or later version of ksh.)

# Read Chinese string.
while IFS="" read -r c
do      # Read corresponding Chinese pinyin string with tone marks.
        IFS="" read -r cp <&3
        # Strip a trailing space character class character from each string, if
        # there is one.
        c=${c%[[:space:]]}
        cp=${cp%[[:space:]]}
        # Is there a tone mark at the end of the Chinese pinyin string?
        if [[ ${cp:$((${#cp} - 1))} != [[:digit:]] ]]
        then    # No.  Add "0" as a tone mark.
                cp="${cp}0"
        fi
        # Strip everything but tone marks from the Chinese pinyin string.
        cp=${cp//[![:digit:]]/}
        # Print the Chinese characters with their corresponding tone marks.
        sep=""  # No separator for first character pair.
        for ((i = 0; i < ${#cp}; i++))
        do      printf "%s%s,%s" "$sep" "${c:$i:1}" "${cp:$i:1}"
                sep="," # Separator for all following character pairs.
        done
        # Add the trailing newline.
        echo
done < Word_list_1 3< Word_list_2
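
Since Python was mentioned as an option, here is a rough Python transcription of the same steps, offered as a sketch rather than a drop-in replacement. The function names tone_line and merge_lists are hypothetical, and the files are assumed to be UTF-8 encoded. Unlike the ksh script, it strips all trailing whitespace rather than a single character, and it stops at the shorter of the character string and the tone string.

```python
import re
from pathlib import Path

def tone_line(chars: str, pinyin: str) -> str:
    """Build one output line from a Chinese string and its pinyin string."""
    chars, pinyin = chars.rstrip(), pinyin.rstrip()
    # No tone mark at the end of the pinyin string?  Add "0".
    if not re.search(r"[0-9]$", pinyin):
        pinyin += "0"
    # Strip everything but tone marks.
    tones = re.sub(r"[^0-9]", "", pinyin)
    # Pair characters with tone marks, separated by commas.
    return ",".join(f"{c},{t}" for c, t in zip(chars, tones))

def merge_lists(chars_path, pinyin_path):
    """Read both word lists in step and return the combined lines."""
    chars_lines = Path(chars_path).read_text(encoding="utf-8").splitlines()
    pinyin_lines = Path(pinyin_path).read_text(encoding="utf-8").splitlines()
    return [tone_line(c, p) for c, p in zip(chars_lines, pinyin_lines)]
```

Calling merge_lists("Word_list_1", "Word_list_2") and printing the returned lines one per line would be the equivalent of running the script above.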