Shifting of data because of special characters

Hi Forum.

I have a unique problem that I'm hoping someone can assist me.

I'm generating a fixed width file and one of the output column (person_name at col. pos.#483 defined as string(36) sometimes contains french characters in the name and it causes the next column of data to shift to the left.

For example - First record is valid whereas the second record contains French characters and causes a shift of the next field to the left.

...0860LNAM01Tom*Brown                             999999999.....
...0860LNAM01RENLSQU                     999999999

To me, it appears that the person_name column is not using the full defined string(36) characters.

Any advice?

Thank you.

Can you show us the output from od -x for the above lines? That should give us the hex character codes to consider and maybe we will see something.

Thanks, in advance,
Robin

0000000 3830 3036 4e4c 4d41 3130 6f54 2a6d 7242
0000020 776f 206e 2020 2020 2020 2020 2020 2020
0000040 2020 2020 2020 2020 2020 2020 2020 3920
0000060 3939 3939 3939 3939 2020 2020 2020 2020
0000100 2020 3020 3031 3231 3030 5030 4553 464c
0000120 3030 2030 2020 2020 2020 2020 2020 2020
0000140 2020 2020 2020 2020 2020 2020 2020 2020
*
0000300 2020 2020 2050 2020 2020 2020 2020 2020
0000320 2020 2020 2020 2020 2020 2020 2020 2020
0000340 2020 2020 3020 3030 3030 3030 3030 3030
0000360 3231 3133 3332 3534 3736 2020 2020 2020
0000400 2020 2020 2020 2020 2020 2020 2020 2020
*
0000540 4520 554e 2020 2020 2020 2020 2020 2020
0000560 2020 2020 2020 2020 2020 2020 2020 2020
*
0000740 2020 2020 2020 4320 4441 2020 2020 2020
0000760 2020 2020 2020 2020 2020 2020 2020 2020
*
0001500 2020 2020 2020 2020 2020 3635 3730 3430
0001520 3638 2020 2020 2020 2020 2020 3020 3832
0001540 4c37 4441 3052 3031 3031 4831 4d4f 2045
0001560 2020 3231 2033 4241 2043 4452 2020 2020
0001600 2020 2020 2020 2020 2020 2020 2020 2020
*
0002020 2020 4f4d 544e c352 41a9 2d4c 4f4e 4452
0002040 2020 2020 2020 2020 2020 4351 4820 4831
0002060 3420 3259 2020 2020 2020 2020 2020 2020
0002100 2020 2020 2020 2020 2020 2020 2020 2020
0002120 3030 3030 3030 204e 2020 2020 2020 2020
0002140 2020 2020 2020 2020 2020 2020 2020 2020
0002160 2020 2020 2020 2020 2020 2020 3230 3430
0002200 444c 4345 4150 5350 4142 4354 2048 4154
0002220 474e 4b42 2020 2020 2020 2020 2020 2020
0002240 2020 2020 2020 2020 2020 2020 2020 2020
*
0002500 2020 2020 2020 2020 3030 3331 544c 5432
0002520 3031 3030 3130 3432 4c30 5355 3052 2031
0002540 2020 2020 2020 2020 2031 4e20 2020 2020
0002560 2020 2020 2020 2020 2020 2020 2020 2020
*
0003240 2020 2020 2020 2020 2020 2020 2020 3030
0003260 3030 3030 3030 2020 3043 3030 3030 3331
0003300 3436 3030 3130 3037 3030 3034 2020 2020
0003320 2020 2020 2020 2020 2020 2020 2020 2020
*
0004000 2020 3030 3030 2020 2020 2020 2020 2020
0004020 2020 2020 2020 2020 2020 2020 2020 2020
*
0004140 2020 2020 2020 2030 2020 2020 2020 2020
0004160 2020 2020 2020 2020 2020 2020 2020 2020
*
0004260 3020 3030 3030 3030 3030 3330 3030 3030
0004300 3030 3030 3030 3030 3030 3030 3030 3030
0004320 3030 3630 3032 3030 3030 3030 3030 3030
0004340 2030 2020 2020 2020 2020 2020 2020 2020
0004360 2020 2020 2020 2020 3020 3030 3030 3030
0004400 3030 3030 3030 3030 3030 3030 3030 3030
0004420 2030 2020 2020 2020 2020 2020 2020 2020
0004440 2020 2020 2020 2020 3020 3030 3030 3030
0004460 3030 3035 3030 3030 3030 3030 3030 3030
0004500 2030 2020 2020 2020 2020 2020 3020 3030
0004520 3030 3030 3030 3030 3039 3030 3030 3030
0004540 3030 3334 3033 3030 3030 3030 3030 3936
0004560 3035 3030 3030 3030 3030 3236 3035 3030
0004600 3030 3030 3030 3030 3030 3030 3030 3030
0004620 3030 3030 2030 2020 2020 2020 2020 2020
0004640 3020 3030 3030 3030 3030 3030 3030 3030
0004660 3030 3030 3030 3030 3030 3030 3030 3030
*
0004720 3030 3030 3030 3030 3030 3030 3030 3135
0004740 3238 3130 2038 2020 2020 2020 2020 2020
0004760 2020 2020 2020 2020 2020 2020 2020 2020
*
0005040 2020 2020 2020 2020 2020 2020 3120 3432
0005060 4c30 5355 3052 2032 2020 2020 2020 2020
0005100 2020 2020 2020 2020 2020 2020 2020 2020
*
0006600 2020 2020 2020 2020 3020 3030 3030 3030
0006620 3030 3334 3033 3030 3030 3030 3030 3334
0006640 3033 3030 3030 3030 3030 3035 3030 3030
0006660 3030 3030 3030 3030 3030 3030 3030 3030
*
0006740 3030 3030 3030 3630 3037 3030 3030 3030
0006760 3030 3630 3037 3030 3030 3030 3030 3030
0007000 3030 3030 3030 3030 3030 3030 3030 3030
*
0007040 3030 3030 2030 2020 2020 2020 2020 2020
0007060 3020 3030 3030 3030 3030 3035 2030 2020
0007100 2020 2020 2020 2020 2020 2020 2020 2020
*
0007400 2020 2020 3020 3030 5438 4e45 0d44 300a
0007420 3638 4c30 414e 304d 5231 4e45 a7ea 4caa
0007440 95e8 5385 5551 9ee7 20a0 2020 2020 2020
0007460 2020 2020 2020 2020 2020 2020 3920 3939
0007500 3939 3939 3939 2020 2020 2020 2020 2020
0007520 3020 3031 3231 3030 5030 4553 464c 3030
0007540 2030 2020 2020 2020 2020 2020 2020 2020
0007560 2020 2020 2020 2020 2020 2020 2020 2020
*
0007720 2020 2050 2020 2020 2020 2020 2020 2020
0007740 2020 2020 2020 2020 2020 2020 2020 2020
0007760 2020 3020 3030 3030 3030 3030 3030 3231
0010000 3133 3332 3534 3736 2020 2020 2020 2020
0010020 2020 2020 2020 2020 2020 2020 2020 2020
*
0010140 2020 2020 2020 2020 2020 2020 2020 4520
0010160 554e 2020 2020 2020 2020 2020 2020 2020
0010200 2020 2020 2020 2020 2020 2020 2020 2020
*
0010360 2020 2020 4320 4441 2020 2020 2020 2020
0010400 2020 2020 2020 2020 2020 2020 2020 2020
*
0011120 2020 2020 2020 2020 3335 3638 3433 3433
0011140 2020 2020 2020 2020 2020 3020 3832 4c37
0011160 4441 3052 3031 3031 4831 4d4f 2045 2020
0011200 4f4d 544e ea52 81a8 204c 2020 2020 2020
0011220 2020 2020 2020 2020 2020 2020 2020 2020
*
0011440 4f4d 544e e852 8c80 2020 2020 2020 2020
0011460 2020 2020 2020 2020 4351 4820 5431 3320
0011500 334e 2020 2020 2020 2020 2020 2020 2020
0011520 2020 2020 2020 2020 2020 2020 2020 3030
0011540 3030 3030 204e 2020 2020 2020 2020 2020
0011560 2020 2020 2020 2020 2020 2020 2020 2020
0011600 2020 2020 2020 2020 2020 3230 3430 444c
0011620 4345 4150 5350 4142 4354 2048 4154 474e
0011640 4b42 2020 2020 2020 2020 2020 2020 2020
0011660 2020 2020 2020 2020 2020 2020 2020 2020
*
0012120 2020 2020 2020 3030 3331 544c 5432 3031
0012140 3030 3130 3432 4c30 5355 3052 2031 2020
0012160 2020 2020 2020 2031 4e20 2020 2020 2020
0012200 2020 2020 2020 2020 2020 2020 2020 2020
*
0012660 2020 2020 2020 2020 2020 2020 3030 3030
0012700 3030 3030 2020 3043 3030 3030 3331 3436
0012720 3030 3130 3636 3034 3737 2020 2020 2020
0012740 2020 2020 2020 2020 2020 2020 2020 2020
*
0013420 3030 3030 2020 2020 2020 2020 2020 2020
0013440 2020 2020 2020 2020 2020 2020 2020 2020
*
0013560 2020 2020 2030 2020 2020 2020 2020 2020
0013600 2020 2020 2020 2020 2020 2020 2020 2020
*
0013660 2020 2020 2020 2020 2020 2020 2020 3020
0013700 3030 3030 3030 3030 3430 3033 3030 3030
0013720 3030 3030 3030 3030 3030 3030 3030 3030
0013740 3030 3030 3030 3030 3030 3030 3131 2032
0013760 2020 2020 2020 2020 2020 2020 2020 2020
0014000 2020 2020 2020 3020 3030 3030 3030 3030
0014020 3030 3030 3030 3030 3030 3030 3030 2030
0014040 2020 2020 2020 2020 2020 2020 2020 2020
0014060 2020 2020 2020 3020 3030 3030 3030 3030
0014100 3035 3030 3030 3030 3030 3030 3030 2030
0014120 2020 2020 2020 2020 2020 3020 3030 3030
0014140 3030 3030 3332 3036 3030 3030 3030 3030
0014160 3230 3034 3030 3030 3030 3030 3237 3036
0014200 3030 3030 3030 3030 3236 3037 3030 3030
0014220 3030 3030 3030 3030 3030 3030 3030 3030
0014240 3030 2030 2020 2020 2020 2020 2020 3020
0014260 3030 3030 3030 3030 3030 3030 3030 3030
*
0014340 3030 3030 3030 3030 3030 3030 3135 3238
0014360 3130 2038 2020 2020 2020 2020 2020 2020
0014400 2020 2020 2020 2020 2020 2020 2020 2020
*
0014460 2020 2020 2020 2020 2020 3120 3432 4c30
0014500 5355 3052 2032 2020 2020 2020 2020 2020
0014520 2020 2020 2020 2020 2020 2020 2020 2020
*
0016220 2020 2020 2020 3020 3030 3030 3030 3030
0016240 3230 3034 3030 3030 3030 3030 3230 3034
0016260 3030 3030 3030 3030 3035 3030 3030 3030
0016300 3030 3030 3030 3030 3030 3030 3030 3030
*
0016360 3030 3030 3734 3036 3030 3030 3939 3939
0016400 3939 3039 3030 3030 3030 3030 3030 3030
0016420 3030 3030 3030 3030 3030 3030 3030 3030
*
0016460 3030 2030 2020 2020 2020 2020 2020 3020
0016500 3030 3030 3030 3030 3030 2030 2020 2020
0016520 2020 2020 2020 2020 2020 2020 2020 2020
*
0017020 2020 3020 3030 5438 4e45 0d44 000a
0017035

What encoding / character set do you use? What locale ? Are your (text) tools multibyte encoding capable?

Hi pchang...

First observations are:

  1. You seem to have a flat file of pure spaces, ASCII character 0x20.
  2. The printout is little endian.
  3. The file has 2 occasions of a WINDOWS style <CR><NL> pair. 0d44 300a and 0d44 000a , that is xx0d 0axx when reversed...
  4. The unicode characters are doing exctly what they are supposed to do and fill up your spaces, however......
  5. As those spaces determine your layout then those uncode characters eat up your layout spaces anything from 1 to 3 spaces per character; not including the displayed character itself, depending on the unicode character.

I would suggest looking into RudiC's reply as a starter point for us to carry on...

Thanks for all of your replies.

We are using an ETL tool (informatica) to generate the file. Codepage is currently MS Windows Latin 1 and there are other codepages we can select from.

Not sure what the locale is - how do I find that out?

What be the OS and shell versions? Are you on a *nix system at all? If yes, the locale command will output your settings.
There seem to be a few non-ASCII characters in your file (as expected, BTW). Try the iconv or recode *nix commands to convert from your "codepage" to your locale char encoding.

Hi RudiC.

We are running our informatica application using Windows Server 2012 R2 OS.

The windows locale is defined as the following:

Current language for non-Unicode programs:

English (United States)

We also have cygwin installed on the same windows server to run our Unix scripts.

I ran the following command for the iconv program:

iconv -l
ANSI_X3.4-1968 ANSI_X3.4-1986 ASCII CP367 IBM367 ISO-IR-6 ISO646-US ISO_646.IRV:1991 US US-ASCII CSASCII
UTF-8
ISO-10646-UCS-2 UCS-2 CSUNICODE
UCS-2BE UNICODE-1-1 UNICODEBIG CSUNICODE11
UCS-2LE UNICODELITTLE
ISO-10646-UCS-4 UCS-4 CSUCS4
UCS-4BE
UCS-4LE
UTF-16
UTF-16BE
UTF-16LE
UTF-32
UTF-32BE
UTF-32LE
UNICODE-1-1-UTF-7 UTF-7 CSUNICODE11UTF7
UCS-2-INTERNAL
UCS-2-SWAPPED
UCS-4-INTERNAL
UCS-4-SWAPPED
C99
JAVA
CP819 IBM819 ISO-8859-1 ISO-IR-100 ISO8859-1 ISO_8859-1 ISO_8859-1:1987 L1 LATIN1 CSISOLATIN1
ISO-8859-2 ISO-IR-101 ISO8859-2 ISO_8859-2 ISO_8859-2:1987 L2 LATIN2 CSISOLATIN2
ISO-8859-3 ISO-IR-109 ISO8859-3 ISO_8859-3 ISO_8859-3:1988 L3 LATIN3 CSISOLATIN3
ISO-8859-4 ISO-IR-110 ISO8859-4 ISO_8859-4 ISO_8859-4:1988 L4 LATIN4 CSISOLATIN4
CYRILLIC ISO-8859-5 ISO-IR-144 ISO8859-5 ISO_8859-5 ISO_8859-5:1988 CSISOLATINCYRILLIC
ARABIC ASMO-708 ECMA-114 ISO-8859-6 ISO-IR-127 ISO8859-6 ISO_8859-6 ISO_8859-6:1987 CSISOLATINARABIC
ECMA-118 ELOT_928 GREEK GREEK8 ISO-8859-7 ISO-IR-126 ISO8859-7 ISO_8859-7 ISO_8859-7:1987 ISO_8859-7:2003 CSISOLATINGREEK
HEBREW ISO-8859-8 ISO-IR-138 ISO8859-8 ISO_8859-8 ISO_8859-8:1988 CSISOLATINHEBREW
ISO-8859-9 ISO-IR-148 ISO8859-9 ISO_8859-9 ISO_8859-9:1989 L5 LATIN5 CSISOLATIN5
ISO-8859-10 ISO-IR-157 ISO8859-10 ISO_8859-10 ISO_8859-10:1992 L6 LATIN6 CSISOLATIN6
ISO-8859-11 ISO8859-11 ISO_8859-11
ISO-8859-13 ISO-IR-179 ISO8859-13 ISO_8859-13 L7 LATIN7
ISO-8859-14 ISO-CELTIC ISO-IR-199 ISO8859-14 ISO_8859-14 ISO_8859-14:1998 L8 LATIN8
ISO-8859-15 ISO-IR-203 ISO8859-15 ISO_8859-15 ISO_8859-15:1998 LATIN-9
ISO-8859-16 ISO-IR-226 ISO8859-16 ISO_8859-16 ISO_8859-16:2001 L10 LATIN10
KOI8-R CSKOI8R
KOI8-U
KOI8-RU
CP1250 MS-EE WINDOWS-1250
CP1251 MS-CYRL WINDOWS-1251
CP1252 MS-ANSI WINDOWS-1252
CP1253 MS-GREEK WINDOWS-1253
CP1254 MS-TURK WINDOWS-1254
CP1255 MS-HEBR WINDOWS-1255
CP1256 MS-ARAB WINDOWS-1256
CP1257 WINBALTRIM WINDOWS-1257
CP1258 WINDOWS-1258
850 CP850 IBM850 CSPC850MULTILINGUAL
862 CP862 IBM862 CSPC862LATINHEBREW
866 CP866 IBM866 CSIBM866
CP1131
MAC MACINTOSH MACROMAN CSMACINTOSH
MACCENTRALEUROPE
MACICELAND
MACCROATIAN
MACROMANIA
MACCYRILLIC
MACUKRAINE
MACGREEK
MACTURKISH
MACHEBREW
MACARABIC
MACTHAI
HP-ROMAN8 R8 ROMAN8 CSHPROMAN8
NEXTSTEP
ARMSCII-8
GEORGIAN-ACADEMY
GEORGIAN-PS
KOI8-T
CP154 CYRILLIC-ASIAN PT154 PTCP154 CSPTCP154
KZ-1048 RK1048 STRK1048-2002 CSKZ1048
MULELAO-1
CP1133 IBM-CP1133
ISO-IR-166 TIS-620 TIS620 TIS620-0 TIS620.2529-1 TIS620.2533-0 TIS620.2533-1
CP874 WINDOWS-874
VISCII VISCII1.1-1 CSVISCII
TCVN TCVN-5712 TCVN5712-1 TCVN5712-1:1993
ISO-IR-14 ISO646-JP JIS_C6220-1969-RO JP CSISO14JISC6220RO
JISX0201-1976 JIS_X0201 X0201 CSHALFWIDTHKATAKANA
ISO-IR-87 JIS0208 JIS_C6226-1983 JIS_X0208 JIS_X0208-1983 JIS_X0208-1990 X0208 CSISO87JISX0208
ISO-IR-159 JIS_X0212 JIS_X0212-1990 JIS_X0212.1990-0 X0212 CSISO159JISX02121990
CN GB_1988-80 ISO-IR-57 ISO646-CN CSISO57GB1988
CHINESE GB_2312-80 ISO-IR-58 CSISO58GB231280
CN-GB-ISOIR165 ISO-IR-165
ISO-IR-149 KOREAN KSC_5601 KS_C_5601-1987 KS_C_5601-1989 CSKSC56011987
EUC-JP EUCJP EXTENDED_UNIX_CODE_PACKED_FORMAT_FOR_JAPANESE CSEUCPKDFMTJAPANESE
MS_KANJI SHIFT-JIS SHIFT_JIS SJIS CSSHIFTJIS
CP932
ISO-2022-JP CSISO2022JP
ISO-2022-JP-1
ISO-2022-JP-2 CSISO2022JP2
CN-GB EUC-CN EUCCN GB2312 CSGB2312
GBK
CP936 MS936 WINDOWS-936
GB18030
ISO-2022-CN CSISO2022CN
ISO-2022-CN-EXT
HZ HZ-GB-2312
EUC-TW EUCTW CSEUCTW
BIG-5 BIG-FIVE BIG5 BIGFIVE CN-BIG5 CSBIG5
CP950
BIG5-HKSCS:1999
BIG5-HKSCS:2001
BIG5-HKSCS:2004
BIG5-HKSCS BIG5-HKSCS:2008 BIG5HKSCS
EUC-KR EUCKR CSEUCKR
CP949 UHC
CP1361 JOHAB
ISO-2022-KR CSISO2022KR
CP856
CP922
CP943
CP1046
CP1124
CP1129
CP1161 IBM-1161 IBM1161 CSIBM1161
CP1162 IBM-1162 IBM1162 CSIBM1162
CP1163 IBM-1163 IBM1163 CSIBM1163
DEC-KANJI
DEC-HANYU
437 CP437 IBM437 CSPC8CODEPAGE437
CP737
CP775 IBM775 CSPC775BALTIC
852 CP852 IBM852 CSPCP852
CP853
855 CP855 IBM855 CSIBM855
857 CP857 IBM857 CSIBM857
CP858
860 CP860 IBM860 CSIBM860
861 CP-IS CP861 IBM861 CSIBM861
863 CP863 IBM863 CSIBM863
CP864 IBM864 CSIBM864
865 CP865 IBM865 CSIBM865
869 CP-GR CP869 IBM869 CSIBM869
CP1125
EUC-JIS-2004 EUC-JISX0213
SHIFT_JIS-2004 SHIFT_JISX0213
ISO-2022-JP-2004 ISO-2022-JP-3
BIG5-2003
ISO-IR-230 TDS565
ATARI ATARIST
RISCOS-LATIN1

Based on what we know so far (file created using codepage "MS Windows Latin 1" and locale of windows server is "English (US), what file type should I be using for the iconv command to convert the file type?

Thank you for all your help so far.

I don't know. This is very obscure. Where does the problem occur, in your cygwin environment?
You may have incompatibility problems between informatica and windows and cygwin, and I'm afraid I can't help on either.

I'm guessing it's between informatica and Windows.

We are not using cygwin (or running any type of scripts) against the file once it has been generated.

Is it just the french characters that are causing issues, if there are only Chinese characters in the name are the positions correct?