Problem with cut and wc

Hi, I am facing an issue with cut and wc. Here is a sample.

The data in the file:

tail -1 05_19_BT_TBL_LOAD_20120524064242.bad|cut -c9-58
WatsSaver - AGGREGATED PLAN1581 CALLS FOR 2872.6

tail -1 05_19_BT_TBL_LOAD_20120524064242.bad|cut -c9-58|wc -c
        51
  
tail -1 05_19_BT_TBL_LOAD_20120524064242.bad|cut -c9-59
WatsSaver - AGGREGATED PLAN1581 CALLS FOR 2872.6 M

 tail -1 05_19_BT_TBL_LOAD_20120524064242.bad|cut -c9-59|wc -c
      52

It should give 50 and 51 respectively, so why 51 and 52? Does it count some character twice?

It counts the linefeed at the end of the line as a character.

echo "1234"|wc -c
5
echo "1234"|tr -d '\n'|wc -c
4
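
A quick side check without tr: printf, unlike echo, does not append a newline, so the count comes out as you would expect:

printf "1234"|wc -c
4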

When you ask cut for columns 9 through 58, that is 50 characters; wc -c then adds one for the trailing newline, so 51 is what you should expect.

This data is being loaded into the database with SQL*Loader, based on character position. The loader threw the error "value too large for column" (actual: 51, maximum: 50).

I saw the error and tried to reproduce it in the shell with the actual data: I took the data from positions 9-58 and counted it. Other lines loaded successfully, but this one raised the error. That is my confusion: how is it counting one more character?

OK, as methyl said, the newline is also counted. An octal dump will verify this.

tail -1 test1
done

tail -1 test1|cut -c 1-3
don

tail -1 test1|cut -c 1-3|wc -c
       4

tail -1 test1|cut -c 1-3|od -bc
0000000  144 157 156 012
           d   o   n  \n
0000004

You can use tr to delete the new-line...
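
For example, the same pipeline with the newline stripped before counting reports 3 instead of 4, and against your original file the 9-58 cut should come back as 50 instead of 51:

tail -1 test1|cut -c 1-3|tr -d '\n'|wc -c
       3

tail -1 05_19_BT_TBL_LOAD_20120524064242.bad|cut -c9-58|tr -d '\n'|wc -c
        50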

Okay, thanks a lot for helping me understand how wc -c works. But the confusion remains on the SQL*Loader side: for the same data, other rows have loaded but some have not. Is it possible that other characters are included in the same line but are not visible to me at the shell prompt?

What Operating System and version are you running, and what shell is this?

Are any Microsoft Operating Systems involved anywhere in the process?

What character set do you use? There is scope for UTF to give misleading character counts if the database was not set to expect UTF.

Is this all within the same computer? i.e. no file transfers, conversions or whatever?

Try the od command mentioned in earlier posts on a bad record.
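
For example (using the file name from your earlier posts; point it at whichever record was actually rejected):

tail -1 05_19_BT_TBL_LOAD_20120524064242.bad|cut -c9-58|od -bc

Any byte shown above octal 177 is outside plain 7-bit ASCII and is a likely suspect.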

I am using Solaris 10.
No Microsoft OS is involved, except that I use PuTTY to log in to the UNIX machine from my Windows client machine.
We use the plain ASCII character set.
Yes, it is all on the same computer, except that the database is on another machine; that shouldn't matter.

... and the results from od on a bad record?

The records have special characters, something like \224 and \230, each of which appears as a single character at the shell prompt when the file is opened in vi.

Hmm. Extended ASCII (greater than octal 177, i.e. decimal 127). They probably came from a Microsoft system, or possibly from a foreign-language character set. There is potential for these characters to be converted to a two-byte UTF sequence by your SQL*Loader, or possibly to some other multi-character sequence (like the octal sequences cited).
We know so little about your software that this is pure speculation.
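
As a rough check you can count how many bytes in the file fall outside the 7-bit ASCII range (a sketch; if the default /usr/bin/tr on Solaris 10 complains about the octal range, try /usr/xpg4/bin/tr):

tr -cd '\200-\377' < 05_19_BT_TBL_LOAD_20120524064242.bad | wc -c

A non-zero count confirms Extended ASCII bytes like the \224 and \230 you saw.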

The files come from a mainframe to UNIX via the Connect:Direct software, so that could be one reason.

This is a common problem in multi-platform system design.
I recommend that you instigate a systematic software re-test on a test system, paying particular attention to Extended ASCII character sets.

The big decision is what to do with each Extended ASCII character such that the data loads correctly into your database. This is not trivial.
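
As a sketch of the simplest mechanical option only, you could strip the Extended ASCII bytes before the load (the .clean file name here is just an example, and whether to delete, replace, or properly translate each character is the business decision mentioned above):

tr -d '\200-\377' < 05_19_BT_TBL_LOAD_20120524064242.bad > 05_19_BT_TBL_LOAD_20120524064242.clean

If those characters carry meaning, the translation table has to come from whoever owns the mainframe data.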