Identify high values "ÿ" in a text file using a Unix command

I have high values (such as ÿÿÿÿ) in a text file on a Unix AIX server. I need to identify all the records
that contain these high values, and if possible also get the position/column number within the record structure. Is there
any Unix command that can:

  1. Get the number of occurrences of high values in the file
  2. Get the position/column of it in the record structure (optional)

I tried the option of echo "ÿ" but it is not able to detect them. The ASCII equivalent of "ÿ" is 255, and I have tried
grepping by that ASCII value, but got no results.

Please let me know if there is any way to achieve this.

Thanks!

 
Can you try something like this:
 
cat > input_file
abcZ]fdsGGG

od -An -to1 input_file
 141 142 143 132 135 146 144 163 107 107 107 012
 
Now, based on the octal value (od -to1 prints octal, so 0xFF shows up as 377), you can search for the special character and work out its position as well.
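Along the same lines, od can count the 0xFF bytes directly instead of eyeballing the dump. A minimal sketch (the sample file name and contents are made up for illustration):

```shell
# Write a sample file containing three 0xFF bytes (printf '\377' emits one).
printf 'ab\377cd\377\n\377ef\n' > /tmp/hv_sample.txt

# Dump every byte as unsigned decimal, one value per line, then count the 255s.
count=$(od -An -tu1 -v /tmp/hv_sample.txt | tr -s ' ' '\n' | grep -c '^255$')
echo "$count"
```

The -tu1 format avoids octal/decimal confusion, and -v stops od from collapsing repeated lines with *, which would throw off the count.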
 

Hi Panyam,

We have huge files to work with, and we are looking for a faster and better way than converting each character to its ASCII value. Also, the record layout changes when we try your command, which would give an incorrect position. To mention, all the files consist of fixed-length records.

Is there a way to run grep using the ASCII value of this special character directly? Is there any better way to do this? We just need to locate this special character and find the number of occurrences, the number of records it impacts, and the fields impacted based on the position. Any help on this is highly appreciated.
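For what it's worth, grep can be fed the byte itself. A hedged sketch, assuming AIX ksh (which lacks the $'\377' quoting of newer shells) and a single-byte locale; the sample file is invented:

```shell
# Generate the 0xFF byte with printf, since ksh88 has no $'\377' syntax.
ff=$(printf '\377')

# Sample data: two of the three records contain the byte.
printf 'hit\377here\nclean\n\377twice\377\n' > /tmp/hv_grep.txt

# -c counts matching records; -n would list their record numbers instead.
records=$(LC_ALL=C grep -c "$ff" /tmp/hv_grep.txt)
echo "$records"
```

LC_ALL=C keeps grep working byte-by-byte, so a UTF-8 locale does not reject 0xFF as an invalid character.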

Thanks for replying.

What Operating System and version are you running?
What Shell do you use?
Do you have a high-level language such as Oracle available, or are you trying to do this with Shell tools and Unix commands?

How big are the files?
How long is each record?

Are you looking for characters 128-255 inclusive, or just character 255, or something else?

What are you going to do with the results? Are you going to try to change characters?

Methyl,
Below are the answers.
What Operating System and version are you running? AIX 5.3
What Shell do you use? Korn Shell
Do you have a high-level language such as Oracle available, or are you trying to do this with Shell tools and Unix commands? Oracle is not readily available; trying with shell and Unix commands.
How big are the files? Files fall in the range of 20MB-10GB.
How long is each record? Record lengths fall in the range of 1-1000, and some even above.
Are you looking for characters 128-255 inclusive, or just character 255, or something else? Just the 255 character, i.e. small y with diaeresis, as mentioned earlier.

What are you going to do with the results? Are you going to try to change characters? I will not be changing/replacing this character. I just need the total number of occurrences in the file, the number of records having this character, the number of occurrences per record, and the position of the character to identify which field is impacted. These results are needed for analysis.

Something to start with, assuming your LOCALE is set correctly:

nawk 'BEGIN{y=sprintf("%c", 255)} $0 ~ y {n+=gsub(y,""); r++} END{printf("totalRecords->%d totalChars->%d\n", r, n)}' myFile
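Extending the same idea to the per-record counts and positions asked for above - a sketch only, not tested on AIX nawk; LC_ALL=C forces byte semantics, and the demo file is invented for illustration:

```shell
# Demo input: records 1 and 3 contain the 0xFF byte.
printf 'ab\377cd\nplain\n\377\377x\n' > /tmp/hv_pos.txt

out=$(LC_ALL=C awk 'BEGIN { y = sprintf("%c", 255) }
{
    pos = ""
    # Scan each column; collect 1-based positions of the 0xFF byte.
    for (i = 1; i <= length($0); i++)
        if (substr($0, i, 1) == y) { n++; pos = pos " " i }
    if (pos != "") { r++; printf("record %d:%s\n", NR, pos) }
}
END { printf("totalRecords->%d totalChars->%d\n", r, n) }' /tmp/hv_pos.txt)
echo "$out"
```

Since the records are fixed-length, the reported column maps straight onto the field layout.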