Processing a Text File with Binary Values

Hello all,
I have a txt file containing millions of lines. Below is the example:

{tx:be} head -50 file.txt 
Instr1: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 

Instr1: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 

Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 

Instr1: 000000000000000000000000001100001010000000010011101001111000000000100010100111111110010000000000000000000000000000000000000001 

Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 

Instr1: 000000000000000000000000001100001010000000010010101001111000000000100010100111111111000000000000000000000000000000000000000001 

Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 

Instr1: 000000000000000000000000001100001010000000000000101000011000000000100010101000000000010000000000000000000000000000000000000001 

Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 

Instr1: 000000110110000000000000000100001010000000010011100101000000000000100010101000000001110000000000000000000000000000000000000001 

Instr1: 000000000100000000000000000001111110000000000000000000010000000000100010101000000010110000000000000010001001111011011100000111 

Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 

Instr1: 000000110111000111110000100100001010000000011100100101000000000000100010011110110111000000000000000000000000000000000000000001 

Instr1: 000001010110000000000100000100000000101001011101100110100000000000100010011110111000000000000000000000000000000000000000000001 

Instr1: 000001010110000000000011000100000000100101011101100110100000000000100010011110111001000000000000000000000000000000000000000001 

Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 

Instr1: 000000110111000111101110100100001010000000011100100101000000000000100010011110111010100000000000000000000000000000000000000001 

Instr1: 000000110111000111101110100100001010000000011100100101000000000000100010011110111010100000000000000000000000000000000000000001 

Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 

Instr1: 000000100110000000000001011100001110000000010011100100010000000000100010100111111110110000000000000000000000000000000000000001 

Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 

Instr1: 000000000000000000000000001100001010000000000000101000011000000000100010101000000000010000000000000000000000000000000000000001 

Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 

Instr1: 000000110110000000000000000100001010000000010011100101000000000000100010101000000001110000000000000000000000000000000000000001 

Instr1: 000000000100000000000000000001111110000000000000000000010000000000100010101000000010110000000000000010001001111011011100000111 

There are empty lines, which I remove using "sed '/^$/d' file.txt".

Now the problem is that I want to count the unique values in one bit field of the binary string. Here is what I want:
In the binary values, I want to find how many times each unique value of field [57:50] occurs (MSB -> 125, LSB -> 0). There are 126 bits in total on each line.
I have sorted the file using sort:

sort -k2.50,2.57 file.txt

output:
{tx:be} tail -50 file.txt 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000011111111110000000000100010100000001101110000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000100111111110000000000100010100000100101100000000000000000000000000000000000000001 
Instr1: 000000000010000000000000000010000010100111000100111111110000000000100010100000100101100000000000000000000000000000000000000001 
Instr1: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 
Instr1: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 

As you can see, the file is sorted on the field that I am interested in. Now I am not sure how to count the number of occurrences of each unique value in that field.

I have tried the uniq command, but surely it doesn't help:

uniq -c -f1 -s75 -w69 file.txt

Output: (truncated)
2751026 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 
     23 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001 
     23 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000001 
     23 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000000000000000000000000000001 
     24 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000000011000000000000000000000000000000000000000001 
     24 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000 
     22 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000000101000000000000000000000000000000000000000001 
     19 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000000110000000000000000000000000000000000000000001 
     17 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000000111000000000000000000000000000000000000000000 
     18 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000001 
     18 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000001001000000000000000000000000000000000000000001 
     17 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000001010000000000000000000000000000000000000000001 
     14 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000001011000000000000000000000000000000000000000001 
      8 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000001100000000000000000000000000000000000000000001 
     11 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000001101000000000000000000000000000000000000000001 
      6 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000001110000000000000000000000000000000000000000001 
      5 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000001111000000000000000000000000000000000000000001 
      1 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000000000000000000000000000000000 
      2 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000000000000000000000000000000001 
      4 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000010001000000000000000000000000000000000000000001 
      4 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000010010000000000000000000000000000000000000000001 
      4 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000010011000000000000000000000000000000000000000001 
      4 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000010100000000000000000000000000000000000000000001 
      3 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000010101000000000000000000000000000000000000000001 
     11 Instr1: 000000000100000000000000000000000000000000000000000000001000000000100001110001111111000000000000000010000111001000010010000001 
     12 Instr1: 0000000001000000000000000000000000000000000000000000000010

What I am looking for in the output is something like this (the values here are made up):

2000 Instr1[or any suitable text]: '00000000'
150 Instr1: '10001100'
120 Instr1: '00100000'
and so on

I think the 'uniq' command should be ok, but I am open to anything.

Thanks in advance.

How about

awk '/^ *$/ {next} {C[substr ($0, 50, 8)]++} END {for (c in C) printf "%4d Instr1: %s\n", C[c], c}' file
   3 Instr1: 00111001
   2 Instr1: 10111011
   2 Instr1: xxxxxxxx
  11 Instr1: 00000000
   2 Instr1: 00000001
   1 Instr1: 00100101
   4 Instr1: 00100111

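You may want to change $0 to $2 if you need to count character positions only within the string of binary digits. With $0, the 8 characters of "Instr1: " are counted too, so position 50 of the line is digit 42 of the binary string. Sort to taste...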
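To make the $2 variant concrete, here is a minimal sketch that counts the slice within the binary string only and then sorts the result with the highest count first. The function name count_slice and the short sample strings are made up for illustration; it assumes the binary string is always the second whitespace-separated field.

```shell
# count_slice START WIDTH: tally the WIDTH-character slice of field 2
# that begins at character START (1-based, counted from the left).
# Hypothetical helper name; reads lines from standard input.
count_slice() {
    awk -v start="$1" -v width="$2" '
        /^ *$/ { next }                         # skip empty lines
        { C[substr($2, start, width)]++ }       # count each distinct slice
        END { for (c in C) printf "%4d Instr1: %s\n", C[c], c }
    ' | sort -rn                                # most frequent first
}

# Tiny made-up sample: slice characters 3-4 of the second field.
printf 'Instr1: 001100\n\nInstr1: 111100\nInstr1: 000000\n' | count_slice 3 2
```

For the real file, something like count_slice 50 8 < file.txt would match the one-liner above, except that positions are counted inside the binary string rather than from the start of the line.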

If I understand your file contents (from your two examples) and assuming that the 1st field in your input is not always "Instr1:", you might want to try this slight modification to RudiC's suggestion:

#!/bin/ksh
file=${1:-file.txt}
awk '
!/^ *$/{c[$1 OFS substr($2, 50, 8)]++
}
END {	for(v in c)
		printf("%12d %s\n", c[v], v)
}' "$file" | sort -k1,1nr -k2,3

If file.txt contains your 1st sample (unsorted, with blank lines) and file2.txt contains your 2nd sample (sorted, with no blank lines), the above script, when invoked without operands, produces the output:

           9 Instr1: 00000000
           5 Instr1: 00101000
           2 Instr1: 00000010
           2 Instr1: 00110100
           2 Instr1: 01000011
           2 Instr1: 01001111
           2 Instr1: xxxxxxxx
           1 Instr1: 00100010

and, if it is invoked with the operand file2.txt, produces the output:

          48 Instr1: 11111110
           2 Instr1: xxxxxxxx

This was written and tested using a Korn shell, but will work with any shell that uses Bourne shell syntax and understands the parameter expansions required by the POSIX standards (such as bash, dash, ksh, and zsh).

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.

Is there any way to count from the LSB (which is at the right-hand end of the binary string in this case)?

Yes, of course. Simple arithmetic. Any idea from your side?

I am assuming a simple C[-1] would start from the end of the line.

In shell, yes. In awk, sorry, no. Use the length function and subtract the target position.
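
The arithmetic could look like this. If bit 0 is the rightmost character of an n-character string, bit b sits at 1-based position n - b, so field [hi:lo] starts at position n - hi and is hi - lo + 1 characters wide. For the 126-bit lines in this thread, [57:50] would therefore start at character 69 of the binary string. A minimal sketch, assuming the binary string is the second whitespace-separated field; count_bits is a hypothetical name and the 8-bit sample is made up:

```shell
# count_bits HI LO: tally the values of bit field [HI:LO] in field 2,
# where bit 0 is the LAST character of the string (LSB on the right).
# Hypothetical helper name; reads lines from standard input.
count_bits() {
    awk -v hi="$1" -v lo="$2" '
        /^ *$/ { next }                              # skip empty lines
        {
            start = length($2) - hi                  # 1-based position of bit HI
            C[substr($2, start, hi - lo + 1)]++      # count each distinct value
        }
        END { for (c in C) printf "%4d Instr1: %s\n", C[c], c }
    ' | sort -k1,1nr -k3,3                           # by count, then by value
}

# Made-up 8-bit sample: bits [3:2] are the 5th and 6th characters.
printf 'Instr1: 00001100\nInstr1: 11111100\nInstr1: 00000001\n' | count_bits 3 2
```

On the real data this would be invoked as count_bits 57 50 < file.txt. Because the start position is derived from length($2) on every line, it does not depend on all lines having exactly 126 digits.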