Problem in comparing 2 files string by string

jitendra.pat04 · August 7, 2011, 9:53am

Hi Champs,
I am a newbie to unix world, and I am trying to built a script which seems to be far tough to be done alone by me.....

" I am having a raw csv file which contains around 50 fields..."
From that file I have to grep 2 fields "A" and "B"....field "A" is to be aligned vertically while field "B" is to be aligned horizontally...

Both the fields may have around 10-20 different numbers...so that I would be getting two arrays of numbers...vertical for field A and horizontal for field B....
#########CODE#########

#!/bin/bash
cat raw_file.csv | awk -F "," '{ print $A}' | sort | uniq -c | sort -k 2g 

cat raw_file.csv | awk -F "," '{ print $A}' | sort | uniq -c | sort -k 2g

#############
This code is giving me the unique counts for both fields in sorted form....
now I have made 2 files 1.txt and 2.txt having sorted array of fields A and B....
##############CODE###########

cat raw_file.csv | awk -F "," '{ print $A}' | sort | uniq -c | sort -k 2g | awk '{print $2}' > 1.txt

cat raw_file.csv | awk -F "," '{ print $B}' | sort | uniq -c | sort -k 2g | awk '{printf "%5s", $2}' > 2.txt

paste 1.txt 2.txt > 3.txt

##################
here I got the sorted aligned 2 arrays in file 3.txt.....
upto here is fine..

I need the individual count for each string of hoz array wrt each string of vertical array......
I want the output like....
##########

          B1 B2 B3 B4
A1     
A2   
A3
A4

###############
(vertical array is also having some blank values....)

Please help.....!!!!

birei · August 7, 2011, 12:24pm

Hi,

An example of your input file would be useful to try to get a solution. At least for me.

Regards,
Birei

agama · August 7, 2011, 1:49pm

If I am reading your post correctly, what you have is a file like this:

a,b,c,d,e,f,g,h
g,b,c,d,f,a,i,j
c,e,d,f,h,a,b,a
h,b,j,a,b,c,d,f
a,b,c,e,a,b,c

And for any two colums (2 and 3 lets say) you want to know for each value in column two, the number of times a value in column three appeared with that value. For instance, the value 'b' appears in column 2 with the value 'c' three times and the value j once. So for the above matrix, the desired output would be:

     COLA/COLB    c   d   j
      b           3   0   1
      e           0   1   0

which shows that column 2 (your A column) had two values (b and e) and column 3 (your B column) had three values (c, d, and j) with the listed combinations of counts.

If this makes sense, and is what you desire, then I think all you need is a simple awk to process your csv file directly:

#!/usr/bin/env ksh

awk -F "," -v c1=${1:-1} -v c2=${2:-2} '
    {
        if( !seen_1[$(c1)]++ )
            order_1[o1idx++] = $(c1);

        if( !seen_2[$(c2)]++ )
            order_2[o2idx++] = $(c2);

        count[$(c1),$(c2)]++;
    }

    END {
        printf( "%15s ", "COLA/COLB" );
        for( j=0; j < o2idx; j++ )              # header
            printf( "%15s ", order_2[j] );
        printf( "\n" );

        for( i = 0; i < o1idx; i++ )            # print matrix
        {
            printf( "%15s ", order_1 );
            for( j=0; j < o2idx; j++ )
                printf( "%15d ", count[order_1,order_2[j]] );

            printf( "\n" );
        }
    }
'

exit

The script takes colA and colB from the command line and reads the csv file from stdin. If you put this into foo.ksh, the command line to look at columns 5 and 10 would be:

foo.ksh 5 10 <file.csv

If no colums are given on the command line, the defaults are columns 1 and 2.

If this isn't what you had in mind, please clarify and add some sample input and desired output based on the sample.

jitendra.pat04 · August 9, 2011, 3:39pm

Thanks very much friend,
Basically my problem is solved upto 80%....but here are some more modifications that are needed in the o/p....
1) my csv file is very big and having around 50 fields out of which for field 39 there are 4 values say 1,2,3,4....
for each value of 39th field I have to manipulate the file like such that field 21 comes in vertical and field 26 comes in horizontal.....this 26th field is having many entries upto 15-20......for example....
cat jitendra.csv | awk -F "|" '{if ($39==3) print $0}'.....now for this manipulated file I have to work accordingly for field 21 & 26.......

2) the o/p for 21 and 26 field is not in sorted manner,,, actually I need it in sorted manner......

3) some of the values in field 21(vertical) are blank......so for that blank values I would like to print a "blank"....but no entry in hoz array is blank....

4) also the o/p is not aligned as pasted below.....

####################
COLA/COLB 9008 8739 8822
0 3281 874 389
255 123 18 11
1 3 2 0
133 6 0 0
254 120 6 3
19 240 7 17
137 110 13 3
34 22 0 0
17 4 0 0
###################

..................................................
Actually friend I had to do a lot of modifications more n this file......aside to the above queries.....

can u just help me in understanding that code....because I am a newbie and half of the code is not understandable by me.....
so that I could make some modifications by myself.....

............Any help wud be highly appreciated....thanx in advance.....

agama · August 9, 2011, 9:43pm

Here's the code with comments in hopes that you can figure out what is going on. I added a sort (some awk versions don't have a built-in sort function, so I wrote one that shouldn't be too inefficient based on the number of rows/columns you've indicated.

#!/usr/bin/env ksh

awk -F "," -v c1=${1:-1} -v c2=${2:-2} '
    function sort_order(  what, l,       i, j )
    {
        for( i = 0; i < l; i++ )        # big pass across array
        {
            big = 0;                      # set first entry as the largest value
            for( j = 1; j < l-i; j++ )       # small pass (up to length - i)
                if( what[j] > what[big] )  # if this value larger, mark it 
                    big = j;
            if( big != j-1 )              # if largest is not the last checked
            {
                hold = what[j-1];      # swap the last for the largest
                what[j-1] = what[big];
                what[big] = hold;
            }
        }

    }

    {
        if( $(c2) == "" )                   # if value in c2 is missing, we dont count
            next;

        if( $(c1) == "" )
            $(c1) = "BLANK";                # easy eycatcher for empty field

        if( !seen_1[$(c1)]++ )
            order_1[o1idx++] = $(c1);       # order each c1 was observed

        if( !seen_2[$(c2)]++ )
            order_2[o2idx++] = $(c2);       # order each c2 was observed

        count[$(c1),$(c2)]++;               # count the number of times the pair (c1,c2) appear together
    }

    END {
        sort_order( order_1, length( order_1 ) );       # sort values from c1, and c2
        sort_order( order_2, length( order_2 ) );

        printf( "%15s ", "COLA/COLB" );         # print the header line using the order_2 list
        for( j=0; j < o2idx; j++ )
            printf( "%15s ", order_2[j] );      # %15s will align columns based on a width of 15
        printf( "\n" );                         # new line for first row of matrix

        for( i = 0; i < o1idx; i++ )            # print matrix -- for each row (c1 value)
        {
            printf( "%15s ", order_1 );      # print the c1 value (again width of 15)
            for( j=0; j < o2idx; j++ )          # for each column (c2 values)
                printf( "%15d ", count[order_1,order_2[j]] );    # print each (width 15 again)

            printf( "\n" );                     # end the row by printing a newline
        }
    }
'

exit

As for it not lining up correctly, that seems odd. Each field is printed with a width of 15 characters (%15d or %15s) and provided that none of the data is more than 15 characters wide it should space out nicely. You might need to reduce the setting (from 15 to 8 or something like that) in order to suit your needs. The programme also prints the word "BLANK" for empty data (in column A) and does not count anything if column B is missing. I assumed that missing columns were adjacent comas (,,).

For testing I ran this small set of data:

6000,1000,100,5000,3000,100,1000,1000,2000,4000
2000,7000,4000,,8000,9000,3000,100,1000,2000
4000,6000,6000,9000,100,7000,8000,7000,9000,8000
4000,4000,5000,,8000,1000,6000,6000,5000,9000
9000,5000,8000,100,6000,4000,8000,8000,9000,7000
7000,1000,100,7000,8000,4000,6000,2000,5000,6000
100,5000,6000,3000,5000,5000,100,4000,4000,6000
7000,9000,4000,6000,4000,8000,8000,6000,6000,9000
9000,1000,4000,,5000,1000,1000,2000,9000,8000
7000,1000,5000,1,5000,3000,100,3000,4000,1000

to ensure sorting and blanks worked. With the command test_script 4 5 <test_data the following was the output:

      COLA/COLB             100            3000            4000            5000            6000            8000 
              1               0               0               0               1               0               0 
            100               0               0               0               0               1               0 
           3000               0               0               0               1               0               0 
           5000               0               1               0               0               0               0 
           6000               0               0               1               0               0               0 
           7000               0               0               0               0               0               1 
           9000               1               0               0               0               0               0 
          BLANK               0               0               0               1               0               2

The columns are 15 characters wide, and the BLANK value will sort out last if the data in the input is all numeric. You can replace "BLANK" with " " if you just want a space in the output. Unfortunately, I cannot say what might be happening if your columns are misaligned as it works for me in my test environment.

Hope this has been of some value.

jitendra.pat04 · August 10, 2011, 1:50am

Now it seems more accurate,,
a few issues are left again.....

I have attached for reference...my file..............

1) I have to cat my file and put filter for $49....as this field had 4 values.....1,2,3,4..........and then I have to process the above entire code for each value respectively.....
so can I use this code like......

cat scripts_test.txt | awk -F "," '{if ($49==1) print $0}' | CODE........????

2) now the field that I want in vertical is $31....and in hoz is $36................also for $31==255....we have to check wheather $13==0 or $13!=0....
if $13!=0.....then count of value for which $31==255....should be added to the count of value for $31==0......for each entry of $36....
else the value shuld be the same for $31==255..............

3) As I have to automate this script in crontab....so I cannot print in command line "test.sh 31 36 <script.test.txt"

can I directly insert my fields as ......
awk -F "," -v c1=${31:-31} -v c2=${36:-36} '....and run the script as
./test_sh.........which will read the data from the file script_test.txt......????

Please suggest.....!!!!!!!!!!!!

Lots of thanx in advance friend....!!!!

agama · August 10, 2011, 9:07pm

Glad it's working! To answer your questions first:

Yes, but you will need to redirect your file in like this:

./test_sh <script_test.txt

Yes, but you don't need cat. Awk can read the file directly so something like this:

awk -F "," '{if ($49==1) print $0}' scripts_test.txt| CODE........????

However, you can have the programme that generates the matrix make a single pass across your input data and generate all 4 (or however many different values $49 has) matrices. Given that, the script below will print one matrix for each unique value in $49. Thus, there is no need to run the preprocess script that prints only records where $49 == 1 etc.

It will take from the command line, in this order, the column to print vertically, the column to print horizontally, the column which identifies the matrix. If no parms are given, then the defaults are 31 (vert) 36 (horiz) and 49 (id column).

Finally, in the special case where c1 has the value of 255, and $13 is non-zero, the count will be updated as though the value of c1 was zero.

#!/usr/bin/env ksh

# $1 == c1 == vertical
# $2 == c2 == horizontal
# $3 == matrix id col (midc)
#
awk -F "," -v ct=$ct -v c1=${1:-31} -v c2=${2:-36} -v midc=${3:-49} '
    function sort_order(  what, l,  i, j )
    {
        for( i = 0; i < l; i++ )
        {
            big = 0;
            for( j = 1; j < l-i; j++ )
                if( what[j] > what[big] )
                    big = j;
            if( big != j-1 )
            {
                hold = what[j-1];
                what[j-1] = what[big];
                what[big] = hold;
            }
        }
    }

    {
        mid = $(midc);                      # pick up the matrix id
        if( ! mseen[mid]++ )
            order_m[moidx++] = mid;         # capture for print at end

        if( $(c2) == "" )                   # if value in c2 is missing, we dont count
            next;

        if( $(c1) == "" )
            $(c1) = "BLANK";                # easy eycatcher for empty field

        if( !seen_1[$(c1)]++ )
            order_1[o1idx++] = $(c1);       # order each c1 was observed

        if( !seen_2[$(c2)]++ )
            order_2[o2idx++] = $(c2);       # order each c2 was observed

        if( $(c1) == 255 && $13 != 0  )     # special case: c1 is 255 and $13 is non-zero, count as though c1 was 0
            $(c1) = 0;

        count[mid,$(c1),$(c2)]++;           # count the number of times the pair (c1,c2) appear together
    }

    function print_matrix( mid,     j, i )
    {
        printf( "%15s ", "COLA/COLB" );         # print the header line using the order_2 list
        for( j=0; j < o2idx; j++ )
            printf( "%15s ", order_2[j] );      # %15s will align columns based on a width of 15
        printf( "\n" );                         # new line for first row of matrix

        for( i = 0; i < o1idx; i++ )            # print matrix -- for each row (c1 value)
        {
            printf( "%15s ", order_1 );      # print the c1 value (again width of 15)
            for( j=0; j < o2idx; j++ )          # for each column (c2 values)
                printf( "%15d ", count[mid,order_1,order_2[j]] );    # print each (width 15 again)

            printf( "\n" );                     # end the row by printing a newline
        }
    }

    END {
        sort_order( order_1, length( order_1 ) );       # sort values from c1, and c2
        sort_order( order_2, length( order_2 ) );
        sort_order( order_m, length( order_m ) );

        for( i = 0; i < moidx; i++ )
        {
            printf( "\nMatrix: %s\n", order_m );     # header for the matrix
            print_matrix( order_m );
        }
    }
'

exit

Assuming you save this file in /home/userx/bin/gen_matrix.ksh and your test data is always in /usr2/data/matrix_input.csv, then you can schedule your job in cron with this command:

/home/userx/bin/gen_matrix.ksh /usr2/data/matrix_input.csv >/tmp/matrix.out

It will always write the output to the same file. If your input file will be different each day (maybe having the date), or you want your output file to be diffent each day (again with the date), then you will need to construct your input and output filenames in the script and use them on the awk command. Your cron command then just consists of the name of the script.

Examples of this:

#!/usr/bin/env ksh

date=$( date "+%Y%m%d" )   # date in yyyymmdd (mmdd sorts in order where ddmm does not)
input_file=/usr2/data/matrix_input_$date.csv   # build your file names 
output_file=/usr2/output/matrix_out.$date

# run the awk programme using your filenames on the last line
awk -F "," -v ct=$ct -v c1=${1:-31} -v c2=${2:-36} -v midc=${3:-49} '


#### body of awk programme ######


' $input_file >$output_file    # read from, write to, filenames you created

Hope this gets you further along.

jitendra.pat04 · August 11, 2011, 11:19am

Hi Agama,

Its almost done,
now the only issues left are....
1) blanks are not getting printed....
2) and some of the columns that are entirely blank ie: having the values as "all zeros" are not needed.....
3) in some cases the Blank column is also having some values for hoz fields....I want those values also to be added to the vertical column having values 255......I am pasting the O/p for your reference......

###############################

Matrix: 1
COLA/COLB 5065 5294 5672 5673 8059 8505 8508 8549 8587 8600 8632 8649 8706 8709 8725 8739 8748 8778 8798 8822 9008 9404
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
0 127 1140 1 79 84574 17436 23729 47678 9758 36855 102332 40437 34852 11229 15867 0 26227 36353 120463 1099 2504 10
1 0 0 0 0 148 32 143 80 30 128 207 89 100 16 26 0 78 116 262 4 1 0
3 0 0 0 0 53 52 9 1002 1 233 111 36 357 124 23 0 62 117 1548 2 10 0
17 0 0 0 0 3126 747 2010 3406 600 3579 9757 2633 1966 457 1526 0 3294 2582 14497 299 495 0
19 25 169 0 31 6408 1426 2931 80023 625 4306 17609 4860 3354 1083 2089 0 2818 4125 10834 120 260 0
28 0 0 0 0 0 0 0 2 0 0 0 0 0 1 0 0 0 0 0 0 0 0
34 0 3 0 0 2928 736 860 2130 230 1625 2901 927 2976 842 572 0 625 2186 4527 27 23 0
38 0 0 0 0 2 0 0 26 0 6 2 13 4 1 5 0 2 2 7 1 1 0
131 0 0 0 0 7 0 1 0 0 0 10 6 1 0 1 0 0 5 16 0 4 0
133 0 6 0 21 3367 719 1188 3778 231 1014 3209 1943 1228 492 717 0 1513 1426 4204 113 74 0
137 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
138 0 0 0 0 127 43 98 122 14 36 201 144 131 52 46 0 12 210 485 0 0 0
254 2 31 0 0 5386 928 1584 4428 621 2071 5126 2576 2324 633 1064 0 2022 2798 5221 80 116 0
255 0 4 0 0 1162 366 428 917 202 555 1408 471 782 190 332 0 396 617 1777 9 40 0

Matrix: 2
COLA/COLB 5065 5294 5672 5673 8059 8505 8508 8549 8587 8600 8632 8649 8706 8709 8725 8739 8748 8778 8798 8822 9008 9404
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 4038 0 1309 583 302 0 0 108 228 0 171 30293 208 341 0 11120 5149 0
1 0 0 0 0 275 0 0 2 10 0 0 4 4 0 24 1383 4 16 0 468 305 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 47 0 0
17 0 0 0 0 5 0 0 0 0 0 0 0 7 0 0 11 9 0 0 3 1 0
19 0 0 0 0 1828 0 24 27 31 0 0 0 23 0 13 439 20 61 0 1203 75 0
28 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 20 0 0
34 0 0 0 0 87 0 165 0 22 0 0 22 2 0 0 2813 3 23 0 360 115 0
38 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
131 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
133 0 0 0 0 825 0 43 75 27 0 0 15 85 0 105 1985 12 242 0 999 364 0
137 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 0 0 0 0 1 0
138 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
254 0 0 0 0 175 0 31 7 8 0 0 4 8 0 4 473 3 10 0 239 85 0
255 0 0 0 0 41 0 9 1 3 0 0 0 3 0 2 138 2 1 0 74 14 0

Matrix: 3
COLA/COLB 5065 5294 5672 5673 8059 8505 8508 8549 8587 8600 8632 8649 8706 8709 8725 8739 8748 8778 8798 8822 9008 9404
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 987 0 0 0 413 3492 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 0 0 0 1 132 0
28 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
34 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
38 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
131 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
133 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0
137 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 2 7 0
138 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
254 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 0 0 0 3 15 0
255 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 0 0 0 1 24 0

Matrix: 5
COLA/COLB 5065 5294 5672 5673 8059 8505 8508 8549 8587 8600 8632 8649 8706 8709 8725 8739 8748 8778 8798 8822 9008 9404
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1772 126 877 1056 256 1413 3749 544 1549 481 152 0 998 746 4304 115 97 0
1 0 0 0 0 0 1 0 1 0 1 5 0 0 2 0 0 1 1 5 0 0 0
3 0 0 0 0 0 0 0 168 0 0 0 0 79 0 0 0 0 0 30 12 0 0
17 0 0 0 0 70 46 2 40 1 1 120 12 147 47 2 0 5 66 185 2 3 0
19 0 0 0 0 71 24 33 44 17 65 122 32 35 19 26 0 42 24 186 0 3 0
28 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
34 0 0 0 0 2 0 50 1 0 0 2 0 9 6 1 0 0 0 16 0 0 0
38 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
131 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
133 0 0 0 0 107 9 39 22 4 53 112 36 98 21 6 0 37 72 133 2 2 0
137 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
138 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
254 0 0 0 0 88 36 29 34 11 37 107 28 47 11 12 0 29 32 58 2 2 0
255 0 0 0 0 100 14 40 95 7 25 153 27 48 11 4 0 28 51 181 0 3 0

Matrix: 6
COLA/COLB 5065 5294 5672 5673 8059 8505 8508 8549 8587 8600 8632 8649 8706 8709 8725 8739 8748 8778 8798 8822 9008 9404
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 210 0 0 0 0 0 0 0 0 0 0 6463 0 0 0 1025 566 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 62 0 0
28 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
34 0 0 0 0 9 0 0 0 0 0 0 0 0 0 0 466 0 0 0 40 38 0
38 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 51 0 0 0 3 5 0
131 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
133 0 0 0 0 7 0 0 0 0 0 0 0 0 0 0 245 0 0 0 120 75 0
137 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0
138 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
254 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 103 0 0 0 11 10 0
255 0 0 0 0 24 0 0 0 0 0 0 0 0 0 0 730 0 0 0 81 69 0

###################

Here the above most vertical entry is for "blank" but blank is not getting printed....also for the hoz value 8798, 1 count is in blank,,,I want this also to be added in vertical entry 255 ie: the last entry in vertical column.....
8739 in hoz array is having all entries as zero.....need to remove this.....

################################################
Also dear friend can u just give me some ideas how to learn to write such a nice and effective short length codes,,,,,as I am a newbie and do not have much ideas of C language......any links or pdf's wud be very very highly appreciated.....

Thanx in advance ....

agama · August 11, 2011, 11:22pm

Tweeks to the code to eliminate rows/columns that are all zeros, and maybe help with blanks.

#!/usr/bin/env ksh

# $1 == c1 == vertical
# $2 == c2 == horizontal
# $3 == matrix id col (midc)
#
awk -F "," -v ct=$ct -v c1=${1:-31} -v c2=${2:-36} -v midc=${3:-49} '
    function sort_order(  what, l,  i, j )
    {
        for( i = 0; i < l; i++ )
        {
            big = 0;
            for( j = 1; j < l-i; j++ )
                if( what[j] > what[big] )
                    big = j;
            if( big != j-1 )
            {
                hold = what[j-1];
                what[j-1] = what[big];
                what[big] = hold;
            }
        }

    }

    {
        gsub( " ", "", $0 );            # if there are truely blanks (, ,) and not nils (,,) then ditch them
        gsub( "\t", "", $0 );

        mid = $(midc);                      # pick up the matrix id
        if( ! mseen[mid]++ )
            order_m[moidx++] = mid;         # capture for print at end

        if( $(c2) == "" )                   # if value in c2 is missing, we dont count
            next;

        if( $(c1) == "" )
            $(c1) = "BLANK";                # easy eycatcher for empty field

        if( !seen_1[$(c1)]++ )
            order_1[o1idx++] = $(c1);       # order each c1 was observed

        if( !seen_2[$(c2)]++ )
            order_2[o2idx++] = $(c2);       # order each c2 was observed

        if( $(c1) == "255" && $13 != "0"  )     # special case: c1 is 255 and $13 is non-zero, count as though c1 was 0
            $(c1) = "0";

        count[mid,$(c1),$(c2)]++;           # count the number of times the pair (c1,c2) appear together
        row_nonz[mid,$(c1)] = 1;            # flag row as having nonzero value for this matrix
        col_nonz[mid,$(c2)] = 1;            # flag col as having nonzero value for this matrix
    }

    function print_matrix( mid,     j, i )
    {
        printf( "%15s ", "COLA/COLB" );         # print the header line using the order_2 list
        for( j=0; j < o2idx; j++ )
            if( col_nonz[mid,order_2[j]] )          # print only if in this matrix we have a nonzero in the column
                printf( "%15s ", order_2[j] );      # %15s will align columns based on a width of 15
        printf( "\n" );                         # new line for first row of matrix

        for( i = 0; i < o1idx; i++ )            # print matrix -- for each row (c1 value)
        {
            if( row_nonz[mid,order_1] )                  # only if there was a non-zero value in the row
            {
                printf( "%15s ", order_1 );      # print the c1 value (again width of 15)
                for( j=0; j < o2idx; j++ )          # for each column (c2 values)
                    if( col_nonz[mid,order_2[j]] )  # print only if in this matrix we have a nonzero in the column
                        printf( "%15d ", count[mid,order_1,order_2[j]] );    # print each (width 15 again)

                printf( "\n" );                     # end the row by printing a newline
            }
        }
    }

    END {
        sort_order( order_1, length( order_1 ) );       # sort values from c1, and c2
        sort_order( order_2, length( order_2 ) );
        sort_order( order_m, length( order_m ) );

        for( i = 0; i < moidx; i++ )
        {
            printf( "\nMatrix: %s\n", order_m );     # header for the matrix
            print_matrix( order_m );
        }
    }
'

I'm a bit confused as to how blanks should be handled. I've added a small change that might help. If it does not, could you post, or attach, 20 lines or so of your CVS input that has some rows with blanks. Right now, I am assuming that column A (31) may contain a blank value and that should be represented with a row that is labeled "BLANK". I also assume that if column B (36) has a blank, it is to be skipped.

As for learning to write code, my advice is to practice. Start with simple problems (this is certainly not simple!) and go slowly. Look at some of the posts on this site and see what code was offered to solve the problem. Initially try to understand what the programe does, and work towards reading the problem and writing the code yourself -- then comparing it with what was posted.

These are two good sites for helping learn awk:
Awk - A Tutorial and Introduction - by Bruce Barnett
Gawk: Effective AWK Programming - GNU Project - Free Software Foundation (FSF)

The second site has several different "presentations" of the same material.

jitendra.pat04 · August 11, 2011, 11:51pm

Thanx again for your post.....

Ok...I will try it again in my live environment.......and repost asap....

One more thing for each thing I have to try in my live env. sometimes which may be risky......

I have installed "ubuntu" in my windows virtual machine....but it has only bash shell default for ubuntu....

Can I have any such testbed as u r having to test all these codes at home etc...

aside to VM, its features are also not gud.....as I have to scroll everytime I had to take help from the NET....

any suggestion dear friend.....

jitendra.pat04 · August 20, 2011, 4:34am

Thanx dear friend agama....its working perfectly now.....

Also the (#) comments help me a lot in understanding your a million worthy post......

Thanx again....

GR88...

agama · August 20, 2011, 1:14pm

Agree, testing in your live environment is risky, but sometimes unavoidable.

I don't install windows on my computers, just some form of UNIX. I currently have FreeBSD running on a 10 year old laptop, and Linux running on my newer laptop; between these two it gives me the ability to try most things under a couple of environments. If you can find an older (not too old though) PC that someone will give you, or isn't too expensive, then I'd suggest installing Linux directly on it. You'll then have a system that you can experiment with and not have any worry about messing something up.

As for only having bash in your VM.... most scripts will run under both Korn shell and bash. I prefer Korn shell because it has several capabilities that bash doesn't. The script in this thread, while marked ksh in the #! statement, would run just fine under bash as it's all awk.

Glad you've gotten it working!