MATRIX to CSV

kraterions · June 18, 2013, 12:17pm

Hello friends,

A big question for the UNIX INTELLIGENCE

I have a CSV file as follows:

VALUE,USER1,relatedUSER1,relatedUSER2
-1,userA,userB,userC
1,userN,userD,userB
0,userF,userH,userG
0,userT,userH,userB
1,userN,userB,userA
-1,userA,userF,userC
0,userF,userH,userG
1,userB,userJ,userU
-1,userJ,userD,userA

There are a lot of user and their frequences and positions are random in every field!

I need as a output file a CSV as follows:

SUPERUSER,relatedUSER,TOTFrequencyrelations,(1)Frequency,(0)Frequency,(-1)Frequency,VALUE

Where:
SUPERUSER= most frequent user (in column USER1) in terms of relations to other users (descendent order)
relatedUSER= most frequent user in terms of relations to USER1
TOTFrequencyrelations = total number of relations between SUPERUSER and relatedUSER
(1)Frequency = amount of positive relations
(0)Frequency = amount of neutral relations
(-1)Frequency = amount of negative relations
Value = (1)Frequency-(-1)Frequency/[(1)Frequency+(-1)Frequency)]

Example:

SUPERUSER,relatedUSER,TOTFrequencyrelations,(1)Frequency,(0)Frequency,(-1)Frequency,VALUE
userG,userB,15,6,7,1,0....
userG,userF,10,2,4,4,0...
userG,userD,5,2,2,1,0...
userJ,userH,15,6,6,3,0...
userJ,userU,8,4,1,1,0....
userJ,userB,6,2,2,1,0....
userF,userA,12,4,5,3,0.....
userF,userH,12,5,2,5,0.....

Many thanks for your big help.

Have a nice time!!

DGPickett · June 19, 2013, 3:02pm

The first thing that occurs to me is that the input is a columns-should-be-rows flavor, so turn it into proper tuples by making a row/line/tuple for each related. Then you can deal with it like a SQL RDBMS table. You can then cut, sort and uniq -c or while read the various columns to get ranks and statistics. For instance, assuming there are always 2 related and they are peers, here is a partial solution (forgot VALUE, discarded TOTFrequencyrelations f2 -- see if you can fix it):

(
  cut -d, -f 2,3,1 in_file
  cut -d, -f 2,4,1 in_file
 ) | (
  sort
  echo ZZZEND,ZZZEND
 ) | (
  IFS=${IFS}, lrusr=
  while read usr rusr pnz
   do
 
    if [ "$lusr" = "$usr" ]
     then
 
      (( uct++ ))
 
      if [ "$lrusr" = "$rusr" ]
       then
 
        (( ruct++ ))
 
       fi
     fi
 
    if [ "lusr-$lrusr" = "$usr-$rusr" ]
     then
 
      case "$pnz" in
       (-1)
         (( pnzn++ ))
        ;;
       (0)
         (( pnzz++ ))
        ;;
       (*)
         (( pnzp++ ))
        ;;
       esac
 
      continue
     fi
 
    if [ "$lrusr" != "" ]
     then
 
      echo "$uct,$ruct,$usr,$rusr,$pnzp,$pnzz,$pnn"
 
     fi
 
    if [ "$lrusr" = "ZZZEND" ]
     then
 
      break
 
     fi
 
    uct=1 ruct=1 pnzz=0 pnzp=0 pnzn=0 lusr=$usr lrusr=$rusr
   done
 ) | sort -nr -t, | cut -d, -f 3-7

This cuts out fields for related1 and then for related2 into a common stream and feeds it to a sort, creating a sorted tuple file on the stream. The while read loop counts the user and user+related counts for sort ordering, and the +1 p, -1 n and 0 z, spitting out counts when the key changes. A dummy input trailer makes the loop spit out the last set of values. Once sorted numerically, the counts for sorting are discarded.

Don_Cragun · June 19, 2013, 3:31pm

I am having great difficulty trying to figure out what you want. I don't see any way to convert the data you said is in your CSV file into the example output you supplied using the descriptions of the output fields you provided (even ignoring the fact that the mathematical calculation you specified will NEVER produce the ... , .... , or ..... you show in the last field of your expected output). The fields you have labeled (xxx)Frequency appear to be counts in your output rather than frequencies; but neither term matches the data shown in your example.

Please provide sample input, desired output for that sample input, and a description of the processing you want to use to convert the input into the output that is consistent with the sample data.

kraterions · June 19, 2013, 6:52pm

Hello,

many thanks for your attention,

i will try to explain what i need with a real case:

Example input:

VALUE,USER1,relatedUSER1,relatedUSER2
-1,userA,userB,userC
1,userN,userD,userB
0,userF,userH,userG
0,userT,userH,userB
1,userN,userB,userA
-1,userA,userF,userC
0,userF,userH,userU
1,userB,userJ,userU
-1,userJ,userD,userA
-1,userA,userB,userC
-1,userN,userD,userB
0,userN,userH,userG

Example output:

SUPERUSER,relatedUSER,TOTFrequencyrelations,(1)Frequency,(0)Frequency,(-1)Frequency,VALUE
userN,userB,3,2,0,1,0.3
userN,userD,2,2,0,0,1
userN,userA,1,1,0,0,1
userN,userH,1,0,1,0,0
userN,userG,1,0,1,0,0
userA,userC,3,0,0,3,-1
userA,userB,2,0,0,2,-1
userA,userF,1,0,0,1,-1
userF,userH,2,0,2,0,0
userF,userU,1,0,1,0,0
userT,userH,1,0,1,0,0
userT,userB,1,0,1,0,0
userJ,userD,1,0,0,1,-1
userJ,userA,1,0,0,1,-1

Example Values for userN and userB = 3,2,0,1,0.3
3 = number of total relations
2 = number of total positive relations
0 = number of total neutral relations
1 = number of total negative relations
0.3 = value obtain from this formula 2-1/(2+1)=0.3

I hope that this simple example will be usefull!

Thanks again!

Yoda · June 19, 2013, 8:41pm

Not sure if this can be done much simpler, anyway here is an awk approach:

awk -F, '
        BEGIN {
                print "SUPERUSER,relatedUSER,TOTFrequencyrelations,(1)Frequency,(0)Frequency,(-1)Frequency,VALUE"
                OFMT = "%.1f"
        }
        NR > 1 {
                U[$2]
                U[$3]
                U[$4]
                R[$2","$3]++
                R[$2","$4]++
                R[$3","$2]++
                R[$3","$4]++
                R[$4","$2]++
                R[$4","$3]++
                if ( $1 > 0 )
                {
                        P[$2","$3]++
                        P[$2","$4]++
                        P[$3","$2]++
                        P[$3","$4]++
                        P[$4","$2]++
                        P[$4","$3]++
                }
                if ( $1 == 0 )
                {
                        N[$2","$3]++
                        N[$2","$4]++
                        N[$3","$2]++
                        N[$3","$4]++
                        N[$4","$2]++
                        N[$4","$3]++
                }
                if ( $1 < 0 )
                {
                        L[$2","$3]++
                        L[$2","$4]++
                        L[$3","$2]++
                        L[$3","$4]++
                        L[$4","$2]++
                        L[$4","$3]++
                }
        }
        END {
                for ( k in U )
                {
                        for ( l in R )
                        {
                                split ( l, V )
                                if ( V[1] == k )
                                {
                                        p = P[l] ? P[l] : 0
                                        n = N[l] ? N[l] : 0
                                        z = L[l] ? L[l] : 0
                                        print V[1], V[2], R[l], p, n, z, ( ( p + z ) > 0 ? ( p - z ) / ( p + z ) : 0 )
                                }
                        }
                }
        }
' OFS=, file

Don_Cragun · June 19, 2013, 9:17pm

In normal arithmetic, multiplication and division have higher precedence than addition and subtraction. So the formula 2-1/(2+1) (with the results rounded to one decimal place would be 1.7 ; not 0.3 . Did you intend for the formula (using alternative variable names that don't contain parentheses) to be:

Value = (Plus1Count-Minus1Count)/(Plus1Count+Minus1Count)
        rather than
Value = Plus1Count-Minus1Count/(Plus1Count+Minus1Count)

or should the result be 1.7 ?

DGPickett · June 20, 2013, 9:28am

I like parens even if not needed due to priority, as so few seem to remember priority rules.

Yoda takes user, rel user 1 and rel user 2 to be peers, and does relationships trilaterally, bidirectionally. Does that fit this model?

Is zero the right answer for the divide by zero case? Maybe zero cases should be in the denominator, preventing that?

durden_tyler · June 20, 2013, 10:56am

$
$ cat input
VALUE,USER1,relatedUSER1,relatedUSER2
-1,userA,userB,userC
1,userN,userD,userB
0,userF,userH,userG
0,userT,userH,userB
1,userN,userB,userA
-1,userA,userF,userC
0,userF,userH,userU
1,userB,userJ,userU
-1,userJ,userD,userA
-1,userA,userB,userC
-1,userN,userD,userB
0,userN,userH,userG
$
$
$ perl -lne 'next if $. == 1;
             ($n, $u, $x, $y) = split/,/;
             $mtx{"$u,$x"}->[0]++;
             $mtx{"$u,$y"}->[0]++;
             if ($n > 0)    { $mtx{"$u,$x"}->[1]++; $mtx{"$u,$y"}->[1]++ }
             elsif ($n < 0) { $mtx{"$u,$x"}->[3]++; $mtx{"$u,$y"}->[3]++ }
             else           { $mtx{"$u,$x"}->[2]++; $mtx{"$u,$y"}->[2]++ }
             END {
               print "SUPERUSER,relatedUSER,TOTFrequencyrelations,(1)Frequency,(0)Frequency,(-1)Frequency,VALUE";
               for $k (sort keys %mtx) {
                 $pos = $mtx{$k}->[1] // 0;
                 $zro = $mtx{$k}->[2] // 0;
                 $ngt = $mtx{$k}->[3] // 0;
                 $val = ($pos + $ngt) == 0 ? 0 : ($pos - $ngt)/($pos + $ngt);
                 print "$k,$mtx{$k}->[0],$pos,$zro,$ngt,$val";
               }
             }
            ' input
SUPERUSER,relatedUSER,TOTFrequencyrelations,(1)Frequency,(0)Frequency,(-1)Frequency,VALUE
userA,userB,2,0,0,2,-1
userA,userC,3,0,0,3,-1
userA,userF,1,0,0,1,-1
userB,userJ,1,1,0,0,1
userB,userU,1,1,0,0,1
userF,userG,1,0,1,0,0
userF,userH,2,0,2,0,0
userF,userU,1,0,1,0,0
userJ,userA,1,0,0,1,-1
userJ,userD,1,0,0,1,-1
userN,userA,1,1,0,0,1
userN,userB,3,2,0,1,0.333333333333333
userN,userD,2,1,0,1,0
userN,userG,1,0,1,0,0
userN,userH,1,0,1,0,0
userT,userB,1,0,1,0,0
userT,userH,1,0,1,0,0
$
$

kraterions · June 21, 2013, 3:37am

Hello guys,

here the output of yoda code:

SUPERUSER,relatedUSER,TOTFrequencyrelations,(1)Frequency,(0)Frequency,(-1)Frequency,VALUE
userT,userH,1,0,1,0,0
userT,userB,1,0,1,0,0
userU,userB,1,1,0,0,1
userU,userF,1,0,1,0,0
userU,userH,1,0,1,0,0
userU,userJ,1,1,0,0,1
userA,userJ,1,0,0,1,-1
userA,userN,1,1,0,0,1
userA,userB,3,1,0,2,-0,3
userA,userC,3,0,0,3,-1
userA,userD,1,0,0,1,-1
userA,userF,1,0,0,1,-1
userB,userA,3,1,0,2,-0,3
userB,userC,2,0,0,2,-1
userB,userD,2,1,0,1,0
userB,userH,1,0,1,0,0
userB,userJ,1,1,0,0,1
userB,userN,3,2,0,1,0,3
userB,userT,1,0,1,0,0
userB,userU,1,1,0,0,1
userC,userA,3,0,0,3,-1
userC,userB,2,0,0,2,-1
userC,userF,1,0,0,1,-1
userD,userA,1,0,0,1,-1
userD,userB,2,1,0,1,0
userD,userJ,1,0,0,1,-1
userD,userN,2,1,0,1,0
userF,userA,1,0,0,1,-1
userF,userC,1,0,0,1,-1
userF,userG,1,0,1,0,0
userF,userH,2,0,2,0,0
userF,userU,1,0,1,0,0
userG,userF,1,0,1,0,0
userG,userH,2,0,2,0,0
userG,userN,1,0,1,0,0
userH,userB,1,0,1,0,0
userH,userF,2,0,2,0,0
userH,userG,2,0,2,0,0
userH,userN,1,0,1,0,0
userH,userT,1,0,1,0,0
userH,userU,1,0,1,0,0
userJ,userA,1,0,0,1,-1
userJ,userB,1,1,0,0,1
userJ,userD,1,0,0,1,-1
userJ,userU,1,1,0,0,1
userN,userA,1,1,0,0,1
userN,userB,3,2,0,1,0,3
userN,userD,2,1,0,1,0
userN,userG,1,0,1,0,0
userN,userH,1,0,1,0,0

DGPickett, you are right, doesn�t fit the model

many thanks for you attention

---------- Post updated at 02:37 AM ---------- Previous update was at 02:28 AM ----------

Hi durden_tyler,

here the output of your code:

String found where operator expected at script.pl line 19, near "'"
  (Might be a runaway multi-line '' string starting on line 2)
	(Missing semicolon on previous line?)
syntax error at script.pl line 19, near "'"
Execution of script.pl aborted due to compilation errors.

many thanks for your help!

durden_tyler · June 21, 2013, 7:22am

kraterions:

...
Hi durden_tyler,

here the output of your code:

String found where operator expected at script.pl line 19, near "'"
  (Might be a runaway multi-line '' string starting on line 2)
   (Missing semicolon on previous line?)
syntax error at script.pl line 19, near "'"
Execution of script.pl aborted due to compilation errors.

many thanks for your help!

Hi kraterions,

My post does not involve any file called "script.pl". But yours does.
I ran a Perl one-liner on the shell prompt.
But you put my Perl one-liner inside a Perl script called "script.pl" and tried to execute that.
So, essentially you are trying to call the perl interpreter from within a Perl program.
Obviously, it doesn't work that way.

Thanks,
tyler_durden

kraterions · June 21, 2013, 3:59pm

Hi durden_tyler,

ok perfect, your script works fine,

do you know how to use it as a script?

perl script.pl inputfile.csv > outputfile.csv

Thanks for your help!

DGPickett · June 21, 2013, 5:04pm

If you trim out some of the lines in Yoda's code, it should be fine. What was the answer on divide by zero for all zero interactions?

kraterions · June 21, 2013, 5:46pm

Hi DGPickett,

yoda's coda doesn�t works fine as you can see in the output sample.

I guess that durden_tyler's code fits my needs.

Thanks for your attention.

durden_tyler · June 22, 2013, 8:18am

$
$ cat inputfile.csv
VALUE,USER1,relatedUSER1,relatedUSER2
-1,userA,userB,userC
1,userN,userD,userB
0,userF,userH,userG
0,userT,userH,userB
1,userN,userB,userA
-1,userA,userF,userC
0,userF,userH,userU
1,userB,userJ,userU
-1,userJ,userD,userA
-1,userA,userB,userC
-1,userN,userD,userB
0,userN,userH,userG
$
$ cat -n script.pl
     1  #!/usr/bin/perl -w
     2  # ----------------------------------------------------------------------------------------------------------
     3  # Note that the hash %mtx has keys that look like this: "X,Y", where X is USER1 in the input file and Y
     4  # is either relatedUSER1 or relatedUSER2. Each key value is a pointer to an array that looks like this:
     5  # [ n1, n2, n3, n4 ]
     6  # Here, n1 = relationship count, n2 = +ve relation count, n3 = neutral relation count and
     7  # n4 = -ve relation count
     8  # ----------------------------------------------------------------------------------------------------------
     9  use strict;
    10  my %mtx;                                                          # a hash for storing the matrix information
    11  # the header line
    12  my $header = "SUPERUSER,relatedUSER,TOTFrequencyrelations,(1)Frequency,(0)Frequency,(-1)Frequency,VALUE";
    13  my $file = $ARGV[0];                                              # first command-line arg is the file name
    14  open(FH, "<", $file) or die "Can't open $file: $!";               # try opening it, die if unsuccessful
    15  while (<FH>) {                                                    # loop through file
    16    next if $. == 1;                                                # skip the first line
    17    chomp;                                                          # remove the EOL character
    18    my ($n, $u, $x, $y) = split/,/;                                 # tokenize the line; assign local variables
    19    $mtx{"$u,$x"}->[0]++;                                           # tokens (2,3) and (2,4) are the keys
    20    $mtx{"$u,$y"}->[0]++;                                           # increment their relation counts
    21    if ($n > 0)    { $mtx{"$u,$x"}->[1]++; $mtx{"$u,$y"}->[1]++ }   # increment the counts of +ve/0/-ve relations
    22    elsif ($n < 0) { $mtx{"$u,$x"}->[3]++; $mtx{"$u,$y"}->[3]++ }   # To get a better idea of what the hash %mtx
    23    else           { $mtx{"$u,$x"}->[2]++; $mtx{"$u,$y"}->[2]++ }   # looks like, try Data::Dumper after file close
    24  }                                                                 # done looping
    25  close (FH) or die "Can't close $file: $!";                        # close the file or die if unsuccessful
    26  print $header, "\n";                                              # start with printing the header first
    27  for my $k (sort keys %mtx) {                                      # sort the hash keys and loop through them
    28    my $pos = $mtx{$k}->[1] // 0;                                   # if a relation count is undefined
    29    my $zro = $mtx{$k}->[2] // 0;                                   # or non-existent
    30    my $ngt = $mtx{$k}->[3] // 0;                                   # set it to 0
    31    my $val = ($pos + $ngt) == 0 ? 0 : ($pos - $ngt)/($pos + $ngt); # derive the value at the farthest position
    32    print "$k,$mtx{$k}->[0],$pos,$zro,$ngt,$val\n";                 # and print everything
    33  }                                                                 # we're done
$
$ perl script.pl inputfile.csv
SUPERUSER,relatedUSER,TOTFrequencyrelations,(1)Frequency,(0)Frequency,(-1)Frequency,VALUE
userA,userB,2,0,0,2,-1
userA,userC,3,0,0,3,-1
userA,userF,1,0,0,1,-1
userB,userJ,1,1,0,0,1
userB,userU,1,1,0,0,1
userF,userG,1,0,1,0,0
userF,userH,2,0,2,0,0
userF,userU,1,0,1,0,0
userJ,userA,1,0,0,1,-1
userJ,userD,1,0,0,1,-1
userN,userA,1,1,0,0,1
userN,userB,3,2,0,1,0.333333333333333
userN,userD,2,1,0,1,0
userN,userG,1,0,1,0,0
userN,userH,1,0,1,0,0
userT,userB,1,0,1,0,0
userT,userH,1,0,1,0,0
$
$

Yoda · June 22, 2013, 11:29am

My code is not giving desired output because I assumed the relationship is trilateral as DGPickett mentioned in his previous post.

So you have to remove few lines from my code to fit your need. I hope this helps.

kraterions · June 23, 2013, 4:52am

HI durden_tyler's

amzing, i will try it as soon as possible..

have a nice day