HELP with Unix scripts in summing columns in a file

  1. The problem statement, all variables and given/known data:
    Hi guys, I'm a new guy here, and it's my first time creating a Unix script.
    Can you guys help me out? I'd really appreciate it. :frowning:

Here's my problem:

This is the file I'm using. It has 6 columns; the first three are the key columns, and I put in one header row. I need to sum the values of the 4th, 5th and 6th columns, grouping the rows by date, prod id and store id.

1st file:

date		prod id	store id	amt 1	      amt 2	    amt 3
20111017	45671	11801	-12	      -12348	    1080
20111017	45671	11801	12	      12348	    1030
20111018	45672	11802	7	      7868	    990
20111018	45672	11802	4	      4496	    950
20111019	45673	11803	-12	      -12348	    910
20111019	45674	11804	12	      12348	    866

Then I have this second file, which contains a list of "prod id"s that should be ignored in the first file.

2nd file:

prod id
45671

The third file consists of a list of "store id's" that should be ignored from the 1st file.

3rd file:

store id
11803

So basically, I need to come up with a script that sums the 4th, 5th and 6th columns of the first file while taking into consideration the ids listed in the 2nd and 3rd files, so that rows matching those ids are excluded from the sums.

I would really appreciate a lot of love and help and mercy from you guys.

Thanks a lot! :slight_smile:

  1. Relevant commands, code, scripts, algorithms:
    UNIX scripts, awk, printf, $ and all that stuff

  2. The attempts at a solution (include all code and scripts):
    I tried tweaking some scripts I found on the net, but I'm basically lost.

  3. Complete Name of School (University), City (State), Country, Name of Professor, and Course Number (Link to Course):
    I'm a computer engineering student from the Philippines, a noob at that; my professor is Engr. Mendoza.

Note: Without school/professor/course information, you will be banned if you post here! You must complete the entire template (not just parts of it).


We can't just hand you an answer. We can help you build something but can't do the whole thing.

You can sum in awk like this:

awk '{
        for(N=4; N<=NF; N++) A[$1","$2","$3","N]+=$N   # running sum per key and column
        B[$1","$2","$3]++                              # remember every key seen
}
END {
        for(X in B)
        for(N=4; N<=6; N++)                            # your data has columns 4 through 6
                print X, A[X","N]
}' datafile

If you make a BEGIN section, you can load lines from your other files with getline <"filename" statements to load the variables you want to skip later.
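As a sketch of that idea (the file names data.txt and prods_to_skip.txt are made up for illustration): load the skip list in the BEGIN block, before any data rows are read. The printf lines just build tiny sample inputs so the sketch runs standalone.

```shell
# Sketch only -- file names are assumptions, not the thread's real files.
printf 'prod id\n45671\n' > prods_to_skip.txt
printf 'date prod store a1 a2 a3
20111017 45671 11801 -12 -12348 1080
20111018 45672 11802 7 7868 990
' > data.txt

awk '
BEGIN {
    getline < "prods_to_skip.txt"                   # throw away the header row
    while ((getline id < "prods_to_skip.txt") > 0)
        skip[id] = 1                                # remember each prod id to ignore
    close("prods_to_skip.txt")
}
NR == 1    { next }       # skip the data file header
$2 in skip { next }       # drop rows whose prod id is on the list
{ print }
' data.txt
```

Only the 45672 row survives here, since 45671 is on the skip list.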


I have this script that basically sums two columns in a file, but it does not group the data and does not skip records that should be ignored, so I wanted to add those functions to improve it.

it goes like this:

#!/bin/sh
# Usage: sumcols.sh <filename> <delimiter> <first_col> <second_col>
FILENAME=$1
DELIMITER=$2
FIRST_COL=$3
SECOND_COL=$4

for FILE in ${FILENAME}   # unquoted on purpose so a wildcard pattern expands
do
  echo "${FILE}"

  #########################################
  # GET SUMS OF THE 2 COLUMNS
  #########################################
  awk -v col="$FIRST_COL" -v col2="$SECOND_COL" -F "$DELIMITER" '
  NR != 1 {   # drop this condition if you want the first row included
      col1Sum += $col
      col2Sum += $col2
  }
  END {
      printf("\t%s\t%s\t%s\n", "", "Col1", "Col2")
      printf("\t%s\t%d\t%d\n", "Sums:", col1Sum, col2Sum)
      printf("\n")
  }' "${FILE}"

  #########################################
  # GET COLUMNS
  #########################################
  awk -F "${DELIMITER}" '{print "\tNo of Columns: " NF}' "${FILE}" | sort -u

  #########################################
  # GET ROWS
  #########################################
  ROWS=$(( $(wc -l < "${FILE}") - 1 ))
  printf "\tNo of Rows: %s\n" "${ROWS}"

done

You can use arrays to group the data. Instead of X += Y, you can do X[index] += Y, where 'index' can be one column or several columns, like X[$1","$2","$3] += $4

Then you can loop over the array like for(Z in X) to get every array index in X, and access the contents like X[Z]
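A minimal sketch of that (the sample rows and the file name grouped.txt are just for illustration): sum column 4 into one array slot per compound key, then walk the array in END.

```shell
# Sketch: one running total per (date, prod id, store id) key.
printf '20111017 45671 11801 -12
20111017 45671 11801 12
20111018 45672 11802 7
' > grouped.txt

awk '
{ X[$1 "," $2 "," $3] += $4 }     # group by the three key columns
END {
    for (Z in X)                  # visit every key seen (order is unspecified)
        print Z, X[Z]
}
' grouped.txt
```

Note that `for (Z in X)` visits keys in no particular order; pipe the output through sort if you need it ordered.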


You can also make a BEGIN { } section that works sort of like the END { } section but opposite, and use it to load data from other files before the rest of the program starts processing lines.

You can load lines from those files like while(getline<"filename" >0) { ... } and lines will appear in the $1 $2 ... variables like you'd expect.
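For example (the file name other.txt is invented), each plain `getline < "file"` re-splits the line read into $1, $2 and so on:

```shell
# Sketch: getline with no target variable fills $0 and the field variables.
printf 'prod 45671\nprod 45672\n' > other.txt

awk '
BEGIN {
    while ((getline < "other.txt") > 0)   # each read re-splits into $1, $2, ...
        print "loaded:", $2
    close("other.txt")
}'
```

The parentheses around `getline < "other.txt"` matter: without them, the `> 0` would be parsed as output redirection.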

Hmm okay, how about the scenario where I want some rows to be ignored when they are summed?

How can I extend the script so that it also reads the 2nd and 3rd files, compares their values with the 1st file, and excludes any matches from the summation?

Load the data from those files in the BEGIN section and put the values in an array.

Then, when reading lines of data, check whether the array entry for that value is set, like if(SKIP[$col]) { next }, where 'next' causes the rest of that line's processing to be skipped.
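In sketch form (the hardcoded id and the file name rows.txt stand in for values you would really load from the 2nd file):

```shell
# Sketch: rows whose prod id is marked in SKIP never reach the summing code.
printf '20111017 45671 11801 -12
20111018 45672 11802 7
' > rows.txt

awk '
BEGIN { SKIP["45671"] = 1 }   # normally filled from the 2nd file in BEGIN
SKIP[$2] { next }             # listed prod id: skip this line entirely
{ print "summing:", $0 }
' rows.txt
```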


Okay, I'm taking all your suggestions into consideration, because I basically don't know what's the best thing to do here.

In that BEGIN section, is it okay if the file I load there comes from another directory?

Also, there's a possibility that there's no 2nd or 3rd file, so that BEGIN section should be in an if statement, right?

You can't put a BEGIN block inside an if statement. Like END, it has to stand on its own. You can put an if statement inside a BEGIN block, though.

Yes, you can specify paths to getline. while(getline<"/path/to/filename" > 0) { }

If the files don't exist, the loops will just be skipped...


Oh yeah, got it, forgot about that, so I'll just put an if statement in the BEGIN section to check if there's a 2nd or 3rd file.

I'm actually getting the hang of it now; that getline suggestion is really cool.

But let's say there's a 2nd and 3rd file, and their data got loaded before doing the processing/summation. How do I exclude/skip those lines so that they won't be included in the summation?

Like, what should I put in the code to exclude those from being summed?

PS: I'm really grateful for your replies; it feels like I'm getting the hang of how to proceed. Thanks a lot.

The getline loop won't run if the file doesn't exist. An if might not be necessary.

Put them in the array, too. If you somehow arrange for B["12345"]=1 to be set before the real data gets checked, you can check if B["12345"] is true and skip it if it is.
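Putting all the pieces together, one possible sketch (all file names here are assumptions; file2.txt and file3.txt each have a one-line header, as in the thread). The printf lines just recreate the thread's sample data so the sketch runs standalone.

```shell
printf 'date prod store a1 a2 a3
20111017 45671 11801 -12 -12348 1080
20111017 45671 11801 12 12348 1030
20111018 45672 11802 7 7868 990
20111018 45672 11802 4 4496 950
20111019 45673 11803 -12 -12348 910
20111019 45674 11804 12 12348 866
' > file1.txt
printf 'prod id\n45671\n' > file2.txt
printf 'store id\n11803\n' > file3.txt

awk '
BEGIN {
    getline < "file2.txt"                              # discard header
    while ((getline v < "file2.txt") > 0) badprod[v] = 1
    close("file2.txt")
    getline < "file3.txt"
    while ((getline v < "file3.txt") > 0) badstore[v] = 1
    close("file3.txt")
}
NR == 1        { next }   # data file header
$2 in badprod  { next }   # prod id on the ignore list
$3 in badstore { next }   # store id on the ignore list
{
    k = $1 "," $2 "," $3
    s4[k] += $4; s5[k] += $5; s6[k] += $6
}
END {
    for (k in s4) print k, s4[k], s5[k], s6[k]
}
' file1.txt
```

With the sample data, only the 45672 and 45674 groups survive: the 45671 rows cancel off the prod list and the 11803 row drops off the store list.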


Hmm okay, thanks a lot for your help! It's getting pretty late here already, it's 1 in the morning and I'm already sleepy. I'll add these to my script and maybe show it to you tomorrow if I bump into problems again :frowning:

But thanks a lot! You are a great help to everyone on this forum! :slight_smile:

I think you already got the best advice possible in terms of implementation, so I will concentrate on the intrinsics of the problem:

Basically you have to parse a file, do a group change, and reduce your input set by a set of key values. All of these procedures/algorithms are basic tasks which you will have to handle daily in one form or another in your future job. This is probably why your professor designed the task this way.

Here is what a group change is and how it is carried out:

You have a file/table/data structure with basically two columns: a "key" column and a "value" column. You sort that file by the key, then read it line by line, and every time the key value changes you have to carry out some action.

For instance: every customer of a business, instead of paying, has a record written in a book (= file): his name and the value of the goods he takes. At the end it looks like:

customerA  50,-
customerA  25,-
customerB  70,-
customerA  15,-
customerC  60,-
customerB  25,-
...

The task is to sum up the total for every customer and print a list with the totals. This is how it is done:

First sort the file by the key value, which in this case is the first field (the customer name):

customerA  50,-
customerA  25,-
customerA  15,-
customerB  70,-
customerB  25,-
customerC  60,-
...

Now read that file line by line. Store the last value of your key in a buffer. Optionally you might have a default action for every line, which would be carried out here - in this case summing up the value. Whenever the key changes, the comparison of the new key with the last key in the buffer will fail. Then you have to carry out your action (in this case print the total for the key that just ended) and start over with the new key stored in the buffer. (Don't forget that at the end of the file you still have to end-process the last key value.)

Here is it in pseudo-code:

LAST := ""
while (lines to process)
     KEY := key value of the line
     VAL := value of the line
     if ( KEY differs from LAST and LAST is not empty )
          group_end_process( LAST [, ..... ] )
     endif
     LAST := KEY
     line_process( KEY, VAL )
     next line
end while
group_end_process( LAST [, ..... ] )
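The same logic as a small awk sketch over the sorted customer file (names and the file name sorted.txt are invented for the example):

```shell
# Sketch: single group change over input that is already sorted by key.
printf 'customerA 50
customerA 25
customerA 15
customerB 70
customerB 25
customerC 60
' > sorted.txt

awk '
$1 != last && NR > 1 {        # key changed: end-process the previous group
    print last, total
    total = 0
}
{ total += $2; last = $1 }    # default per-line action: sum the value
END { print last, total }     # do not forget the final group
' sorted.txt
```

This prints one total per customer: 90 for customerA, 95 for customerB, 60 for customerC.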

This is called a "single group change". There are also multiple group changes: suppose the customers want a detailed statement, where the goods are sorted into groups, with a subtotal for every group of goods and a grand total for the customer (a dual group change). It will need a second key field, and I am sure you could already write the pseudo-code for this one yourself.

I hope this helps in understanding.

(Corollary: read the man page of "awk" carefully, especially the part about how awk works and how awk scripts are formed. Does that ring a bell?)

bakunin