Text processing of file

ajayram · November 25, 2011, 12:08pm

I have a text file containing a dataset, and I need to convert it to CSV format.
The file looks like this:
First line:

-1 3:1 11:1 14:1 19:1 39:1 42:1 55:1 64:1 67:1 73:1 75:1 76:1 80:1 83:1

Second line:

+1 5:1 11:1 15:1 32:1 39:1 40:1 52:1 63:1 67:1 73:1 74:1 76:1 78:1 83:1

There are 123 columns in total, of which only the ones with value 1 are shown here. The remaining columns are 0s.

So I would like a CSV file in the following format, with the -1 at the beginning of the row replaced with 0 and +1 replaced with 1:

0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 .... 
1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 .....

Can anyone help me out?

CarloM · November 25, 2011, 12:23pm

awk '{
   if ($1=="1") {
      printf "1"
   } else {
      printf "0"
   }
   prev=1;
   for (i=2;i<=NF;i++) {
      split ($i, colval, ":");
      for (j=prev+1; j<colval[1]; j++) {
         printf ",0"
      }
      printf ",%s", colval[2];
      prev=$i
   }
   for (j=prev+1; j<123; j++) {
         printf ",0"
   }
   printf "\n";
}' inputfile

ajayram · November 26, 2011, 8:21am

Hello,

I am getting an uneven number of values per row in the CSV file.
There are 123 features, so I would expect 123 values of 0 or 1 per row,
plus the class label (-1 or +1), also converted to 0 or 1.

Below are two sample lines of the input file and the corresponding output.

---------------------------------------------------------------------------------------------------
Input file (two lines):

-1 3:1 11:1 14:1 19:1 39:1 42:1 55:1 64:1 67:1 73:1 75:1 76:1 80:1 83:1 
-1 1:1 6:1 14:1 22:1 36:1 42:1 49:1 64:1 67:1 72:1 74:1 77:1 80:1 83:1

The current output for these two lines is:

0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,1,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,1,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

As you may notice, the rows contain an uneven number of values!
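
One way to confirm this kind of problem is to count the comma-separated fields in every row. The following is only a minimal sketch: the function name field_counts is made up for illustration, and the two short rows fed to it are placeholder data, not the real output. With 123 features plus the class label, a correct file should report a single count of 124.

```shell
# Hypothetical helper: report each distinct field count and how many rows have it.
# Reads CSV rows from stdin, or from any files given as arguments.
field_counts() {
  awk -F, '
    { count[NF]++ }
    END { for (n in count) print n " fields: " count[n] " row(s)" }
  ' "$@"
}

# Placeholder rows of different lengths, just to show the report format.
printf '%s\n' '0,1,0' '0,1,0,0' | field_counts | sort -n
```

If more than one line is reported, the rows are not all the same length.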

ahamed101 · November 26, 2011, 9:11am

Try this...

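awk '{
        # Map the class label: -1 becomes 0, +1 becomes 1
        if (!sub(/^-1/, "0")) sub(/^[+]1/, "1")
        printf "%s", $1
        last = 0
        for (i = 2; i <= NF; i++) {
                split($i, arr, ":")     # arr[1] = column index, arr[2] = value
                # Fill the columns skipped since the previous entry with zeros
                for (j = last + 1; j < arr[1]; j++) printf " 0"
                printf " %s", arr[2]
                last = arr[1]
        }
        # Pad with zeros up to column 123
        for (i = last + 1; i <= 123; i++) printf " 0"
        printf "\n"
}' input_file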

--ahamed

danmero · November 26, 2011, 11:02am

awk '{x=($1<0)?0:(($1>0)?1:$1);split($0,a);for(i=1;++i<=NF;){split(a[i],b,":");$(b[1]+1)=b[2]};for(i=1;++i<=124;){$i=$i==1?1:0};$1=x}1' OFS="," file

CarloM · November 28, 2011, 5:34am

There were a couple of errors (or 4) in my previous solution. :o

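awk '{
   # Class label: "+1" becomes 1, anything else (i.e. "-1") becomes 0
   if ($1=="+1") {
      printf "1"
   } else {
      printf "0"
   }
   prev=0;   # index of the last column written; columns are numbered from 1
   for (i=2;i<=NF;i++) {
      split ($i, colval, ":");   # colval[1] = column index, colval[2] = value
      # Write zeros for the columns skipped since the previous entry
      for (j=prev+1; j<colval[1]; j++) {
         printf ",0"
      }
      printf ",%s", colval[2];
      prev=colval[1];
   }
   # Pad with zeros up to and including column 123
   for (j=prev+1; j<=123; j++) {
      printf ",0"
   }
   printf "\n";
}' inputfile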
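
As a rough sanity check, the corrected program above can be run on the first sample line from the original question and compared against the expected row given there. This is a sketch only: the function name to_csv is invented for illustration, and it reads from stdin rather than from a named input file.

```shell
# Hypothetical wrapper around the corrected awk program; reads sparse rows on stdin.
to_csv() {
  awk '{
    # Class label: "+1" becomes 1, anything else becomes 0
    printf "%s", ($1 == "+1") ? "1" : "0"
    prev = 0
    for (i = 2; i <= NF; i++) {
      split($i, colval, ":")
      for (j = prev + 1; j < colval[1]; j++) printf ",0"
      printf ",%s", colval[2]
      prev = colval[1]
    }
    for (j = prev + 1; j <= 123; j++) printf ",0"
    printf "\n"
  }'
}

# First sample line from the question.
line='-1 3:1 11:1 14:1 19:1 39:1 42:1 55:1 64:1 67:1 73:1 75:1 76:1 80:1 83:1'

# Show the first 20 values (label plus columns 1 to 19) of the converted row.
printf '%s\n' "$line" | to_csv | cut -d, -f1-20
# prints 0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1
```

The printed values match the start of the expected row in the question, apart from commas replacing the spaces.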

ajayram · January 26, 2012, 4:34am

Hello,

Thanks a lot. It is working properly now. I will mark this thread as solved.