Need to preprocess a text file and convert it to CSV

ajayram · August 21, 2015, 2:57am

Hello,

I am working with machine learning and would like to apply my regression algorithms to binary classification datasets.

So I came across this adult dataset, LIBSVM Data: Classification (Binary Class)

It is a binary dataset: the features take only the values 1 and 0.

I wanted to download and use it. However, it is not in CSV format.
It is in this format:

-1 5:1 7:1 14:1 19:1 39:1 40:1 51:1 63:1 67:1 73:1 74:1 76:1 78:1 83:1 
-1 3:1 6:1 17:1 22:1 36:1 41:1 53:1 64:1 67:1 73:1 74:1 76:1 80:1 83:1 
-1 5:1 6:1 17:1 21:1 35:1 40:1 53:1 63:1 71:1 73:1 74:1 76:1 80:1 83:1 
-1 2:1 6:1 18:1 19:1 39:1 40:1 52:1 61:1 71:1 72:1 74:1 76:1 80:1 95:1 
-1 3:1 6:1 18:1 29:1 39:1 40:1 51:1 61:1 67:1 72:1 74:1 76:1 80:1 83:1 
-1 4:1 6:1 16:1 26:1 35:1 45:1 49:1 64:1 71:1 72:1 74:1 76:1 78:1 101:1 
+1 5:1 7:1 17:1 22:1 36:1 40:1 51:1 63:1 67:1 73:1 74:1 76:1 81:1 83:1 
+1 2:1 6:1 14:1 29:1 39:1 42:1 52:1 64:1 67:1 72:1 75:1 76:1 82:1 83:1 
+1 4:1 6:1 16:1 19:1 39:1 40:1 51:1 63:1 67:1 73:1 75:1 76:1 80:1 83:1 
+1 3:1 6:1 18:1 20:1 37:1 40:1 51:1 63:1 71:1 73:1 74:1 76:1 82:1 83:1 
+1 2:1 11:1 15:1 19:1 39:1 40:1 52:1 63:1 68:1 73:1 74:1 76:1 80:1 90:1

So the first field of each line is the class variable, and the rest of the row
indicates which columns are 1.

How do I convert this to a CSV in which the columns that are 0 also appear?
For example, for the input row -1 5:1 7:1 14:1, I should get this output row:

-1 0 0 0 0 1 0 1 0 0 0 0 0 0 1

Maybe a shell script with some awk programming would be needed.

Can someone help me out?

Don_Cragun · August 21, 2015, 3:46am

Yes, it is easy to do something like this with an awk script...

Normally, a CSV file would have the same number of fields in every output line. There doesn't seem to be anything in your input file that indicates how many fields are present in the data. (We know that it is at least 101 fields, but we have no idea how many zero fields should appear at the end of each output line.) How is your script supposed to determine the number of fields to include in the output?

A CSV file also usually uses a comma as the field separator, but you seem to want a space character as the field separator. Is that correct?

Will there ever be anything other than :1 at the ends of the input fields (other than the 1st field)? For example, if a line had a lot of ones and a few zeros, could the input use fields ending with :0 instead of :1 to produce a shorter input line? Is all of the data you want to process in this same format? For example, the "diabetes" files on the site you referenced are in a completely different format.

Will each input line have the same number of fields as in your sample input? Or, can the number of ones in input and output lines vary?
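As a rough way to explore the first question, the sketch below scans a file and prints the largest feature index it contains. The function name max_feature_index is hypothetical and not from this thread. Note that the result is only a lower bound on the real number of features, since trailing columns that are zero in every row never show up in the input.

```shell
# max_feature_index [FILE...]: print the largest index found in idx:value pairs.
# Hypothetical helper for illustration; reads stdin when no file is given.
max_feature_index() {
    awk '
    {
        for (i = 2; i <= NF; i++) {       # $1 is the class label, skip it
            split($i, pair, ":")          # pair[1] = index, pair[2] = value
            if (pair[1] + 0 > max) max = pair[1] + 0
        }
    }
    END { print max + 0 }                 # prints 0 for empty input
    ' "$@"
}
```

Running it on the eleven sample lines in post #1 would report 101, which matches the "at least 101 fields" observation above.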

RudiC · August 21, 2015, 5:12am

While Don Cragun raises reasonable questions that still need to be answered, here is a solution to the request as stated in post #1:

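awk '
        {printf "%s", $1                # class label
         C=0                            # last column written so far
         # FS splits the "idx:1" pairs, so $2, $4, ... hold the column indices
         for (i=2; i<NF; i+=2)  {for (; ++C < $i;) printf " 0"  # zeros up to column $i
                                 printf " 1"
                                }
         printf "\n"
        }
' FS="[ :]" file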

It can easily be adapted to altered specifications once the questions above have been answered.
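Because this script stops printing at the last 1 in each row, the output rows differ in length: a row whose last pair is 83:1 gets 84 fields, while one ending in 101:1 gets 102. A minimal sketch for checking this on any space-separated output is shown below; the function name field_counts is an assumption made for illustration.

```shell
# field_counts: read space-separated rows on stdin and print each distinct
# number of fields once, in ascending order. Hypothetical helper.
field_counts() {
    awk '{ print NF }' | sort -n -u
}

# Possible usage, with the script above saved as a hypothetical convert.awk:
#   awk -f convert.awk FS="[ :]" file | field_counts
```

A single number in the result would mean every row has the same width; several numbers mean the rows still need padding.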

ajayram · September 11, 2015, 9:29am

Hi,

I left out the number of fields: it is 123. That is, there are 123 binary feature values plus a class variable, which is either +1 or -1.

Each line contains only 0s and 1s plus the class variable (+1 or -1), which appears only once, at the beginning of the line.

Hope that clarifies all the doubts.

And yes, I want a CSV, i.e. values separated by commas.

Don_Cragun · September 11, 2015, 2:29pm

The following seems to do what you want:

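awk -v nof=123 -v OFS=, '
{	printf("%s%s", $1, OFS)	# class label first
	f = 1			# next output column to write
	for(i = 2; i <= NF; i++) {
		# $i + 0 turns "5:1" into the number 5; pad with zeros up to it
		while(f < $i + 0 && f <= nof)
			printf("0%s", f++ < nof ? OFS : ORS)
		if(f == $i + 0 && f <= nof)
			printf("1%s", f++ < nof ? OFS : ORS)
	}
	# fill the remaining columns with zeros and end the line
	while(f++ <= nof)
		printf("0%s", f <= nof ? OFS : ORS)
}' file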
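The script above relies on the indices appearing in increasing order and on every listed value being 1, which matches the data described in this thread. If a file in the same idx:value layout carried other values or unsorted indices, a variant along the following lines might work. This is only a sketch: the function name libsvm_to_csv and its argument order are assumptions, and indices greater than the requested width are silently dropped.

```shell
# libsvm_to_csv NOF [FILE...]: expand sparse "label idx:value ..." rows into
# comma-separated rows with NOF feature columns. Hypothetical helper.
libsvm_to_csv() {
    nof=$1
    shift
    awk -v nof="$nof" -v OFS=, '
    {
        split("", val)                    # clear the per-line value table
        for (i = 2; i <= NF; i++) {
            split($i, pair, ":")          # pair[1] = index, pair[2] = value
            val[pair[1] + 0] = pair[2]
        }
        line = $1                         # class label first
        for (f = 1; f <= nof; f++)        # unlisted columns default to 0
            line = line OFS ((f in val) ? val[f] : 0)
        print line
    }
    ' "$@"
}
```

For the data in this thread it would be called as libsvm_to_csv 123 file, and it should produce the same rows as the script above, since every listed value there is 1.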