Extract certain columns from big data

The dataset I'm working on is about 450G, with about 7,000 columns and 30,000,000 rows.
I want to extract about 2000 columns from the original file to form a new file.
I have the list of the column numbers I need, but don't know how to extract them.
Thanks!

Is the file delimited (like with commas between fields) or fixed record length?


happypoker,
You can use GNU awk with an array holding the desired column numbers to store and print the data.
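For example, a minimal sketch of that idea, assuming the wanted column numbers sit one per line in a file (cols.list, bigfile, and newfile are hypothetical names; add -F'\t' -v OFS='\t' if the data turns out to be tab-delimited):

awk '
    NR == FNR { want[++n] = $1; next }    # first file: the list of column numbers
    { for (i = 1; i <= n; i++)
          printf "%s%s", $(want[i]), (i < n ? OFS : ORS) }
' cols.list bigfile > newfile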


Sorry I didn't make it clear.
Yes, the file is tab delimited.

If you can post some sample data, it may help us determine the actual code.

You want some columns and all rows, not values keyed by certain columns? I'd say 'cut', but I have found cut disappointingly slow in some situations. No chance the rows or columns are fixed length? Then I'd go with C, mmap(), and a 64-bit compile, but awk sounds like a good fit: field oriented, with a configurable delimiter.

Tools like Ab Initio divide the file into N-byte blocks and process it in parallel. You could do the same with a script that seeks to an offset and reads only so far, but it gets tricky at the block boundaries: the row containing byte N+1 is probably partial, so you have to decide which instance owns it and make sure the other instance ignores it. My thought is that every instance reads its N bytes, skips the first (partial) line unless it is the first instance, and reads past its end offset to finish the last line it started.
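For instance, a rough sketch of that divide-and-conquer idea using GNU split, which sidesteps the partial-row problem entirely by splitting only at line boundaries (the 1G block size and the 100-6700 column range are placeholders):

# split into ~1 GiB pieces on line boundaries, with 3-digit numeric suffixes
split -C 1G -d -a 3 file chunk.
# extract the wanted columns from each piece in parallel
for f in chunk.???; do
    cut -f 100-6700 "$f" > "$f.out" &
done
wait
# fixed-width numeric suffixes keep the glob in the right order
cat chunk.???.out > extracted

In practice you would cap the number of concurrent jobs rather than background 450 of them at once.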

Welcome to big data.

Thank you! I'm able to get the list of column numbers into an array, but still don't know how to print the corresponding columns out.

The data is something like:

000 A B C D 2.22 3.4 1.03 .....
001 B F S D 3.2 2.1 3.2 2.3 .....
....
...
132 F S F X 2.3 3.4 5.3 2.1 ....

It has about 6,600 columns and 30,000,000 lines. That's too big to hold the needed columns in an array; I have to print the needed columns line by line.

Happypoker,
check it out:

# cat file
000 A B C D 2.22 3.4 1.03
001 B F S D 3.2 2.1 3.2 2.3
132 F S F X 2.3 3.4 5.3 2.1

-- Put into array & print:

$ awk '{for (i=1;i<=NF;i++) A[NR,i]=$i; for (j=1;j<=NF;j++) printf "%s ", A[NR,j]; printf "\n"}' file
000 A B C D 2.22 3.4 1.03
001 B F S D 3.2 2.1 3.2 2.3
132 F S F X 2.3 3.4 5.3 2.1
  • Now you can control how many columns are output with the loop bound on i;
  • here 5 columns are taken.
$ awk '{for (i=1;i<=5;i++) A[NR,i]=$i; for (j=1;j<=5;j++) printf "%s ", A[NR,j]; printf "\n"}' file
000 A B C D
001 B F S D
132 F S F X

cheers,


Hi Happypoker,

Adding to that, if you want to print a range of columns, say from start=100 to end=6700, you can do it as below:

$ s=100;e=6700
$ awk -v s1=$s -v e1=$e '{for (i=1;i<=e1;i++) A[NR,i]=$i; for (j=s1;j<=e1;j++) printf "%s ", A[NR,j]; printf "\n"}' file

Enjoy, have fun!

Why store in an array? Print directly:

awk '{for (i=start;i<=end;i++) printf "%s ",$i; printf "\n"}' start=100 end=6700 file

With cut

cut -f 100-6700 file
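And since the original goal was about 2,000 scattered columns rather than one contiguous range: cut -f also accepts a comma-separated list of fields and ranges, so you could build the list from a file of column numbers (cols.list, one number per line, is a hypothetical name; 2,000 numbers is still well below ARG_MAX):

cut -f "$(paste -s -d, cols.list)" file > newfile

Note that cut always emits the selected fields in input order, without duplication, regardless of the order in the list.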

The standards only define the behavior of awk when its input files are text files. A file with 6,600 fields isn't likely to be a text file on any UNIX or Linux system I've seen. The maximum length of a line in a text file is LINE_MAX bytes, including the terminating newline character. (You can get the value of LINE_MAX on your system using the command:

getconf LINE_MAX

The standards allow LINE_MAX to be as low as 2,048 bytes.) Some implementations of awk may accept longer lines and behave as you would like them to. Others will print a diagnostic if an input or output line exceeds LINE_MAX. Others will silently truncate long lines (in this case probably producing truncated output lines). And others may read LINE_MAX bytes, treat that as a line, and then read the next LINE_MAX bytes as the next input line (guaranteeing garbage output for your application). Note that even if you have an awk that handles long lines as you want it to, trying to build arrays of 2,000 fields from 30,000,000 input records and print the results at the end would require awk to have access to a minimum of 600,000,000,000 bytes (2,000 × 30,000,000 × 10 bytes) even if each field is only one byte long (one byte of data + a terminating null byte + an 8-byte pointer to the string for each field). With data of this magnitude, you will have to process it on the fly, not accumulate data and process it when you find the end of your input file.
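A quick, hedged sanity check of your own awk is to feed it one line far longer than a 2,048-byte LINE_MAX and see whether the field count survives:

# build one line of 10,001 tab-separated fields (about 20 KB) and
# ask a second awk how many fields it sees
awk 'BEGIN { for (i = 0; i < 10000; i++) printf "x\t"; print "x" }' |
awk -F'\t' '{ print NF }'    # 10001 if long lines are handled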

The standards do require that conforming implementations of the cut utility be able to handle arbitrary line lengths (assuming they have access to the memory they need to hold the lines being processed). The standards also require that fold (with the -b option) and paste be able to break apart and recreate files that would be text files except for unlimited line lengths, and cat and wc can work on files of any size and type. All other standard text processing utilities (e.g., awk, ed/ex, grep, sed, vi, etc.) have unspecified or undefined behavior if their input files are not text files with lines no longer than LINE_MAX.

For really big data, here is a C program:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
  int c;                        /* int, not char, so EOF can be detected */
  FILE *fin;
  char IFS = '\t';              /* field separator */
  int cnt = 0;                  /* current field, counted from 0 */
  int start = atoi(argv[1]);
  int end = atoi(argv[2]);
  fin = fopen(argv[3], "rb");

  while ((c = getc(fin)) != EOF)
  {
     if (c == IFS) ++cnt;
     else if (c == '\n') { cnt = 0; putc(c, stdout); }
     if (cnt >= start && cnt <= end) putc(c, stdout);
  }
  fclose(fin);
  return 0;
}
gcc -o thisprogram thisprogram.c
./thisprogram 100 6700 file

Compared to the way that awk and cut index fields, the C code is off by one; it is 0-indexed instead of 1-indexed. So the 100-6700 specified is actually 101-6701 in awk/cut.

If the first field (start=0) is part of the range, every newline will be duplicated in the output.

If the first field is not part of the range (start > 0), there will always be a leading IFS character.
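Both are easy to demonstrate on a toy file, assuming the program was compiled as thisprogram per the lines above:

printf 'a\tb\tc\n' > t
./thisprogram 0 0 t    # prints "a" followed by two newlines
./thisprogram 1 1 t    # prints a leading tab, then "b"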

Regards,
Alister

The duplicate newline printing was a bug.
With field numbers starting at 1, and no leading IFS character:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
  int c;                        /* int, not char, so EOF can be detected */
  FILE *fin;
  char IFS = '\t';              /* field separator */
  int start = atoi(argv[1]);
  int end = atoi(argv[2]);
  int field = 1;                /* current field, counted from 1 */
  fin = fopen(argv[3], "rb");

  while ((c = getc(fin)) != EOF)
  {
    if (c == '\n') { field = 1; putc(c, stdout); }
    else if (c == IFS && ++field == start)
      ;                         /* swallow the separator just before the range */
    else if (field >= start && field <= end)
      putc(c, stdout);
  }
  fclose(fin);
  return 0;
}
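Compile and run it the same way as before:

gcc -o thisprogram thisprogram.c
./thisprogram 100 6700 file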

MadeInGermany,
I tried the C program after compiling it, but somehow it doesn't work for me; I tried it on Ubuntu.

# ./thisprogram 100 200 col8k



ubuntu:#

It prints nothing, just 4 blank lines.

It's for a tab-separated file.
For a space-separated file you must change IFS from '\t' to ' ' and recompile.
With cut this would be

cut -d ' ' -f 100-200 col8k
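If you'd rather not recompile for every delimiter, one small tweak (an untested sketch) is to take the separator as an optional fourth argument, replacing the fixed declaration:

char IFS = (argc > 4) ? argv[4][0] : '\t';   /* optional 4th argument: the delimiter */

Then ./thisprogram 100 200 col8k ' ' would handle the space-separated case without a rebuild.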