How to extract a subset from a huge dataset

cliffyiu · March 13, 2010, 11:06am

Hi, All

I have a huge file (450G). It is tab-delimited and looks like this:

x1 A 50020 1
x1 B 50021 8
x1 C 50022 9
x1 A 50023 10
x2 D 50024 5
x2 C 50025 7
x2 F 50026 8
x2 N 50027 1
:
:

Now I want to extract a subset from this file. In the subset, column 1 is x10 and column 3 is between 600000 and 30000000. I wrote the following Perl script, but it doesn't work:

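#!/usr/bin/perl
use strict;
use warnings;

my $file1 = $ARGV[0]; # Input file
my $file2 = $ARGV[1]; # Output file

# Stop with a message if the input cannot be opened
open (IN, '<', $file1) or die "Cannot open $file1: $!";
while (my $line = <IN>)
{
  chomp($line);
  my @array = split(/\t/,$line);

  # Column 1 must be x10
  if ($array[0] eq 'x10')
  {
    # Column 3 must be inside the range
    if (($array[2] >= 600000) && ($array[2] <= 26279795))
    {
      open (OUT, '>>', $file2) or die "Cannot open $file2: $!";
      print OUT "$line\n";
      close OUT;
    }
  }
}
close IN;
exit;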

I guess the input and output files are both so big that my script can't handle them.

Does anyone know a good way to do this? Perl or shell scripts are preferred.

All your help will be appreciated!

EAGL · March 13, 2010, 11:25am

nawk -F'\t' '$1=="x10" && $3>=600000 && $3<=30000000' FILE > SubFILE

cliffyiu · March 13, 2010, 12:10pm

Hi, EAGL

Thanks for your reply. I just tried your command, but it failed with:

-bash: nawk: command not found

It seems we don't have nawk on our server.

Do you have any other ideas? Can I just use awk?

Franklin52 · March 13, 2010, 12:54pm

Try awk instead, or /usr/xpg4/bin/awk on Solaris:

awk '$1=="x10" && $3>=600000 && $3<=30000000' FILE > SubFILE
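
Before running this against the 450G file, it may be worth checking the filter on a few hand-made rows. The sketch below is only an illustration: the sample rows are invented, the bounds are treated as inclusive (an assumption, matching the >= and <= in the Perl script), and -F'\t' is added so that fields are split on tabs only.

```shell
# Build a tiny tab-delimited sample (made-up rows, not taken from the real file).
sample=$(mktemp)
printf 'x10\tA\t599999\t1\n'   >  "$sample"   # below the range, dropped
printf 'x10\tB\t600000\t8\n'   >> "$sample"   # on the lower bound, kept
printf 'x100\tC\t700000\t9\n'  >> "$sample"   # different key, dropped
printf 'x10\tD\t30000000\t5\n' >> "$sample"   # on the upper bound, kept
printf 'x10\tE\t30000001\t7\n' >> "$sample"   # above the range, dropped

# Exact match on column 1, inclusive numeric range on column 3.
result=$(awk -F'\t' '$1 == "x10" && $3 >= 600000 && $3 <= 30000000' "$sample")
printf '%s\n' "$result"

rm -f "$sample"
```

Only the B and D rows should come back. Note that the x100 row is dropped because the comparison is an exact string match; a regular-expression match such as $1~/x10/ would have kept it.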