awk - splitting 1 large file into multiple based on same key records

kam66 · January 18, 2011, 4:38pm

Hello gurus,

I am new to "awk" and am trying to break a large file of 4 million records into several output files of half a million records each. At the same time, I want records with the same key to stay in the same output file rather than being spread across files.

e.g. my data is like:

Row_Num, Field1 (key), Field 2, Field3,......
 
500000, 100 , ABC, 10A --> goes into "file1"
500001, 100, DEF, 20A --> should also go in "file1" instead of "file2"
500002, 200, GHI, 30A --> should be the 1st record written in "file2"

In the above example, once NR reaches half a million, I also want to compare the key on each line with the key of the previous line. If they match, the record should go into the same output file instead of a new one.

Any help will be highly appreciated.

Thanks

Corona688 · January 18, 2011, 5:18pm

What is your system? Different systems often have radically different versions of awk.

kam66 · January 18, 2011, 5:31pm

I am on AIX Version 5.3!

Thanks,

guruprasadpr · January 18, 2011, 8:28pm

Hi

Try this:

awk '{x=$2;gsub(" ", "", x);print >x}' FS=, file

The output files will be named after the key values: 100, 200, and so on.
Guru

rdcwayx · January 18, 2011, 10:34pm

awk -F , '{print > (int($2) ".file")}' input.txt

kam66 · January 19, 2011, 9:40am

Hello Guru and rdcwayx,

Thanks for the solutions, but they don't fulfil my requirement. As I mentioned, my data file contains approximately 4 million records, and I want to create output files of 500,000 records each, named file1, file2, ... file10.
While splitting the file, when the 500,000-record mark is reached, I want to make sure that records with the same key (e.g. 100) are not split across two output files. All records with the same key should stay in one output file; whether that is file1 or file2 doesn't matter.

Not very neat coding, but I was able to split on every 500,000 records with the following code. Keeping same-key records together is the challenge for me.

awk 'BEGIN { FS = "~" }   # set the separator before the first record is read
{
    a = $2                 # key field (not used yet)
    if (NR <= 500000) { print $0 > "file1" }
    if (NR > 500000 && NR <= 1000000) { print $0 > "file2" }
    if (NR > 1000000 && NR <= 1500000) { print $0 > "file3" }
    if (NR > 1500000 && NR <= 2000000) { print $0 > "file4" }
    if (NR > 2000000 && NR <= 2500000) { print $0 > "file5" }
    if (NR > 2500000 && NR <= 3000000) { print $0 > "file6" }
    if (NR > 3000000 && NR <= 3500000) { print $0 > "file7" }
    if (NR > 3500000 && NR <= 4000000) { print $0 > "file8" }
    if (NR > 4000000 && NR <= 4500000) { print $0 > "file9" }
    if (NR > 4500000 && NR <= 5000000) { print $0 > "file10" }
}' CF_SEQ.srt

Best regards,
K
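
The fixed-size split described in this post can also be written without one "if" per range, by computing the file number from NR. This is only a sketch: the "size" variable and the sample.txt file are made up for the demonstration, and "~" is assumed to be the separator as in the code above. It does not yet keep same-key records together.

```shell
# Tiny stand-in for the real input (hypothetical data, "~" separated).
printf '%s\n' '1~a' '2~a' '3~b' '4~b' '5~c' > sample.txt

# size=2 keeps the demo small; the real run would use size=500000.
awk -F'~' -v size=2 '{
    # Records 1..size go to file1, size+1..2*size to file2, and so on.
    out = "file" (int((NR - 1) / size) + 1)
    print > out
}' sample.txt
```

With the five sample records this leaves two records each in file1 and file2 and the last record in file3.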

rdcwayx · January 19, 2011, 6:55pm

awk 'BEGIN{i=1;t=0} {if (NR%500000==0) {t=1;a=int($2)} else {if (t==1&&int($2)!=a){t=0;++i}} {print > ("file" i)}}' infile
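
For readers who want the same idea spelled out, here is a commented sketch of a variant. It is not the one-liner above: it counts records per output file instead of testing NR%500000, and it compares keys as strings instead of using int(). It assumes comma-separated input with the key in field 2, and that the file is already sorted so equal keys are adjacent. The sample.csv data and the "size" variable are invented for the demo.

```shell
# Small stand-in for the real data (hypothetical rows, key in field 2).
cat > sample.csv <<'EOF'
1, 100, ABC, 10A
2, 100, DEF, 20A
3, 100, GHI, 30A
4, 200, JKL, 40A
5, 300, MNO, 50A
6, 300, PQR, 60A
7, 400, STU, 70A
EOF

# size=2 keeps the demo small; the thread's target would be size=500000.
awk -F, -v size=2 '
{
    key = $2
    gsub(/[ \t]/, "", key)          # ignore padding around the key

    # Switch files only when the current one is full AND the key changes,
    # so a run of equal keys is never cut in two.
    if (NR == 1 || (count >= size && key != prev)) {
        if (out != "") close(out)   # done with the previous file
        n = n + 1
        out = "file" n
        count = 0
    }

    print > out
    count++
    prev = key
}' sample.csv
```

With this sample, rows 1 to 3 (key 100) end up in file1 even though the limit is 2, rows 4 to 6 go to file2, and row 7 goes to file3. Output files can therefore run slightly over the nominal size when a key group straddles the limit.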