Every nth line with different starting point

Hi everyone,

I am trying to split a column of numbers into different files, as in the following example:

main file(input)

1                                           
2                                         
3                                          
.                                            
.
2000
2001
2002
.
.
4000
4001
4002
.
.
6000
6001
6002

The outputs

file1     file2        file3 
1         2            3
2000      2001         2002
4000      4001         4002
6000      6001         6002

So, the idea is to pick every 2000th line (with a different starting line for each output file) and put those lines into a new file, continuing to the end of the input.
I've tried awk/sed inside while loops, but it doesn't work because awk/sed don't seem to accept shell variables ($i) in their arguments. My script is in csh.
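For what it's worth, awk does take variables from the shell, via -v var=value (or a var=value operand after the program). A minimal sketch in sh syntax; the same awk invocation works from csh with a csh variable:

```shell
# Sketch: awk accepts variables from the shell via -v.
# (From csh: set step = 3 ; awk -v n=$step ... works the same way.)
step=3
printf '%s\n' 1 2 3 4 5 6 7 | awk -v n="$step" 'NR % n == 0' > picked.out
```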

I very much appreciate your help

Hi.

Let me know if I misunderstood you:

awk '{A = A?A " "$1:$1} (NR%LN+1) == LN {print A > "file" ++C; A=""} END { if (A) print A > "file" ++C}' LN=2000 infile

Thank you so much Scottn for your prompt reply.
Is that supposed to create 3 different files in the following format?

The outputs

file1

1 
2000
4000 
6000 
 
file2 
2 
2001
4001
6001
 
file3
3
2002
4002
6002

If so, what's A? or what does A=A?A do?
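For reference, A = A ? A " " $1 : $1 is awk's ternary (conditional) operator used as an accumulator: if A is non-empty, append a space and $1 to it; otherwise start A with $1, which avoids a leading space. A small sketch:

```shell
# The accumulator idiom: A = A ? A " " $1 : $1 appends $1 to A,
# space-separated, using $1 alone while A is still empty.
printf '%s\n' x y z | awk '{A = A ? A " " $1 : $1} END {print A}' > joined.out
```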

Thanks again

Hi.

Yep! I misunderstood you.

awk '{print > "file" (NR-1)%3+1}' infile 
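A quick demo of that one-liner on a 6-line input. The parentheses around the computed file name are an addition here; some awks misparse an unparenthesized concatenation after ">":

```shell
# Demo: 6 input lines split round-robin into file1..file3.
# Parentheses around the file name added for portability.
printf '%s\n' 1 2 3 4 5 6 > infile
rm -f file1 file2 file3
awk '{print > ("file" ((NR-1)%3+1))}' infile
```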

This works perfectly for a limited number of files. However, with the actual data there will be 2000 output files, and the code gives me the error "too many output files"; e.g. if I try 2000 instead of 3 in (NR-1)%3+1 it doesn't like it!!

Any idea for the entire data?

cheers

awk '{close(out); out="file" (NR-1)%3+1; print > out}' infile

Thanks vger. I knew I should have had more than 30 records in my test scenario!

awk '{out="file" (NR-1)%3+1; print > out; close(out) }' infile
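A caveat worth adding to the close() variants (this is an assumption about the awks involved, not something stated in the thread): once close(out) has run, a later print > out reopens the file and truncates it, so each output file can end up holding only the last line sent to it. Appending with >> keeps earlier lines across reopens; just remove stale output files first, since >> never truncates:

```shell
# Assumed behavior (POSIX-style awks): after close(out), a later
# "print > out" reopens out and truncates it, leaving one line per
# file. ">>" appends on reopen instead; remove stale files first.
printf '%s\n' 1 2 3 4 5 6 > infile
rm -f part1 part2 part3
awk '{out = "part" ((NR-1)%3+1); print >> out; close(out)}' infile
```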

Thanks!!
Well, the actual text file has 59850 records, where I have to pick every 3325th record and output to different files!! So I have to place some number around 18, which still gives me the same error, "too many output files"!!
Can you please help me more? Also, what if I have more than one field in the input file? Will a for loop inside the awk structure do?

Actually, I've just run this with 600,000 records using the original awk

awk '{print > "file" (NR-1)%3+1}' infile

and it did not complain, which in hindsight makes sense since it's only opening three files.

And with 10000 files:

 awk '{print > "file" (NR-1)%10000+1}' infile

No problem there either.

You can have as many fields as you like, it writes the whole record to the files.

Which OS / awk version are you using?

Here's a version that works without running into the open-file limit. It's an awk script and should be in a file:

BEGIN {
        print RF_CNT;
        print LN;
}

(NR <= RF_CNT) {                # output first records
                a[NR] = "file" NR;
                print >a[NR];
                close(a[NR]);   # expensive but stays under fopen limit
}

((i = NR%LN +1) in a) {         # see if this line is a candidate
                print >>a;   # add it
                close(a);    # close it to stay under fopen limit
}

Use this command line for it:

awk -f num.awk -v RF_CNT=20 -v LN=2000 num.txt

The value of RF_CNT is the number of files to make. The first line of each file will contain line N of the input file (i.e. file 4 contains line 4 of the input file).

The value of LN is the line offset/modulo used to decide whether a line needs to go to a file, and to which file.

num.txt is the input file. I used a file with the numbers 1-8000, one per line.

It opens and closes the output files to avoid the open file limits...

The BEGIN clause is just to confirm inputs and can be removed....
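Note that print >> a redirects to the array a itself, which most awks reject (and old Solaris awk reports as a syntax error); indexing the array, a[i], is what the script needs. Below is a corrected, runnable sketch with small parameters (RF_CNT=2, LN=3), plus an NR > RF_CNT guard added here so the seed lines are written only once:

```shell
# Corrected sketch of the num.awk logic: a[i] instead of the bare
# array "a"; NR > RF_CNT guard added so seed lines are written once.
printf '%s\n' 1 2 3 4 5 6 7 8 > num.txt
rm -f file1 file2
awk -v RF_CNT=2 -v LN=3 '
(NR <= RF_CNT) {                        # seed file N with line N
        a[NR] = "file" NR
        print > a[NR]
        close(a[NR])                    # stay under the fopen limit
}
NR > RF_CNT && ((i = NR % LN + 1) in a) {
        print >> a[i]                   # append candidate lines
        close(a[i])
}' num.txt
```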

Dear Scottn,

I switched to SUN systems and the code kinda worked. I mean, the number of output files was correct, but I had only one line in each output file. For the test case, I had one input file with 40950 records and I wanted multiples of 2275 to be printed in each file, which should result in 40950/2275 = 18 lines per file.
In the last file, file2275, I had the last line of the input file; in file2274 I had the 40949th line of the input file, and so on down to file1!!!!
I do not know what's wrong and am really confused.
The code works for a small number of modulos, up to about 10, but doesn't like big numbers!!

JP2542a

I tried your code but it gives me an error on the line beginning with (( i = NR...): syntax error, bailing out near this line!!

Thank you so much

We'll get there!

Use /usr/xpg4/bin/awk on Solaris. I say this every time the word Solaris crops up. Gives me time to think of something more useful to say!!

I will try it on Solaris tomorrow.

Cheers

I ran the script on both CentOS and cygwin and it worked. What exactly did you type and what was the exact output?

Dear Scottn,

Surprisingly, when I tried your first code on SUN, i.e.

awk '{print > "file" (NR-1)%3+1}' infile

it worked pretty well!! Whereas when I tried the code with close(out), the outputs were garbage!
So, if I have 10 fields and I want each field processed separately, what should I do? By this I mean: if 2000 files are generated from one column, I would like to end up with 2000*10 files. Does a for loop
(for i=1; i<=NF; i++) work? If so, where in the code?

Dear JP2542a,
The following is the code I've used. The bug shows up when I try it on Solaris machines, while when I run it on SUNs it doesn't complain, but it doesn't generate any files. Don't bother yourself if it's something weird, because I've successfully run Scottn's code.

I should very much appreciate your help:)
I'll probably get back here with some other questions.

Cheers

  BEGIN {print RF_CNT;print LN;}(NR <= RF_CNT) { a[NR] = "file" NR;print >a[NR];close(a[NR]);}(( i = NR %LN + 1 ) in a ) {print >> a;close(a);}

I'm always surprised when my code works pretty well!!

You don't have 2000 files, you have 3 files. Do you mean you want 3 * 10 files?

If so, then a for-loop, as you suggest would do fine.

(untested because my VM just went on the blink, something like...)

{ for( I = 1; I <= NF; I++ )
    print $I > ("file" (NR-1)%3+1 "_" I)
}

Well, (NR-1)%3+1 was just a test; the increments between lines were actually 3325 in my actual input file, so I replaced 3 with 3325! So I am dealing with something like 33250 files in total, considering 10 fields!
cheers

Yes, you're right, sorry, forgot you changed that.

Hi Scottn,

I tried to change the code to get the desired outputs but it didn't happen. Assuming (NR-1)%3+1, I'll give a simple example of what I want:


input file

1      7      13
2      8      14
3      9      15
4     10      16
5     11      17
6     12      18

output files

file1     file2     file3
1          2         3
4          5         6

file4      file5     file6
7           8         9
10         11        12

file7      file8     file9
13         14        15
16         17        18

So, for each field there will be three files containing only the contents of that field, and the same for the other two fields. In general, the number of output files would be 3*3 = 9.

Do you have any tricks for this?


awk '
{ for( I = 1; I <= NF; I++ ) {
print $I > "file" (NR-1)%FILES+1 + (I-1)*FILES
}
}
' FILES=3 infile
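A runnable check of this on the 6x3 example above. The parentheses around the computed file name are an addition for portability; everything else matches the post:

```shell
# Demo of the per-field round-robin split on the 6x3 example:
# field I of row NR goes to file (NR-1)%FILES+1 + (I-1)*FILES.
cat > infile <<'EOF'
1 7 13
2 8 14
3 9 15
4 10 16
5 11 17
6 12 18
EOF
rm -f file1 file2 file3 file4 file5 file6 file7 file8 file9
awk '
{ for (I = 1; I <= NF; I++)
    print $I > ("file" ((NR-1)%FILES+1 + (I-1)*FILES))
}' FILES=3 infile
```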

Your command line should look like this if you don't put the code in a file:

awk -v RF_CNT=2000 -v LN=2000 'BEGIN {print RF_CNT; print LN;} (NR <= RF_CNT) {a[NR] = "file" NR; print > a[NR]; close(a[NR]);} ((i = NR % LN + 1) in a) {print >> a[i]; close(a[i]);}' num.txt