Every nth line with different starting point

Hi everyone,

I am trying to split a column of numbers into different files, as in the following example:

main file(input)

1                                           
2                                         
3                                          
.                                            
.
2000
2001
2002
.
.
4000
4001
4002
.
.
6000
6001
6002

The outputs

file1     file2        file3 
1         2            3
2000      2001         2002
4000      4001         4002
6000      6001         6002

So, the idea is to pick every 2000th line (with a different starting line for each output file) and put those lines into a new file, continuing to the end of the input.
I've tried awk/sed inside while loops, but it doesn't work because awk/sed don't seem to accept shell variables ($i) in their arguments. My script is in csh.
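For what it's worth, awk does take variables from the shell, via -v var=value (or a var=value operand after the program). A minimal sketch in sh syntax; the same awk invocation works from csh with a csh variable:

```shell
# Sketch: awk accepts variables from the shell via -v.
# (From csh: set step = 3 ; awk -v n=$step ... works the same way.)
step=3
printf '%s\n' 1 2 3 4 5 6 7 | awk -v n="$step" 'NR % n == 0' > picked.out
```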

I very much appreciate your help

Hi.

Let me know if I misunderstood you:

awk '{A = A?A " "$1:$1} (NR%LN+1) == LN {print A > "file" ++C; A=""} END { if (A) print A > "file" ++C}' LN=2000 infile

Thank you so much Scottn for your prompt reply.
Is that supposed to create 3 different files in the following format?

The outputs

file1

1 
2000
4000 
6000 
 
file2 
2 
2001
4001
6001
 
file3
3
2002
4002
6002

If so, what's A? or what does A=A?A do?
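For reference, A = A ? A " " $1 : $1 is awk's ternary (conditional) operator used as an accumulator: if A is non-empty, append a space and $1 to it; otherwise start A with $1, which avoids a leading space. A small sketch:

```shell
# The accumulator idiom: A = A ? A " " $1 : $1 appends $1 to A,
# space-separated, using $1 alone while A is still empty.
printf '%s\n' x y z | awk '{A = A ? A " " $1 : $1} END {print A}' > joined.out
```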

Thanks again

Hi.

Yep! I misunderstood you.

awk '{print > "file" (NR-1)%3+1}' infile 
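A quick demo of that one-liner on a 6-line input. The parentheses around the computed file name are an addition here; some awks misparse an unparenthesized concatenation after ">":

```shell
# Demo: 6 input lines split round-robin into file1..file3.
# Parentheses around the file name added for portability.
printf '%s\n' 1 2 3 4 5 6 > infile
rm -f file1 file2 file3
awk '{print > ("file" ((NR-1)%3+1))}' infile
```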

This works perfectly for a limited number of files. However, with the actual data there will be 2000 output files, and the code gives me the error "too many output files"; e.g. if I try 2000 instead of 3 in (NR-1)%3+1 it doesn't like it!!

Any idea for the entire data?

cheers

awk '{close(out); out="file" (NR-1)%3+1; print > out}' infile

Thanks vger. I knew I should have had more than 30 records in my test scenario!

awk '{out="file" (NR-1)%3+1; print > out; close(out) }' infile
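A caveat worth adding to the close() variants (this is an assumption about the awks involved, not something stated in the thread): once close(out) has run, a later print > out reopens the file and truncates it, so each output file can end up holding only the last line sent to it. Appending with >> keeps earlier lines across reopens; just remove stale output files first, since >> never truncates:

```shell
# Assumed behavior (POSIX-style awks): after close(out), a later
# "print > out" reopens out and truncates it, leaving one line per
# file. ">>" appends on reopen instead; remove stale files first.
printf '%s\n' 1 2 3 4 5 6 > infile
rm -f part1 part2 part3
awk '{out = "part" ((NR-1)%3+1); print >> out; close(out)}' infile
```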

Thanks!!
Well, the actual text file has 59850 records, where I have to pick every 3325th record and output to different files!! So I have to place some number around 18, which still gives me the same error, "too many output files"!!
Can you please help me more? Also, what if I have more than one field in the input file? Will a for loop inside the awk structure do?

Actually, I've just run this with 600,000 records using the original awk

awk '{print > "file" (NR-1)%3+1}' infile

and it did not complain, which in hindsight makes sense since it's only opening three files.

And with 10000 files:

 awk '{print > "file" (NR-1)%10000+1}' infile

No problem there either.

You can have as many fields as you like, it writes the whole record to the files.

Which OS / awk version are you using?

Here's a version that works without running into the open-file limit. It's an awk script and should be in a file:

BEGIN {
        print RF_CNT;
        print LN;
}

(NR <= RF_CNT) {                # output first records
                a[NR] = "file" NR;
                print >a[NR];
                close(a[NR]);   # expensive but stays under fopen limit
}

((i = NR%LN +1) in a) {         # see if this line is a candidate
                print >>a;   # add it
                close(a);    # close it to stay under fopen limit
}

Use this command line for it:

awk -f num.awk -v RF_CNT=20 -v LN=2000 num.txt

The value of RF_CNT is the number of files to make. The first line of each file will contain line N of the input file (i.e. file 4 contains line 4 of the input file).

The value of LN is the line offset/modulo used to decide whether a line needs to go to a file, and to which file.

num.txt is the input file. I used a file with the numbers 1-8000, one per line.

It opens and closes the output files to avoid the open file limits...

The BEGIN clause is just to confirm inputs and can be removed....
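Note that print >> a redirects to the array a itself, which most awks reject (and old Solaris awk reports as a syntax error); indexing the array, a[i], is what the script needs. Below is a corrected, runnable sketch with small parameters (RF_CNT=2, LN=3), plus an NR > RF_CNT guard added here so the seed lines are written only once:

```shell
# Corrected sketch of the num.awk logic: a[i] instead of the bare
# array "a"; NR > RF_CNT guard added so seed lines are written once.
printf '%s\n' 1 2 3 4 5 6 7 8 > num.txt
rm -f file1 file2
awk -v RF_CNT=2 -v LN=3 '
(NR <= RF_CNT) {                        # seed file N with line N
        a[NR] = "file" NR
        print > a[NR]
        close(a[NR])                    # stay under the fopen limit
}
NR > RF_CNT && ((i = NR % LN + 1) in a) {
        print >> a[i]                   # append candidate lines
        close(a[i])
}' num.txt
```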

Dear Scottn,

I switched to SUN systems and the code kinda worked. I mean, the number of output files was correct, but I had only one line in each output file. For the test case, I had one input file with 40950 records and I wanted multiples of 2275 to be printed in each file, which should result in 40950/2275 = 18 lines per file.
In the last file, file2275, I had the last line of the input file; in file2274 I had the 40949th line of the input file, and so on down to file1!!!!
I do not know what's wrong and am really confused.
The code works for a small number of modulos, up to about 10, but doesn't like big numbers!!

JP2542a

I tried your code but it gives me an error on the line beginning with (( i = NR...): syntax error, bailing out near this line!!

Thank you so much

We'll get there!

Use /usr/xpg4/bin/awk on Solaris. I say this every time the word Solaris crops up. Gives me time to think of something more useful to say!!

I will try it on Solaris tomorrow.

Cheers

I ran the script on both CentOS and cygwin and it worked. What exactly did you type and what was the exact output?

Dear Scottn,

Surprisingly, when I tried your first code on SUN, i.e.

awk '{print > "file" (NR-1)%3+1}' infile

it worked pretty well!! Whereas when I tried the code with close(out), the outputs were garbage!
So, if I have 10 fields and I want each field processed separately, what should I do? By this I mean: if 2000 files are generated from one column, I would like to end up with 2000*10 files. Does a for loop
(for i=1; i<=NF; i++) work? If so, where in the code?

Dear JP2542a,
The following is the code I've used. The bug shows up when I try it on Solaris machines, while when I run it on SUNs it doesn't complain, but it doesn't generate any files. Don't bother yourself if it's something weird, because I've successfully run Scottn's code.

I should very much appreciate your help:)
I'll probably get back here with some other questions.

Cheers

  BEGIN {print RF_CNT;print LN;}(NR <= RF_CNT) { a[NR] = "file" NR;print >a[NR];close(a[NR]);}(( i = NR %LN + 1 ) in a ) {print >> a;close(a);}

I'm always surprised when my code works pretty well!!

You don't have 2000 files, you have 3 files. Do you mean you want 3 * 10 files?

If so, then a for-loop, as you suggest would do fine.

(untested because my VM just went on the blink, something like...)

{ for( I = 1; I <= NF; I++ )
    print $I > ("file" (NR-1)%3+1 "_" I)
}

Well, (NR-1)%3+1 was just a test; the increments between lines were actually 3325 in my actual input file, so I replaced 3 with 3325! So I am dealing with something like 33250 files in total, considering 10 fields!
cheers

Yes, you're right, sorry, forgot you changed that.

Hi Scottn,

I tried to change the code to get the desired outputs but it didn't happen. Assuming (NR-1)%3+1, I'll give a simple example of what I want:


input file

1      7      13
2      8      14
3      9      15
4     10      16
5     11      17
6     12      18

output files

file1     file2     file3
1          2         3
4          5         6

file4      file5     file6
7           8         9
10         11        12

file7      file8     file9
13         14        15
16         17        18

So, for each field there will be three files containing only the contents of that field, and the same for the other two fields. In general, the number of output files would be 3*3 = 9.

Do you have any tricks for this?


awk '
{ for( I = 1; I <= NF; I++ ) {
print $I > "file" (NR-1)%FILES+1 + (I-1)*FILES
}
}
' FILES=3 infile
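A runnable check of this on the 6x3 example above. The parentheses around the computed file name are an addition for portability; everything else matches the post:

```shell
# Demo of the per-field round-robin split on the 6x3 example:
# field I of row NR goes to file (NR-1)%FILES+1 + (I-1)*FILES.
cat > infile <<'EOF'
1 7 13
2 8 14
3 9 15
4 10 16
5 11 17
6 12 18
EOF
rm -f file1 file2 file3 file4 file5 file6 file7 file8 file9
awk '
{ for (I = 1; I <= NF; I++)
    print $I > ("file" ((NR-1)%FILES+1 + (I-1)*FILES))
}' FILES=3 infile
```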

Your command line should look like this if you don't put the code in a file:

awk -v RF_CNT=2000 -v LN=2000 'BEGIN {print RF_CNT; print LN;} (NR <= RF_CNT) {a[NR] = "file" NR; print > a[NR]; close(a[NR]);} ((i = NR % LN + 1) in a) {print >> a[i]; close(a[i]);}' num.txt