Help on Splitting files - urgent

rajee · March 6, 2008, 3:26pm

Hi Script Masters, I have a strange requirement. Please help.
I am using the C shell.
I have a file like the one below, in sorted order:
22
23
25
34
37
45
67
342
456
476
543
677
789
Now I have to split the file so that the first 5 two-digit numbers are saved as aaa.in, the next 5 two-digit numbers as aab.in, and so on for all remaining two-digit numbers (any count). The first 5 three-digit numbers should then be saved in aac.in, the next 5 in aad.in, and so on...

In a nutshell:

file aaa.in should contain
22
23
25
34
37
and file aab.in should contain
45
67
and file aac.in
342
456
476
543
677
and file aad.in should be
789

The split of 5 numbers per file is constant; everything else will vary.

If you have any doubts, please ask me. Any help or clue will be greatly appreciated.
I know the split command can create multiple files like aaa, aab, but the problem I see is that the file name sequence is not carried over. If I use split twice, it will overwrite the aaa file instead of creating aac.

radoulov · March 6, 2008, 4:16pm

awk 'BEGIN { split("a b c d e f", f) }
x[length]++ % 5 == 0 { close(fn); fn = "aa"f[++c]".in" }
{ print > fn }' file

Use nawk or /usr/xpg4/bin/awk on Solaris.
Add more letters if needed

rajee · March 6, 2008, 4:51pm

Thanks for the reply, rad.
When I tried the code I got:
Awk :Syntax error near line 2
Awk : bailing out near line 2.
Can you please tell me what could be wrong?

Actually, the file might contain around 500,000 lines of numbers, so when I split it the names might go aaa, aab, aac and so on, to aba, abb, abc and so on, to aca, acb, acc, up to a maximum of zzz. From your code I believe it can go at most from aaa to aaz.

The file naming convention is cinaaa.in cinaab.in ...cinbaa...cinzzz.in

What changes do I have to make in your code to accommodate the above?

aajan · March 7, 2008, 6:37am

Hope this works!

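#!/usr/bin/ksh
# Split the two-digit numbers in "samp" into aa0.txt, aa1.txt, ... (5 per file)
j=0   # lines written to the current output file
k=0   # index of the current output file
while read line
do
    if [ "$line" -lt 100 ]
    then
        echo "$line"                # show the line on the terminal
        echo "$line" >> aa$k.txt    # and append it to the current file
        j=$((j + 1))
        if [ "$j" -eq 5 ]
        then
            # current file is full: move on to the next one
            j=0
            k=$((k + 1))
        fi
    fi
done < samp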

The above script will split the file for two-digit numbers.
You can complete the rest.

radoulov · March 7, 2008, 7:27am

You should use nawk as suggested in my first reply.

OK,
try this:

nawk 'BEGIN { 
n = split("a b c d e f g h i j k l m n o p q r s t u v w x y z", f)
c2 = c1 = c = 1 }
!(x[length]++ % 5) {   # true on the 1st, 6th, 11th, ... line of each length
close(fn); fn = "cin"f[c2] f[c1] f[c]".in"
if ((c1 == n) && (c == n)) 
   c2 = c2 >= n ? 1 : ++c2
c = c >= n ? 1 : ++c
if (c == 1)
  c1 = c1 >= n ? 1 : ++c1 }
{ print > fn }' file

rajee · March 7, 2008, 1:22pm

Hi radoulov
I agree you are a great script master and a specialist in nawk. Thanks for the code; it works fine. If you have time, please explain the code. Thanks for the effort and time you took.

Aajan
Thanks for your effort too. Actually, my requirement is to create file names like cinaaa, cinaab, cinaac and so on, but your code creates aa0, aa1, aa2 and so on. Still, thanks for showing a different approach. Your code might also be a little slow for huge data.

Thanks to all who took the time to view my problem and made the effort to solve it.

Great Forum!

radoulov · March 10, 2008, 6:16am

BEGIN {
n = split("a b c d e f g h i j k l m n o p q r s t u v w x y z", f)
c2 = c1 = c = 1 }

Prepare the f array (the alphabet), set c, c1 and c2 to 1, and set n to the number of elements in f.

!(x[length]++ % 5)

I think an example illustrates this expression best:

% awk '{print x[length]++%5==0?"here -->":"--------",$0}' file
here --> 22
-------- 23
-------- 25
-------- 34
-------- 37
here --> 45
-------- 67
here --> 342
-------- 456
-------- 476
-------- 543
-------- 677
here --> 789

In other words, we change the filename every time the expression x[length]++ % 5 is 0.

Now, the filename generation.

close(fn); fn = "cin"f[c2] f[c1] f[c]".in"
if ((c1 == n) && (c == n))
   c2 = c2 >= n ? 1 : ++c2
c = c >= n ? 1 : ++c
if (c == 1)
  c1 = c1 >= n ? 1 : ++c1 }

We need to close the previous file because some Awk implementations (such as Awk on Solaris) can keep only a limited number of files open at the same time.
Then we compose the filename from the f array with rotating indexes (from 1 to n, where n is the number of elements in f).
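
The rotation is easier to see with a short alphabet. The following is only a sketch: the 3-letter alphabet and the loop of 10 iterations are assumptions made for illustration, while the counter updates are copied from the code above. It assumes a POSIX awk (use nawk on Solaris).

```shell
# Print the first 10 three-letter suffixes generated by the c2/c1/c counters.
names=$(awk 'BEGIN {
    n = split("a b c", f)        # tiny alphabet, so the rollover shows up quickly
    c2 = c1 = c = 1
    for (i = 1; i <= 10; i++) {
        printf "%s%s%s%s", (i > 1 ? " " : ""), f[c2], f[c1], f[c]
        if ((c1 == n) && (c == n))       # last two positions are full: bump the first
            c2 = c2 >= n ? 1 : ++c2
        c = c >= n ? 1 : ++c             # last position always advances
        if (c == 1)                      # last position wrapped: bump the middle one
            c1 = c1 >= n ? 1 : ++c1
    }
    print ""
}')
echo "$names"
```

With this alphabet the suffixes come out as aaa aab aac aba abb abc aca acb acc baa, that is, the counters behave like an odometer.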

{ print > fn }

This simply prints into the changing filename.
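
As a quick check, the pieces above can be run together on the sample data from the first post. This is a sketch under a few assumptions: it runs in a scratch directory created with mktemp -d (available on most systems, but not guaranteed everywhere), it uses plain awk (nawk on Solaris), and the test is written as x[length]++ % 5 == 0, the same form as in the illustration above, so that it does not depend on how ! binds.

```shell
# Scratch directory, so no existing cin*.in files are touched.
dir=$(mktemp -d) && cd "$dir" || exit 1
printf '%s\n' 22 23 25 34 37 45 67 342 456 476 543 677 789 > file

awk 'BEGIN {
    n = split("a b c d e f g h i j k l m n o p q r s t u v w x y z", f)
    c2 = c1 = c = 1 }
x[length]++ % 5 == 0 {
    close(fn); fn = "cin" f[c2] f[c1] f[c] ".in"
    if ((c1 == n) && (c == n))
        c2 = c2 >= n ? 1 : ++c2
    c = c >= n ? 1 : ++c
    if (c == 1)
        c1 = c1 >= n ? 1 : ++c1 }
{ print > fn }' file

# One line per output file: its name, then its contents.
result=$(for out in cin*.in; do echo "$out:" $(cat "$out"); done)
echo "$result"
```

On the sample data this should leave four files: cinaaa.in (22 23 25 34 37), cinaab.in (45 67), cinaac.in (342 456 476 543 677) and cinaad.in (789).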

Hope this helps.

rajee · March 10, 2008, 12:56pm

Hi Rad,
Thanks for the explanation of the code. I have made a few changes just for readability; I hope I have done them correctly. I would also like to know how x[length] automatically resets to 0 every time the length of the number changes. For example, if the numbers are 23, 45, 56, 234, then x[length] is 1, 2, 3 and then 1 again. I understood the rest of the code.

nawk 'BEGIN {
    n = split("a b c d e f g h i j k l m n o p q r s t u v w x y z", f)
    c2 = c1 = c = 1
}
{
    if (x[length] % 5 == 0)
    {
        close(fn)
        fn = "cin" f[c2] f[c1] f[c] ".in"
        if ((c1 == n) && (c == n))
            c2 = c2 >= n ? 1 : ++c2

        c = c >= n ? 1 : ++c

        if (c == 1)
            c1 = c1 >= n ? 1 : ++c1
    }
    x[length] = x[length] + 1
}
{ print > fn }' cin.in

radoulov · March 10, 2008, 1:01pm

Yes,
consider the following:

$ print '1
22
22
22
333
333
4444
4444'|nawk '{print $1,x[length]++}'
1 0
22 0
22 1
22 2
333 0
333 1
4444 0
4444 1
$
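
To spell out what the output above shows: x is an associative array indexed by the line length, so x[2] and x[3] are separate counters and nothing is ever reset. A minimal sketch with the numbers from the question, assuming a POSIX awk (nawk on Solaris):

```shell
# Count lines per length, then show the two counters side by side.
counts=$(printf '%s\n' 23 45 56 234 | awk '
    { x[length]++ }                                   # one counter per line length
    END { printf "x[2]=%d x[3]=%d\n", x[2], x[3] }')
echo "$counts"
```

This prints x[2]=3 x[3]=1: the counter for two-digit lines keeps its value of 3 while the counter for three-digit lines starts from its own zero.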