Split large file based on last digit from a column

alain.kazan · May 16, 2010, 2:15am

Hello,
What's the best way to split a large file into multiple files based on the last digit in the first column?
input file:
f
2738483300000x0y03772748378831x1y13478378358383x2y23743878383802x3y33787828282820x4y43748838383881x5y5

Desired Output:
f0
3738483300000x0y03787828282820x4y4

f1
3772748378831x14y143748838383881x53y51

f2
3743878383802x28y73

f3
3478378358383x56y66

The file has about 60 million records. I'm using grep to do the splitting, but I guess there must be a faster way: grep ^3.........."$i" (where i runs from 0 to 9).

I'd appreciate your ideas.

Alain

frans · May 16, 2010, 3:36am

Can you explain what logic is used, i.e. why are there multiple x and y in files f0 and f1?
How are the records and fields separated?

alain.kazan · May 16, 2010, 3:44am

Apologies, I pasted a table.
Assume the input file has one column, as below:
fileIni
------
198760
676549
378763
376830
367389
378383

I need to split the above file into multiple files based on the last digit of each value in that column (i.e. values ending in 0 go in one file, values ending in 1 in another, and so on up to 9, so the split produces at most 10 files).

so the above will go into these files:
file0
-----
198760
376830

file3
-------
378763
378383

file9
-------
676549
367389

Thx

frans · May 16, 2010, 3:55am

while read -r L
do
    # use ONE of the lines below
    echo "$L" >> "f${L:5}"    # assuming records have length 6
    echo "$L" >> "f${L:$((${#L}-1))}" # otherwise
done < fileIni

Better:

while read -r L
do
    echo "$L" >> "f${L:(-1)}"
done < fileIni

or in 1 line

while read -r L; do echo "$L" >> "f${L:(-1)}"; done < fileIni

alain.kazan · May 16, 2010, 4:43am

Thank you.
I tried it with a file having 3 columns, expecting the split to be based on the last digit of the first column, but the split was based on the last (3rd) column instead.
Can you help, please?

Alain

frans · May 16, 2010, 4:48am

If the columns are space-separated:

while read -r L M
do
    echo "$L $M" >> "f${L:(-1)}"
done < fileIni

Otherwise, add

IFS="<separator>"

at the beginning.
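
To make the separator case concrete, here is a minimal sketch for a comma-separated file. The comma, the file name sample.csv and the part_ prefix are assumptions for illustration, not something from the thread. It sets IFS only for the read command, so the rest of the script is unaffected, and it writes the separator back out with printf.

```shell
#!/usr/bin/env bash
# Sketch: split a comma-separated file on the last digit of column 1.
# sample.csv, the comma separator and the part_ prefix are illustrative assumptions.
printf '%s\n' '198760,a,b' '676549,c,d' '378763,e,f' > sample.csv

rm -f part_[0-9]    # start clean, because >> appends to existing files

# IFS=, applies only to this read; -r keeps backslashes literal.
# "first" gets column 1, "rest" gets everything after the first comma.
while IFS=, read -r first rest
do
    # ${first: -1} is the last character of the first column
    printf '%s,%s\n' "$first" "$rest" >> "part_${first: -1}"
done < sample.csv
```

With this sample input, each of part_0, part_9 and part_3 receives one line.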

alain.kazan · May 16, 2010, 4:57am

Many Thanks!

Would you mind explaining how this partitioning works,
specifically f${L:(-1)}?

Alain

frans · May 16, 2010, 5:05am

It's substring expansion in bash:

${<string>:<start>:<length>}

If length is omitted, it extracts to the end of the string:

${<string>:<start>}

If start is negative, it counts from the end of the string, but it must be written in parentheses (or after a space), as in the given script. Otherwise bash reads :- as the default-value expansion.
NOTE: the first character has index 0.
See: Manipulating Strings
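
A short sketch of these forms, using an illustrative variable s that is not part of the original script. The last assignment shows what happens when a negative start is written without parentheses or a space.

```shell
#!/usr/bin/env bash
# Sketch of bash substring expansion; the variable name s is illustrative.
s=198760

first_two=${s:0:2}      # start 0, length 2
from_third=${s:2}       # start 2, through the end of the string
last_paren=${s:(-1)}    # negative start in parentheses: last character
last_space=${s: -1}     # negative start after a space: same result
no_space=${s:-1}        # NOT a substring: ":-" means "use 1 if s is unset or empty"

printf '%s\n' "$first_two" "$from_third" "$last_paren" "$last_space" "$no_space"
```

Here the first four values are 19, 8760, 0 and 0, while no_space is the whole string 198760 because s is set.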

rdcwayx · May 16, 2010, 8:21pm

awk -F "" '{print > ($NF ".txt")}' urfile
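
An empty field separator is not portable across awk implementations, and the one-liner above keys on the last character of the whole line. For a multi-column file where the split should follow the first column, a possible variant is sketched below. The file name sample.txt and the out_ prefix are assumptions for illustration, and whitespace-separated columns are assumed.

```shell
#!/usr/bin/env bash
# Sketch: route each line by the last digit of its FIRST whitespace-separated field.
# sample.txt and the out_ prefix are illustrative assumptions.
printf '%s\n' '198760 x0 y0' '378763 x1 y1' '376830 x2 y2' > sample.txt

awk '{
    digit = substr($1, length($1), 1)   # last character of column 1
    print > ("out_" digit ".txt")       # parentheses keep the file name unambiguous
}' sample.txt
```

awk keeps each output file open for the duration of the run, which avoids reopening a file for every line as the shell loop does.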

drl · May 17, 2010, 10:38am

Hi.

Once I process files above 1M lines, I tend to think in terms of {awk,perl}.

My awk code is about the same as that from rdcwayx, though not as clever: I used the length and substr functions.

I whipped up a small timing test framework, and created a 600K line file for comparison and easy arithmetic. These are the results:

% ./run

 Time for awk script on 600000 lines in file data1:

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0 
GNU bash 3.2.39
printf - is a shell builtin [bash]
specimen (local) 1.17
GNU Awk 3.1.5


-----
 || start [ first:middle:last ]
Edges: 5:0:5 of 600000 lines in file "data1"
298138
201666
307112
739965
187303
   ---
347236
113888
869875
958456
384340
 || end

-----
 Results:
 Processed 600000 lines.

-----
 Size of files:
  60250   60250  421750 f0
  59790   59790  418530 f1
  60266   60266  421862 f2
  59764   59764  418348 f3
  60043   60043  420301 f4
  59937   59937  419559 f5
  60029   60029  420203 f6
  59716   59716  418012 f7
  60110   60110  420770 f8
  60095   60095  420665 f9
 600000  600000 4200000 total

real	0m2.616s
user	0m1.820s
sys	0m0.312s

 Timing for shell loop on 600000 lines in file data1:

real	0m28.599s
user	0m15.465s
sys	0m9.269s

 Total number of lines in 10 files: 600000

So for a 60M file, one could estimate the time needed for a production run (on my workstation) by multiplying the times by 100.
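
As a rough worked version of that estimate, the sketch below scales the two measured "real" times by the ratio of line counts. It assumes run time grows linearly with the number of lines, which is only an approximation. The variable names are illustrative.

```shell
#!/usr/bin/env bash
# Scale the measured "real" times (seconds, for 600000 lines) to 60 million lines.
# Assumes run time grows linearly with line count.
scale=$(( 60000000 / 600000 ))   # 100

# LC_ALL=C keeps "." as the decimal point in the output.
estimate=$(LC_ALL=C awk -v s="$scale" 'BEGIN {
    printf "awk:   %.1f s (%.1f min)\n", 2.616  * s, 2.616  * s / 60
    printf "shell: %.1f s (%.1f min)\n", 28.599 * s, 28.599 * s / 60
}')
printf '%s\n' "$estimate"
```

Under that assumption, the awk script would need about 261.6 s (roughly 4.4 minutes) and the shell loop about 2859.9 s (roughly 47.7 minutes).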

The specific awk code was:

awk '
	{ digit = substr($0,length($0),1)
	  print $0 > ("f" digit)
	}
END	{ print " Processed", NR, "lines." }
' "$FILE"

and the shell loop was:

while read L
do
    echo $L >> f${L:(-1)}
done < $FILE

Best wishes ... cheers, drl