Split File Based on Line Number Pattern

Hello all.

Sorry, I know this question is similar to many others, but I just can't seem to put together exactly what I need.

My file is tab-delimited and contains approximately 1 million rows. I would like to send lines 1, 4, and 7 to a file; lines 2, 5, and 8 to a second file; lines 3, 6, and 9 to a third file; and then line 10 to a fourth file. I then want to repeat this pattern for the rest of the file, appending to the same four files. Any thoughts on the best approach?
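One way to phrase the whole 10-line cycle in a single awk pass (a sketch; the out*.txt names are placeholders, and the seq-generated sample stands in for the real million-line file):

```shell
# Placeholder input: two 10-line cycles; substitute your real file here.
seq 1 20 > My_Test.txt

# Route each line by its position within a repeating 10-line cycle:
# positions 1,4,7 -> out1.txt; 2,5,8 -> out2.txt; 3,6,9 -> out3.txt; 10 -> out4.txt
awk '{
  r = NR % 10                           # position in the cycle (0 for every 10th line)
  if (r == 0) f = "out4.txt"
  else        f = "out" ((r - 1) % 3 + 1) ".txt"
  print > f
}' My_Test.txt
```

awk keeps each output file open across the run, so the whole split is a single pass over the input.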

But I will need some awk help (or to think a little more clearly after eating lunch).

> cat big_file4
a stuff to 1 file
b stuff to 2 file
c stuff to 3 file
a stuff to 1 file
b stuff to 2 file
c stuff to 3 file
a stuff to 1 file
b stuff to 2 file
c stuff to 3 file
d stuff to 4 file
a stuff to 1 file
b stuff to 2 file
c stuff to 3 file
a stuff to 1 file
b stuff to 2 file
c stuff to 3 file
a stuff to 1 file
b stuff to 2 file
c stuff to 3 file
d stuff to 4 file

What I initially wrote does not capture the full line of text, and that is where I think I need some HELP!

> cat -n big_file4 | awk '{printf "%1s %-15s \n", substr($1,length($1),1), $2}'
1 a               
2 b               
3 c               
4 a               
5 b               
6 c               
7 a               
8 b               
9 c               
0 d               
1 a               
2 b               
3 c               
4 a               
5 b               
6 c               
7 a               
8 b               
9 c               
0 d               

Because from here, my theory is that

grep "^[147] " <infile >outfile_a
grep "^[258] " <infile >outfile_b
grep "^[369] " <infile >outfile_c
grep "^[0] " <infile >outfile_d

I may need to cut off the leading digit before writing to each output file.

Perl or Python looping over a set of file handles would seem like the most efficient approach. For a more pedestrian solution, an awk script run four times with different parameters might be acceptable even if the file is big.

Does file four only contain every tenth line, and then 11, 14, and 17 go to the first file again?

perl -MIO::File -ne 'BEGIN { map { $file[$_] = IO::File->new(">file$_") || die $!} 0..3; 
  @m = (0, 1, 2, 0, 1, 2, 0, 1, 2, 3);
}
$file[$m[$. % 9]]->print || die $!'

csplit has some fairly versatile options; you might be able to pull this off with a suitable csplit pattern as well.

Yes, 11, 14, and 17 would then go to the first file again.

I am trying to use KSH to complete this task. Below is what I have so far, but the count variable does not appear to be resetting to 1 after it reaches 11. Also, I am getting output similar to:

File_split_DC.sh[42]: 2: not found.
File_split_DC.sh[42]: 3: not found.
File_split_DC.sh[42]: 4: not found.

The name of my script is "File_split_DC.sh"

#!/usr/bin/ksh

count=1

while read line
do

case $count in
1)
echo "$line" >> RT1.txt
;;
2)
echo "$line" >> RT2.txt
;;
3)
echo "$line" >> RT3.txt
;;
4)
echo "$line" >> RT1.txt
;;
5)
echo "$line" >> RT2.txt
;;
6)
echo "$line" >> RT3.txt
;;
7)
echo "$line" >> RT1.txt
;;
8)
echo "$line" >> RT2.txt
;;
9)
echo "$line" >> RT3.txt
;;
10)
echo "$line" >> RT4.txt
;;
esac
(( count+=1 ))

if $count -gt 10; then
count=1

fi
done < My_Test.txt

exit 0

You want

if [ $count -gt 10 ]; then

It would be more efficient to open four file descriptors and then just print to those descriptors; this approximates the Perl approach I suggested above.

exec 1>rt1.txt 2>rt2.txt 3>rt3.txt 4>rt4.txt
count=1
while read line; do
  case $count in
    1|4|7) print "$line" >&1;;
    2|5|8) print "$line" >&2;;
    3|6|9) print "$line" >&3;;
    10) print "$line" >&4; count=0;;
  esac
  count=`expr $count + 1`
done <My_Test.txt

Note the use of print rather than echo -- this is ksh-specific, but other than that, this script should be portable.

> cat -n big_file4 | awk '{printf "%1s %-100s \n", substr($1,length($1),1), $0}' | cut -c1,10- | grep "^[147]" | cut -c2- >filea
> cat -n big_file4 | awk '{printf "%1s %-100s \n", substr($1,length($1),1), $0}' | cut -c1,10- | grep "^[258]" | cut -c2- >fileb
> cat -n big_file4 | awk '{printf "%1s %-100s \n", substr($1,length($1),1), $0}' | cut -c1,10- | grep "^[369]" | cut -c2- >filec
> cat -n big_file4 | awk '{printf "%1s %-100s \n", substr($1,length($1),1), $0}' | cut -c1,10- | grep "^[0]" | cut -c2- >filed
> cat big_file4
a stuff to 1 file
b stuff to 2 file
c stuff to 3 file
a stuff to 1 file
b stuff to 2 file
c stuff to 3 file
a stuff to 1 file
b stuff to 2 file
c stuff to 3 file
d stuff to 4 file
a stuff to 1 file
b stuff to 2 file
c stuff to 3 file
a stuff to 1 file
b stuff to 2 file
c stuff to 3 file
a stuff to 1 file
b stuff to 2 file
c stuff to 3 file
d stuff to 4 file

and now the four separated files

> cat filea
a stuff to 1 file                                                                             
a stuff to 1 file                                                                             
a stuff to 1 file                                                                             
a stuff to 1 file                                                                             
a stuff to 1 file                                                                             
a stuff to 1 file                                                                             
> cat fileb
b stuff to 2 file                                                                             
b stuff to 2 file                                                                             
b stuff to 2 file                                                                             
b stuff to 2 file                                                                             
b stuff to 2 file                                                                             
b stuff to 2 file                                                                             
> cat filec
c stuff to 3 file                                                                             
c stuff to 3 file                                                                             
c stuff to 3 file                                                                             
c stuff to 3 file                                                                             
c stuff to 3 file                                                                             
c stuff to 3 file                                                                             
> cat filed
d stuff to 4 file                                                                             
d stuff to 4 file                                                                             
>

Thanks to both of you for your input. I really don't know what I'm doing when it comes to UNIX, so I just try to piece tidbits together. I ended up using ERA's approach in the second posting. It was similar to what I had already put together, and it made sense. JOEYG, I'm sure your approach would work as well, and I appreciate your input.

I had started working on my own approach when I thought of trying it the way era did. I still wanted to finish mine, thinking that grabbing the line number from the cat -n output and using it to determine the next step might prove an interesting exercise.
Also, era's solution would probably run a lot faster than mine!

With AWK (if I'm not missing something):

[use nawk or /usr/xpg4/bin/awk on Solaris]

awk '!(NR%10){print>(FILENAME 4);next} 
{print>(FILENAME (++c%4?++i:i?i:++i))}
i==3{i=c=0}' filename

For best performance use mawk if available:

% repeat 1000000 print ${(l:100::x:)l=line}$((++i)) >> data

% wc data
  1000000   1000000 106888896 data

% time gawk '!(NR%10){print>(FILENAME 4);next}
{print>(FILENAME (++c%4?++i:i?i:++i))}
i==3{i=c=0}' data

gawk  data  3.28s user 0.37s system 97% cpu 3.756 total

% time mawk '!(NR%10){print>(FILENAME 4);next}
{print>(FILENAME (++c%4?++i:i?i:++i))}
i==3{i=c=0}' data

mawk  data  1.44s user 0.42s system 95% cpu 1.939 total

% time nawk '!(NR%10){print>(FILENAME 4);next}
{print>(FILENAME (++c%4?++i:i?i:++i))}
i==3{i=c=0}' data

nawk  data  8.07s user 3.61s system 93% cpu 12.516 total

Actually, this is sufficient :)

awk '!(NR%10){print>(FILENAME 4);next}
{print>(FILENAME (++i))}i==3{i=0}' filename
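For anyone finding the compact form hard to follow, here is a spelled-out sketch of the same logic (illustrated with a small seq-generated file named data standing in for the real input):

```shell
seq 1 20 > data    # stand-in for the real input file

# Every 10th line goes to data4; the rest cycle through data1..data3
# via the counter i, which wraps after each third line.
awk '
NR % 10 == 0 { print > (FILENAME "4"); next }   # 10th, 20th, ... -> data4
{ print > (FILENAME (++i)) }                    # i runs 1, 2, 3, ...
i == 3 { i = 0 }                                # ...then starts over at 1
' data
```

The output file names are the input file name with a digit appended, exactly as in the one-liner above.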

Just for completeness, I should note that the modulo arithmetic in the Perl script I posted was a major brain fart. Here's a hopefully corrected version, with an explanation.

perl -MIO::File -ne 'BEGIN {
  @n = ("one.txt", "two.txt", "three.txt", "four.txt");
  map { $file[$_] = IO::File->new(">$n[$_]") || die $!} 0..3; 
  @m = (3, 0, 1, 2, 0, 1, 2, 0, 1, 2);
}
$file[$m[$. % 10]]->print || die $!' filename

I threw in the mapping of arbitrary file names in the array @n for show.

The BEGIN block creates an array @file of four file handles (indexed 0 through 3 -- Perl arrays start at zero) and a mapping @m of which line number to print to which handle. Somewhat confusingly, the first entry in the mapping (array index zero) is for line numbers 10, 20, 30, ..., while the second is for line numbers 1, 11, 21, etc.

In the main loop (outside the BEGIN block) we simply calculate the remainder (modulo) of the line number $. divided by 10 (not 9!!) and use that as an index into @m to get the handle index, and then through another level of indexing print to the handle we are pointed to.

Also for the record, the shell version will have an issue if there is input with backslashes in it. Change read to read -r or if your shell doesn't support that, see if you have the line command instead. Also for maintainability I suppose it would be better to use higher-numbered file descriptors -- file descriptors 1 and 2 are reserved for standard output and standard error, as you probably know. (I wanted to keep them in sync to make the script easier to follow, but it sucks if you try to debug it and lose all your errors into a file someplace.)
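A quick illustration of the read vs. read -r difference (not from the thread; printf '%s\n' is used for output so the shell doesn't reinterpret the backslash on the way out):

```shell
# Without -r, read strips the backslash while reading the line:
printf 'a\\tb\n' | { read line; printf '%s\n' "$line"; }      # prints: atb

# With -r, the line arrives verbatim:
printf 'a\\tb\n' | { read -r line; printf '%s\n' "$line"; }   # prints: a\tb
```

With a trailing backslash the plain read is even worse: it treats it as a line-continuation and silently joins the next line on.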

As usual, Radoulov's solution is impressive, though a bit hard to follow. Apparently the names of the output files will be the input file name with a number suffix added.

I speculate that mawk keeps the file handles open just in case, i.e. secretly does the file handle juggling that I did explicitly in the Perl script. (Incidentally, you don't really need IO::File for that, but it makes it a lot more readable -- the stuff you have to do to manipulate bare file handles in bare Perl is arcane even by Perl standards.)

Thanks era!

As far as I know, [ngm]awk should keep the files open until the end of the program or an explicit close call (close(filename)):
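As a side note, if a split ever needs many more output files than this, an explicit close() keeps awk under the per-process open-descriptor limit. A toy sketch (the part*.txt names and the tiny seq-generated input are made up for illustration):

```shell
seq 1 8 > sample.txt    # tiny stand-in input

# Append and immediately close each output file, so at most one output
# descriptor is open at a time (slower, but scales to many files).
awk '{
  f = "part" (NR % 4) ".txt"
  print >> f
  close(f)
}' sample.txt
```

For only four files, as in this thread, leaving them open for the whole run is the faster choice.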

% strace  -eopen mawk '!(NR%10){print>(FILENAME 4);next}
{print>(FILENAME (++i))}i==3{i=0}' data
open("tls/i686/sse2/cmov/libm.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
[snip]
open("/lib/tls/i686/cmov/libc.so.6", O_RDONLY) = 3
open("data", O_RDONLY)                  = 3
open("data1", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4
open("data2", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 5
open("data3", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 6
open("data4", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 7
Process 8618 detached
% strace  -eopen gawk '!(NR%10){print>(FILENAME 4);next}
{print>(FILENAME (++i))}i==3{i=0}' data
open("tls/i686/sse2/cmov/libdl.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
[snip]
open("/usr/lib/locale/en_US.utf8/LC_TIME", O_RDONLY) = 3
open("data", O_RDONLY|O_LARGEFILE)      = 3
open("data1", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 4
open("data2", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 5
open("data3", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 6
open("data4", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 7
Process 8641 detached

Reading the strace output, I notice some differences in read/write call timings.
I'm quite sure that the output below does not show all the time-consuming events.

% strace -c mawk '!(NR%10){print>(FILENAME 4);next}
{print>(FILENAME (++i))}i==3{i=0}' data
Process 7865 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 73.48    0.003954           0     26097           write
 25.83    0.001390           0     26313           read
  0.69    0.000037           1        57        49 open
  0.00    0.000000           0        10           close
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           time
  0.00    0.000000           0         4         4 access
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         5         5 ioctl
  0.00    0.000000           0         5           munmap
  0.00    0.000000           0         3           mprotect
  0.00    0.000000           0        13           mmap2
  0.00    0.000000           0        16        15 stat64
  0.00    0.000000           0         7           fstat64
  0.00    0.000000           0         1           set_thread_area
------ ----------- ----------- --------- --------- ----------------
100.00    0.005381                 52536        73 total

% rm data[1-4]                                     
% sync;sync                                        
% strace -c gawk '!(NR%10){print>(FILENAME 4);next}
{print>(FILENAME (++i))}i==3{i=0}' data
Process 7883 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 72.16    0.004391           0     26097           write
 27.21    0.001656           0     26102           read
  0.62    0.000038           0        89        72 open
  0.00    0.000000           0        17           close
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         5         5 access
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         6         5 ioctl
  0.00    0.000000           0         6           munmap
  0.00    0.000000           0         4           mprotect
  0.00    0.000000           0         4           _llseek
  0.00    0.000000           0         3           rt_sigaction
  0.00    0.000000           0        22           mmap2
  0.00    0.000000           0        16        15 stat64
  0.00    0.000000           0        25           fstat64
  0.00    0.000000           0         2           getgroups32
  0.00    0.000000           0        13           fcntl64
  0.00    0.000000           0         1           set_thread_area
------ ----------- ----------- --------- --------- ----------------
100.00    0.006085                 52416        97 total

% rm data[1-4]                                     
% sync;sync                                        
% strace -c nawk '!(NR%10){print>(FILENAME 4);next}
{print>(FILENAME (++i))}i==3{i=0}' data
Process 7943 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 98.90    0.123052           0   1000000           write
  1.10    0.001368           0     26101           read
  0.00    0.000000           0        64        52 open
  0.00    0.000000           0        15           close
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         4         4 access
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         7           munmap
  0.00    0.000000           0         3           mprotect
  0.00    0.000000           0         1           rt_sigaction
  0.00    0.000000           0        18           mmap2
  0.00    0.000000           0        16        15 stat64
  0.00    0.000000           0        12           fstat64
  0.00    0.000000           0         1           set_thread_area
------ ----------- ----------- --------- --------- ----------------
100.00    0.124420               1026246        71 total