Split a file into multiple files

pparthji · December 30, 2009, 1:59am

Hi,

I have a file like this:

1|2|3|4|5|
1|2|8|4|6|
Trailer1|||||
1|2|3|
Trailer2|||
3|4|5|6|
3|4|5|7|
3|4|5|8|
Trailer2|||

I want to generate 3 files from this, splitting after each trailer record. The trailer record string can be different for each file, or it may be the same for some of them.

The number of files to generate varies with the number of trailer records in the input file.

Please suggest how to implement this as a shell script.

pravin27 · December 30, 2009, 2:28am

Hi,
Try this...

#!/usr/bin/perl

$i=1;
open (FH,">${i}.txt");
while (<>) {
        if (/Trailer/){
                print FH $_;
                close(FH);
                if (eof()){
                close ARGV ;
                exit;
                }
                $i = $i + 1;
                open (FH,">${i}.txt");
                }
        else {
                print FH $_ ;
             }
}

pparthji · December 30, 2009, 2:30am

Hi pravin,

I need the solution as a Unix shell script.

xoops · December 30, 2009, 3:18am

Try this:

 
#!/bin/bash
i=1
IFS=$'\n'
for line in `cat input_file`
do
  echo "$line" >> "${i}.txt"
  # grep -q prints nothing and only sets the exit status, so test it directly
  if echo "$line" | grep -q Trailer ; then i=$((i+1)) ; fi
done

Input is read from the file input_file.

ahmad.diab · December 30, 2009, 3:34am

The solution below should work in any POSIX-compliant awk.

/usr/xpg4/bin/awk  -F"|" -v n=0 '
/^Trailer[0-9]*/{ close("out"n) ; n++ ; print > "out"n ; next}
{ print >> "out"n}'  infile.txt

Scrutinizer · December 30, 2009, 3:42am

Try:

i=1
while IFS= read -r line
do
  # Quote the line so whitespace and glob characters are written unchanged
  printf '%s\n' "$line" >> "$i.out"
  case $line in
    Trailer*) i=$((i+1))
  esac
done<infile

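awk equivalent (the output name is parenthesized because some awks parse an unparenthesized concatenation after > differently):

awk '{print > (i".out")} /^Trailer/{++i}' i=1 infile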

pparthji · December 30, 2009, 4:01am

Hi xoops,

In your script, the file is read again and again, which hurts performance. Besides that, the grep command returns all the matched patterns. For example:

|1|2|3|
|T||||
|1|2|
|T||||
|1||2|3|4|5|
|T1||||

In this case, grep will always start from the first line.

---------- Post updated at 04:01 AM ---------- Previous update was at 03:50 AM ----------

Hi ahmed,

The trailer record format is different. It's not like Trailer1, Trailer2, and so on; it comes in as an input parameter.

The input parameters to the script are:

-s file name to be split (filetoBeSplit.dat)
-f split file names (filesplit1.dat, filesplit2.dat, filesplit3.dat...)
-t search pattern for trailer record (|T| |F| |Z|)

How can we specify different trailer record regexps in awk?

ahmad.diab · December 30, 2009, 4:11am

/usr/xpg4/bin/awk  -F"|" -v n=0 '
/^[TFZ]/{ close("filesplit"n".dat") ; n++ ; print > "filesplit"n".dat" ; next}
{ print >> "filesplit"n".dat"}'  filetoBeSplit.dat
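
One way to avoid hard-coding the trailer letters is to hand the pattern to awk as a variable. Below is a rough sketch only, under several assumptions that are not confirmed in the thread: the function name split_by_trailer is made up, -f is treated as a file-name prefix with numbering starting at 1, -t is treated as an extended regular expression matched against the whole line, and the four-line sample file is invented for the demonstration.

```shell
#!/bin/sh
# Hypothetical wrapper: option letters follow the post above; the function
# name and the meaning given to -f and -t are assumptions.
split_by_trailer() {
    OPTIND=1
    src= prefix= pat=
    while getopts 's:f:t:' opt; do
        case $opt in
            s) src=$OPTARG ;;      # file to be split
            f) prefix=$OPTARG ;;   # prefix of the generated file names
            t) pat=$OPTARG ;;      # regular expression identifying a trailer
            *) return 2 ;;
        esac
    done
    [ -n "$src" ] && [ -n "$prefix" ] && [ -n "$pat" ] || return 2

    # The pattern travels through the environment: unlike -v or var=value
    # operands, ENVIRON values are not subject to escape processing.
    TRAILER_RE=$pat awk -v prefix="$prefix" '
        BEGIN { n = 1; re = ENVIRON["TRAILER_RE"] }
        {
            out = prefix n ".dat"
            print > out            # trailer lines stay in the current file
        }
        $0 ~ re { close(out); n++ }
    ' "$src"
}

# Invented sample data, only for the demonstration.
printf '%s\n' '|1|2|' '|T|one|' '|3|' '|Z|four|' > filetoBeSplit.dat
split_by_trailer -s filetoBeSplit.dat -f filesplit -t '^[|][TFZ][|]'
```

Writing the separator as [|] inside the pattern sidesteps the question of how a backslash before | is handled on its way into awk.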

summer_cherry · December 30, 2009, 4:18am

perl:

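use strict;
use warnings;

my $n = 1;
my $fh;
while (<DATA>) {
    # Open the next output file lazily, so no empty file is left behind
    # after the final trailer.
    unless ($fh) {
        open $fh, '>', "file_$n.txt" or die "file_$n.txt: $!";
    }
    print $fh $_;          # the trailer line stays with its own records
    if (/^Trailer/) {
        close $fh;
        undef $fh;
        $n++;
    }
}
close $fh if $fh;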
__DATA__
1|2|3|4|5|
1|2|8|4|6|
Trailer1|||||
1|2|3|
Trailer2|||
3|4|5|6|
3|4|5|7|
3|4|5|8|
Trailer2|||

Scrutinizer · December 30, 2009, 4:53am

Try:

pat=Trailer
i=1
while IFS= read -r line
do
  # Quote the line so whitespace and glob characters are written unchanged
  printf '%s\n' "$line" >> "$i.out"
  case $line in
    ${pat}*\|) i=$((i+1))
  esac
done<infile

awk equivalent:

awk '{print > (i".out")} $0~pat{++i}' pat="Trailer" i=1 infile

pparthji · December 30, 2009, 5:18am

Hi ahmed,
I tried the following command:

awk  -F"|" -v n=0 '/^[TFZ]/{ close("filesplit"n".dat") ; n++ ; print > "filesplit"n".dat" ; next} { print >> "filesplit"n".dat"}'  test1.dat

on this file:

|1|2|3|4|5|
|1|2|3|4|4|
|1|2|3|4|3|
|T|one||||
|1|2|3|4|5|6|7|8|9|
|2|3|4|5|6|7|8|9|1|
|D|three|||||
|4|
|5|
|6|
|Z|four||||

But it's generating a single file, filesplit0.dat, containing all the data.

ahmad.diab · December 30, 2009, 5:28am

Modify the code as below:

/usr/xpg4/bin/awk  -F"|" -v n=0 '
($2 ~/^[TFZ]/){ close("filesplit"n".dat") ; n++ ; print > "filesplit"n".dat" ; next}
{ print >> "filesplit"n".dat"}'  filetoBeSplit.dat

This is because the first field is now empty ("") after putting "|" at the beginning of each line.
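
To see why the match has to move from the start of the line to $2, here is a quick check of how awk splits a line that begins with the separator. The sample line is taken from the file above; the variable name result is only for illustration.

```shell
# With -F'|' a leading "|" yields an empty first field,
# so the record-type letter ends up in $2.
result=$(printf '%s\n' '|T|one||||' |
  awk -F'|' '{ printf "NF=%d first=[%s] second=[%s]\n", NF, $1, $2 }')
echo "$result"
```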

BR

pparthji · December 30, 2009, 5:35am

Hi,

It's not generating the files properly:

/testDir> cat filesplit0.dat
|1|2|3|4|5|
|1|2|3|4|4|
|1|2|3|4|3|
/testDir> cat filesplit1.dat
|T|one||||
|1|2|3|4|5|6|7|8|9|
|2|3|4|5|6|7|8|9|1|
|D|three|||||
|4|
|5|
|6|
/testDir> cat filesplit2.dat
|Z|four||||

ahmad.diab · December 30, 2009, 5:39am

Is this what you want or not?

pparthji · December 30, 2009, 5:41am

No, it's not splitting the file on the basis of the trailer record regular expression.
I need a split like this:

filetoBeSplit.dat

|1|2|3|4|5|
|1|2|3|4|4|
|1|2|3|4|3|
|T|one||||
|1|2|3|4|5|6|7|8|9|
|2|3|4|5|6|7|8|9|1|
|D|three|||||
|4|
|5|
|6|
|Z|four||||

After split:

cat filesplit0.dat
 
|1|2|3|4|5|
|1|2|3|4|4|
|1|2|3|4|3|
|T|one||||
 
cat filesplit1.dat

|1|2|3|4|5|6|7|8|9|
|2|3|4|5|6|7|8|9|1|
|D|three|||||

cat filesplit2.dat
 
|4|
|5|
|6|
|Z|four||||

Scrutinizer · December 30, 2009, 6:08am

Or simply:

while change specification; do
  generate new awk
  generate alternative new awk
done

output:

awk '{print > (i".out")} $0~pat{++i}' pat='^[|][DTZF]' i=1 infile

awk -F '|' '{print > (i".out")} $2~/^[DTZF]/{++i}' i=1 infile

ahmad.diab · December 30, 2009, 6:19am

OK, make the modification below; it just rearranges the order of the commands:

/usr/xpg4/bin/awk  -F"|" -v n=0 '
($2 ~/^[TFZ]/){
print >> "filesplit"n".dat"
close("filesplit"n".dat")
n++
next
}
{ print >> "filesplit"n".dat"}'  filetoBeSplit.dat

pparthji · December 30, 2009, 6:21am

Hi,

I am not able to understand this part:
change specification; do
generate new awk
generate alternative new awk

Can't we do this in a single awk command without using a loop?

ahmad.diab · December 30, 2009, 6:24am

Is it right now, with the correct output?

pparthji · December 30, 2009, 6:33am

No,

it's not generating the correct output:

/testDir> cat filesplit0.dat
|1|2|3|4|5|
|1|2|3|4|4|
|1|2|3|4|3|
/testDir> cat filesplit1.dat
|1|2|3|4|5|6|7|8|9|
|2|3|4|5|6|7|8|9|1|
|D|three|||||
|4|
|5|
|6|
/testDir> cat filesplit1.dat0
|Z|four||||
/testDir> cat filesplit0.dat0
|T|one||||

This is wrong. There should be only three files, each with the correct output.
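
For reference, a minimal sketch of a single-pass awk split that writes every line, trailer included, to the current file and only then moves on to the next one. It assumes trailer records carry T, D, Z or F in the second field, as in the sample above; the sample file is recreated here purely for illustration. The output name is built in a variable first, so the redirection target is never an unparenthesized concatenation.

```shell
#!/bin/sh
# Recreate the sample input from the thread (illustrative data only).
cat > filetoBeSplit.dat <<'EOF'
|1|2|3|4|5|
|1|2|3|4|4|
|1|2|3|4|3|
|T|one||||
|1|2|3|4|5|6|7|8|9|
|2|3|4|5|6|7|8|9|1|
|D|three|||||
|4|
|5|
|6|
|Z|four||||
EOF

awk -F'|' -v n=0 '
{
    out = "filesplit" n ".dat"   # build the name once, as a plain variable
    print > out                  # every line goes to the current file
}
$2 ~ /^[TDZF]/ {                 # $1 is empty because each line starts with "|"
    close(out)                   # finish this file ...
    n++                          # ... and switch to the next one
}' filetoBeSplit.dat
```

Because the trailer is printed before the counter is incremented, each output file ends with its own trailer record. On Solaris, the same program would presumably be run with /usr/xpg4/bin/awk, as earlier in the thread.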