Split File based on number of rows

Hi

I have a requirement where I will receive multiple files in a folder (say /fol1/fol2/). There will be at least 14 to 16 files. The sizes of the files will differ; some may be 80 GB or 90 GB, some may be less than 5 GB (the sizes are very unpredictable). But the names of the files will have a particular format, like "Table1_Insert.dat", Table1_Update.dat, Table1_delete.dat, Table2_ins.dat, Table2_upd.dat, Table2_del.dat... like this...

I have to read one file at a time and check its size (in GB). If the file size is greater than 90 GB (it won't be more than 100 GB), then split the file into 5 GB pieces. So if the file size is 90 GB, it should split the source file into 18 sub-files (like TT_table1_ins.dataa, TT_Table1_ins.datab, TT_Table1_ins.datac... etc.)

I want my script to take only one input argument - just the file name (with the path).

I know we can do this using the split -l command, but I need some help. Can somebody help me with a script? I'm very new to shell scripting. I can understand the commands but cannot write a script... :frowning:

Thanks

Is this a homework assignment?

No, this is not homework... maybe I gave too much explanation... that's why it seems like homework... but this is my work...

Why do you need a script? What stops you from using a one-line command like

[ $(stat -c"%s" Table1_ins.dat) -gt 90000000000 ] && split -b5000000000 -a1 --verbose Table1_ins.dat TT_table1_ins.dat

This splits by bytes and so could cut lines in half; use -l with an average line length to keep entire lines.

You did not give too much information! (I've never seen anyone make that mistake in these forums!)

If the 1-liner RudiC gave you works, you can put that into a script.

If it doesn't, show us the diagnostic messages it prints and tell us what operating system and shell you're using. (The stat utility and the split --verbose option are not available on all operating systems.)

Some split implementations (GNU, for example) support --line-bytes=SIZE to keep lines whole, e.g.:

cd /fol1/fol2
for file in *.dat
do
    if [ $(stat -c"%s" "$file") -gt 90000000000 ]
    then
        split --line-bytes=5000000000 -a1 --verbose "$file" "TT_$file"
    fi
done

Hi
Thanks for the replies. I apologize for the delay.

I tried the following, and it didn't work.
I tried:

stat -c"%s" FileName.txt

I got:

ksh: stat:  not found

I tried:

stat -f FileName.txt

I got:

ksh: stat:  not found

So here's what I tried (in DataStage, on AIX). I created a DataStage job that reads the file and uses some simple UNIX commands to do the following.

  1. Check if the directory exists; if yes, remove all the files inside the directory and then the directory itself.

  2. Then check the file size:

ls -lSr /inputpath/Filename.txt | tail -n1 | awk '{$5=sprintf("%.9f GB", $5/1024^3)} 1' | cut -d' ' -f5 | awk '{printf "%.0f\n", $1}'

This gives me the file size in GB; if the file is small, it reports 0.

(But now I found a new way to get the size: ls -l filename.txt | awk '{print $5}')

  3. Calculate the split file count (the number of files to split into): here I take the size from above and work out how many pieces to make. For example, if the file size is 100 GB, then I want to split it into 20 pieces (5 GB each).

  4. Get the row count, using wc -l, and then parse out only the numeric part.

  5. Then I calculate the rows per split file. (I do it this way because if I split the file just by size, the last row may get cut in half.) Formula used: total row count divided by the number of split files. For example, a file with 1000 rows and a size of 100 GB (which means I want 20 files) gives me 1000 / 20 = 50, so this will create 20 files with 50 rows each.

  6. Then round the result from above up to a whole number.

  7. Then I use the split command (a simple script). It takes the file name and the row count per split file as input arguments. The script first checks for correct input arguments, then creates a directory where I can place the split files, and then runs the split command as

split -l rowcount file_name /destination/directory/forsplitfile/data_

This will split my actual input file and store the pieces in the folder "forsplitfile" with the prefix "data_".

  8. Then I check whether the sum of the split file sizes equals the original file size; if yes, continue, if not, abort.

  9. Then I use my load job to read each split file and load them one by one. Once all the files are done, the job completes.

  10. If the job fails in between (say at split_file_5), then when I restart my job, it picks up from where it failed (split_file_5).

I know this is too much, but can somebody help me put this in a script? I can do this in DataStage with some UNIX commands, but I don't think what I'm doing is a stable solution. So can somebody please help me?

Thanks
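For what it's worth, the size check, row count, rows-per-file arithmetic, and split steps above can be sketched as one small ksh function using only wc and split (no stat, which is missing here, and no GNU-only options). The /tmp demo paths are illustrative; the trigger and piece sizes in the example call are the thread's 90 GB and 5 GB figures:

```shell
#!/bin/ksh
# Sketch of the size-check / row-count / rows-per-file / split steps above.
# Only wc and split are used, so it should run under ksh on AIX.
split_by_rows() {
    file=$1 trigger=$2 piece=$3
    outdir=$(dirname "$file")/splitdir
    bytes=$(( $(wc -c < "$file") ))              # file size in bytes
    [ "$bytes" -le "$trigger" ] && return 0      # below the threshold: nothing to do
    pieces=$(( (bytes + piece - 1) / piece ))    # number of pieces, rounded up
    lines=$(( $(wc -l < "$file") ))              # total row count
    perfile=$(( (lines + pieces - 1) / pieces )) # rows per piece, rounded up
    mkdir -p "$outdir"
    split -l "$perfile" "$file" "$outdir/TT_$(basename "$file")"
}

# Real call would be:
# split_by_rows /fol1/fol2/Table1_Insert.dat 90000000000 5000000000
```

Rounding the piece count and the rows-per-piece up means every piece lands at or under the target size rather than over it.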

---------- Post updated at 03:33 PM ---------- Previous update was at 03:30 PM ----------

Also, how do I add a checkpoint to the shell script, so that when I restart it, it can start from where it failed?
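One common way to get restart behaviour in plain shell is a checkpoint file that records each piece only after it loads successfully; on the next run, anything already recorded is skipped. A minimal sketch (load_one_piece is a stand-in for the real DataStage load, and the /tmp paths are illustrative):

```shell
#!/bin/ksh
# Restartable-load sketch: record each completed piece in a checkpoint file,
# skip recorded pieces on the next run. Paths and the load step are illustrative.
load_one_piece() { echo "loading $1"; }              # stand-in for the real load

dir=/tmp/ckptdemo; ckpt=$dir/load.ckpt
mkdir -p "$dir"
: > "$dir/TT_demo.dataa"; : > "$dir/TT_demo.datab"   # sample pieces for the demo
touch "$ckpt"
for piece in "$dir"/TT_demo.data*; do
    grep -q "^$piece\$" "$ckpt" && continue          # already done on a previous run
    load_one_piece "$piece" || exit 1                # fail here; rerun resumes at this piece
    echo "$piece" >> "$ckpt"                         # checkpoint only after success
done
rm -f "$ckpt"                                        # every piece loaded: reset for next batch
```

The key detail is writing the checkpoint line only after the load succeeds, so a crash mid-piece leaves that piece unrecorded and it gets retried.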

On AIX you could try something like the script below.

cd /fol1/fol2 
for file in *.dat
do
   SIZE=$(istat "$file"| awk '/Length/ {print $(NF-1)}')
   if [ $SIZE -gt 90000000000 ]
   then
       LINES=$(wc -l < "$file")
       let AVG=SIZE/LINES
       let SPLIT=5000000000/AVG
       split -l $SPLIT -a1 --verbose "$file" "TT_$file"
   fi
done

Note this reads the file twice (once to count the lines and once to split it). Also, using the average line size is quite inaccurate, and shell integer arithmetic could make it even more hit-and-miss. You may be better off writing a tailored split command in awk.
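One way to soften the integer-arithmetic concern is to round the average up (ceiling division), so the per-piece line count errs below the 5 GB target rather than above it. A minimal sketch; SIZE and LINES are made-up values:

```shell
# Ceiling division keeps AVG from being underestimated, so SPLIT lines per
# piece land at or under the 5 GB target. SIZE and LINES are made-up values.
SIZE=9000000123 LINES=4500000
AVG=$(( (SIZE + LINES - 1) / LINES ))    # ceiling of SIZE/LINES
SPLIT=$(( 5000000000 / AVG ))            # rows per piece
echo "AVG=$AVG SPLIT=$SPLIT"             # prints AVG=2001 SPLIT=2498750
```

Plain floor division would give AVG=2000 here, and pieces built from 5000000000/2000 lines could overshoot the target.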


Hi Chubler_XL

Thanks, I will try this now and will let you know how it works. I will put this in a script that I can call from my DataStage process, and then I will post the results here.

In the meantime, if others have any other suggestions, please keep posting; I will try everything. This is really great.

Thanks again for everyone's help.

---------- Post updated 11-23-14 at 02:06 AM ---------- Previous update was 11-22-14 at 07:38 PM ----------

Hi, this is what I have done.
I'm using a 3 GB file to test my process, but the script hangs after the line echo "Checking ${2} file size now:".

Not sure what to do, please correct me if I've done something wrong.
Here is the script:

 
#!/bin/bash
# usage:

# check for input:
if [ ! $# == 3 ]; then
  echo "Input Parameter missing."
fi
#Main Logic Begins:
clear
#echo Input Parameters:
echo "**********************************************************************************************"
echo "Main Source file is located in: $1 \n"
echo "Currently processing file: $2 \n"
echo "All the split files will be located at: $3 \n"
echo "**********************************************************************************************"
#Check if Split file directory exists:
if [ -d "${3}" ];
then
 
 echo "Split file directory Exist, So deleting Directory and its contents \n"
 rm -rf ${3};
else
 echo "No Split file directory present \n";
fi
 
# Create New directory to place split files
echo
echo "Create New directory to place split files. \n"
mkdir ${3}
chmod 777 ${3}
if [ -d "${3}" ];
then
 echo "Split file directory created successfully \n"
 echo "Split file directory Permission set to 777 \n"
else
 echo "Split File Directory creation failed \n";
fi
# Check input file size:
echo "Checking ${2} file size now:"
for ifile in ${2}
do
 ipsize=$(istat "$ifile" | awk '/Length/ {print $(NF-1)}')
 echo "Total file size in Bytes: $ipsize \n"
 if [ $ipsize -gt 1000000000 ]
 then
  lines=$(wc -l < "$iflie")
  let avg=ipsize/lines
  let splitcount=5000000000/avg
  split -l $splitcount -a1 -verbose "$ifile" "${3}/TT_$2"
 fi
done

echo "Total Row Count in ${2}: $lines \n"
echo "Average Row length in ${2}: $avg \n"
echo "Row count per split file is: $splitcount \n"
echo "Total split files and row counts \n"
wc -l ${3}/TT_$2*
 

---------- Post updated at 02:43 AM ---------- Previous update was at 02:06 AM ----------

Hi

I made some changes to the script, since the split command didn't work properly; now it's working fine:

 
#!/bin/bash
# usage:
# sh ./[script] [source dir] [input file name] [split file dir]
# check for input:
if [ ! $# == 3 ]; then
  echo "Input Parameter missing."
fi
#Main Logic Begins:
clear
#echo Input Parameters:
echo "**********************************************************************************************"
echo "Main Source file is located in: $1 \n"
echo "Currently processing file: $2 \n"
echo "All the split files will be located at: $3 \n"
echo "**********************************************************************************************"
#Check if Split file directory exists:
if [ -d "${3}" ];
then
 
 echo "Split file directory Exist, So deleting Directory and its contents \n"
 rm -rf ${3};
else
 echo "No Split file directory present \n";
fi
 
# Create New directory to place split files
echo
echo "Create New directory to place split files. \n"
mkdir ${3}
chmod 777 ${3}
if [ -d "${3}" ];
then
 echo "Split file directory created successfully \n"
 echo "Split file directory Permission set to 777 \n"
else
 echo "Split File Directory creation failed \n";
fi
# Check input file size:
echo "Checking ${2} file size now:"
for ifile in ${2}
do
 ipsize=$(istat "$ifile" | awk '/Length/ {print $(NF-1)}')
 echo "Total file size in Bytes: $ipsize \n"
 
 if [ $ipsize -gt 1000000000 ]
 then
 
   lines=$(wc -l < "$ifile")
   echo "Total Row Count in ${2}: $lines \n"
 
  let avg=`expr ${ipsize} / ${lines}`
   echo "Average Row length in ${2}: $avg \n"  
 
  let splitcount=1000000000/avg
   echo "Row count per split file is: $splitcount \n"  
 
  split -l $splitcount "$ifile" "${3}/TT_$2"
  #-a1 --verbose 
 echo "Total split files and row counts \n"
 wc -l ${3}/TT_$2*

 fi
done
 
 

and then I get the following results:

**********************************************************************************************
Main Source file is located in: /some/dir/path
Currently processing file: inputfile.dat
All the split files will be located at: /some/dir/path/splitdir
**********************************************************************************************
Split file directory Exist, So deleting Directory and its contents

Create New directory to place split files.
Split file directory created successfully
Split file directory Permission set to 777
Checking inputfile.dat file size now:
Total file size in Bytes: 3329056768
Total Row Count in inputfile.dat:  2684723
Average Row length in inputfile.dat: 1240
Row count per split file is: 806451
Total split files and row counts
  806451 /some/dir/path/splitdir/TT_inputfile.dataa
  806451 /some/dir/path/splitdir/TT_inputfile.datab
  806451 /some/dir/path/splitdir/TT_inputfile.datac
  265370 /some/dir/path/splitdir/TT_inputfile.datad
 2684723 total

Can somebody help me add some additional features: log all the messages/steps; if the file size is less than 1 GB, send a note that the file size is less than 1 GB and exit; and whenever this script fails, capture all the steps that were executed and send them in an email.

thanks
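Not a complete answer, but the logging and the 1 GB floor can be sketched like this: funnel each step's message into a log file, and when something fails, the accumulated log is exactly what you would mail. The log path, demo file, and the commented-out mailx address are placeholders:

```shell
#!/bin/ksh
# Logging sketch: every step's message goes to $LOG; on failure the log is the
# email body. Log path, demo file, and mailx address are placeholders.
LOG=/tmp/splitjob.log

check_and_split() {
    file=$1
    bytes=$(( $(wc -c < "$file") ))
    if [ "$bytes" -lt 1073741824 ]; then
        echo "$file is smaller than 1 GB; nothing to split"
        return 1                         # caller decides whether to notify
    fi
    echo "splitting $file ..."           # real split steps would log here too
}

demo=/tmp/logdemo.dat
printf 'one line\n' > "$demo"            # tiny demo file, well under 1 GB
: > "$LOG"
check_and_split "$demo" >> "$LOG" 2>&1 || {
    echo "job stopped; see $LOG"
    # mailx -s "split job stopped" you@example.com < "$LOG"
}
```

Appending ">> $LOG 2>&1" to each step keeps both normal messages and error output in one place for the email.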

Depending on which shell you use and what commands are installed on your system, and with positional parameter 1 holding the file name and 2 the target work file size, you could use, for the items in post #7:

 2. - 5.: read LN CH << EOF
          $(wc -lc < $1)
          EOF
 6. - 7.: split -l$(($2 *LN / CH)) -d $1 workfile
 8.:      echo $(($(stat -c"%s+" workfile*) 0))
10.:      rename the workfiles when finished with them so the next iteration will pick up from where you left off.
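The read ... << EOF idiom above captures both numbers from a single pass of wc; as a runnable sketch (the demo file is illustrative):

```shell
#!/bin/ksh
# One wc invocation yields both the line count and the byte count, which the
# here-document feeds into a single read. The demo file is illustrative.
f=/tmp/wcdemo.dat
printf 'alpha\nbeta\n' > "$f"    # 2 lines, 11 bytes
read LN CH << EOF
$(wc -lc < "$f")
EOF
echo "lines=$LN bytes=$CH"       # prints lines=2 bytes=11
```

This avoids reading a 90 GB file once for wc -l and again for the size.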

Here is an awk script to split a file into pieces < MAX bytes without splitting lines:

awk \
   -v MAX=1000000000 \
   -v oname="${3}/TT_$2" \
   -v digits=2 \
'
function suffix(val,ret) {
 for(i=digits; i; i--) {
     ret = sprintf("%c", 97 + (val%26)) ret
     val=int(val/26)
 }
 return ret
}
BEGIN {sz=MAX}
{
  line_len=length()+1
  if (sz + line_len > MAX) {
     if(outfile != "") {
         close(outfile)
         print "  " row " " sz " " outfile
     }
     outfile=(oname?oname:FILENAME) suffix(Fnum++)
     row=sz=0
  }
  row++
  sz+=line_len
  fs+=line_len
  print > outfile
}
END {
 print "  " row " " sz " " outfile
 print "  " NR " " fs " total"
}' "$ifile"