Split file using awk

rosebud123 · August 25, 2018, 10:00pm

I need to split the incoming source file into multiple files using awk.

The split key is at positions 6-13 (8 characters).

All the records that are greater than 20170101 and less than or equal to 20181231 should go in a split file with file name as source filename_greaterthan_20170101_lessthan_20181231 + yyyymmddhhmmss

All records that are less than 20170101 should go in a file with file name as source filename_lessthan_20170101 + yyyymmddhhmmss

All records that are greater than 20190101 should go in a file with file name as source filename_20190101 + yyyymmddhhmmss

Additionally, instead of hard-coding the condition in the script/command, can we pass it as a variable to the script, so the script remains dynamic?

Source File:

001  20991231
002  20190101
003  20231231
004  20231231
005  20261231
006  20271231
007  20281231
008  20301231
009  20161231
010  20161230
011  20161010
012  19880101
013  20000101
014  20110121
015  20130121
016  20170121
017  19870121

Scrutinizer · August 26, 2018, 1:46am

Try and adjust something like:

awk '{y=substr($2,1,4); f=b} y<lt{f=a} y>gt{f=c} {print>f} ' lt=2017 gt=2018 a=y1 b=y2 c=y3 infile

This should split the input file into the files y1, y2 and y3.

rosebud123 · August 26, 2018, 11:25am

Thanks..

Can you please explain the code in a few lines?

Scrutinizer · August 26, 2018, 2:26pm

Sure:

awk '
  {
    y=substr($2,1,4)                              # set the variable y to the first 4 characters of
                                                  # the second field of the input record
    f=b                                           # set the output file to the name in variable b
  }
  y<lt {                                          # if the year is less than the min threshold
    f=a                                           # set the variable f to the name in variable a
  }
  y>gt {                                          # if the year is more than the max threshold
    f=c                                           # set the variable f to the name in variable c
  }
  {
    print>f                                       # print the line to the appropriate file
  }
' lt=2017 gt=2018 a=y1 b=y2 c=y3 infile           # set variables lt, gt, a, b, and c and specify the input file name
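
If the split should follow the full 8-digit dates from the first post rather than just the year, something along these lines might work. This is only a sketch: the variable names lo and hi are made up for illustration, the date is assumed to sit at character positions 6-13 of every record, and the four sample records are copied from the posted source file.

```shell
# Sample records in the layout from the first post.
printf '%s\n' '001  20991231' '009  20161231' '016  20170121' '002  20190101' > infile

awk '
  {
    d = substr($0, 6, 8) + 0      # 8-character date at positions 6-13; +0 forces a numeric compare
    f = b                         # default: date lies between lo and hi (inclusive)
  }
  d < lo { f = a }                # date is before the lower bound
  d > hi { f = c }                # date is after the upper bound
  { print > f }                   # write the record to the selected file
' lo=20170101 hi=20181231 a=y1 b=y2 c=y3 infile
```

With these bounds a record dated exactly 20170101 lands in y2; change d < lo to d <= lo if it should go to y1 instead.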

rosebud123 · August 27, 2018, 10:24am

Thanks.

Is there a way to not hard-code 2017 and 2018, but rather pass them as parameters?

vgersh99 · August 27, 2018, 10:39am

They are being passed as parameters.

RudiC · August 27, 2018, 10:56am

I think s/he means shell variables:

awk '...' lt="${PAR1}" gt="${PAR2}"
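
A slightly fuller sketch of the same idea, with PAR1 and PAR2 set by hand here just for illustration (in a real script they would come from $1, $2 and so on). It also shows the -v form, which differs in one respect: assignments given as operands take effect only when awk reaches them in the argument list, so they are not yet set in a BEGIN block, whereas -v assignments are.

```shell
PAR1=2017
PAR2=2018
printf '%s\n' '009  20161231' '016  20170121' '002  20190101' > infile

# Operand form, as above: the variables are set just before infile is read.
awk '{y=substr($2,1,4); f=b} y<lt{f=a} y>gt{f=c} {print>f}' \
    lt="${PAR1}" gt="${PAR2}" a=y1 b=y2 c=y3 infile

# -v form: the same values are already visible inside BEGIN.
awk -v lt="${PAR1}" -v gt="${PAR2}" 'BEGIN { print "bounds:", lt, gt }'
```

Either form keeps the shell variables outside the single-quoted awk program, so no quoting tricks are needed inside the program text.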

rosebud123 · August 27, 2018, 12:27pm

Correct. How do I run the awk command inside a shell script and pass it parameters?

RudiC · August 27, 2018, 1:39pm

Should be clear now, no?

rosebud123 · August 29, 2018, 12:02am

Thanks.

I adjusted the script as below

#!/usr/bin/ksh
PAR1=$1
PAR2=$2
PAR3=$3

awk '{y=substr($2,1,4); f=b} y<lt{f=a} y>gt{f=c} {print>f} ' lt="${PAR1}" gt="${PAR2}" a=y1 b=y2 c=y3 "${PAR3}"

I need to make a few more adjustments.

How can I add part of the input file name to the resulting file names y1, y2, y3?

Example : Input FileName = abc_123~xyz

Desired Output FileName = abc_y1_123~xyz

At this point I do not know whether it will be column 2 that I have to check. Instead of referring to column 2, can I just go by position in the substr?

Example :

awk '{y=substr($0,1,4); f=b} y<lt{f=a} y>gt{f=c} {print>f} ' lt="${PAR1}" gt="${PAR2}" a=y1 b=y2 c=y3 $PAR3

Please advise on the above...Thanks

RudiC · August 29, 2018, 3:47am

a) Appending or prefixing the actual file name to the output file name would be way easier than inserting into the yn string. Try f=a FILENAME etc. If you are not happy with this, construct the f variable with a few substr() calls.

b) feel free to adjust the selection criteria to whatever you desire, but note that your above idea would not yield identical results, as $2 starts at char position 6 in your sample.
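
Both points could be combined roughly as follows. This is a sketch under some assumptions: the helper function outname is invented for illustration, the tag is inserted after the first underscore of FILENAME (which matches the abc_123~xyz example only if the file is given without a directory path), and the year is taken by position 6-9 instead of from $2.

```shell
printf '%s\n' '009  20161231' '016  20170121' '002  20190101' > 'abc_123~xyz'

awk '
  function outname(tag,    p) {                 # p is a local variable
    p = index(FILENAME, "_")                    # position of the first underscore
    if (p == 0) return tag "_" FILENAME         # no underscore: just prefix the tag
    return substr(FILENAME, 1, p) tag "_" substr(FILENAME, p + 1)
  }
  {
    y = substr($0, 6, 4)                        # year by character position, not by field
    f = outname(b)
  }
  y < lt { f = outname(a) }
  y > gt { f = outname(c) }
  { print > f }
' lt=2017 gt=2018 a=y1 b=y2 c=y3 'abc_123~xyz'
```

For the input name abc_123~xyz this writes to abc_y1_123~xyz, abc_y2_123~xyz and abc_y3_123~xyz.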

rosebud123 · September 1, 2018, 5:24pm

Thanks