Splitting the file using awk

sagar.cumar · December 10, 2012, 1:04am

Hi,

I have a requirement in which I am going to receive one file and should be splitted to 9 different files based on one distinguisher called TYPE.

I heard that this can be done using awk or sed.

Can any one advise regardint the logic and simpler way other than using awk or sed is also fine.

The volume of data expecting is 5GB daily file.

Thank you in advance.

Regards,
Sagar.

pravin27 · December 10, 2012, 1:09am

cat file1
Type    value
A       1
A       2
C       3
D       5
A       9
D       8
Z       5

 awk 'NR>1{print $0 >$1".txt"}'  file1

sagar.cumar · December 10, 2012, 1:15am

Hi Pravin,

Thanks for quick response. What is NR in awk command. Could you please bit more clear on the command.

Also, is there any performance issue as daily we are going to get 5GB of data.

Thank you in advance.

Regards,
Sagar

bmk · December 10, 2012, 2:16am

Hi, if you use SPLIT command ,it will be easy.

Vikram_Tanwar12 · December 10, 2012, 5:01am

Awk NR

gives you the total number of records being processed or line number.

You can check this link to understand it more clearly

8 Powerful Awk Built-in Variables � FS, OFS, RS, ORS, NR, NF, FILENAME, FNR (link removed)

sagar.cumar · December 10, 2012, 5:20am

Hi bmk,

Any examples please for split command.

Hi Vikram,

Thanks for the link and it is very much useful. But, the said variables at field level.

Regards,
Sagar.

bmk · December 10, 2012, 6:36am

Try like...

 ----Split the file into 2 files
maxline=17 ---File count/2
split -l $maxline -d test1.txt L
for i in L??
do
echo "1"
mv "$i" "$i"test
done

sagar.cumar · December 10, 2012, 6:28pm

Hi All,
To make things clear, below is my actual requirement. Sorry for giving more details lately.
Could you please help me in getting this quickly.

Thanks a lot in advance.

Requirement Details:-
Input File Details
File Name ---> INFILE.TXT
File Content --->

V|xxxxxxxxxxxxx|01/12/2012|0411111111
V|yyyyyyyyyyyy|01/12/2012|0422222222
V|zzzzzzzzzzz|01/12/2012|0388888888
V|aaaaaaaaaaaa|01/12/2012|0388888888
M|xxxxxxxxxxxxx|01/12/2012|0411111111|20
M|yyyyyyyyyyyy|01/12/2012|0422222222|25.5
M|zzzzzzzzzzz|01/12/2012|0388888888|30.2
M|aaaaaaaaaaaa|01/12/2012|0388888888|20.2
M|bbbbbbbbbb|01/12/2012|0299999999|50.5
D|xxxxxxxxxxxxx|0411111111
D|yyyyyyyyyyyy|0422222222
D|zzzzzzzzzzz|0388888888
D|aaaaaaaaaaaa|0388888888
D|bbbbbbbbbb|0299999999
D|yasdfasdfasdfasd|0299999999
O|xxxxxxxxxxxxx|0411111111|canada
O|yyyyyyyyyyyy|0422222222|canada
O|zzzzzzzzzzz|0388888888|USA
O|aaaaaaaaaaaa|0388888888|UK
O|bbbbbbbbbb|0299999999|Behrain

Requirement is to split this input file based on first character as all Vs should go one out file, all Ms should go to second out file and so on.
The output files should like below:
OUTFILE_V.TXT:-

V|xxxxxxxxxxxxx|01/12/2012|0411111111
V|yyyyyyyyyyyy|01/12/2012|0422222222
V|zzzzzzzzzzz|01/12/2012|0388888888
V|aaaaaaaaaaaa|01/12/2012|0388888888

OUTFILE_M.TXT:-

M|xxxxxxxxxxxxx|01/12/2012|0411111111|20
M|yyyyyyyyyyyy|01/12/2012|0422222222|25.5
M|zzzzzzzzzzz|01/12/2012|0388888888|30.2
M|aaaaaaaaaaaa|01/12/2012|0388888888|20.2
M|bbbbbbbbbb|01/12/2012|0299999999|50.5

OUTFILE_D.TXT:-

D|xxxxxxxxxxxxx|0411111111
D|yyyyyyyyyyyy|0422222222
D|zzzzzzzzzzz|0388888888
D|aaaaaaaaaaaa|0388888888
D|bbbbbbbbbb|0299999999
D|yasdfasdfasdfasd|0299999999

OUTFILE_O.TXT:-

O|xxxxxxxxxxxxx|0411111111|canada
O|yyyyyyyyyyyy|0422222222|canada
O|zzzzzzzzzzz|0388888888|USA
O|aaaaaaaaaaaa|0388888888|UK
O|bbbbbbbbbb|0299999999|Behrain

Regards,
Sagar

vgersh99 · December 10, 2012, 6:49pm

awk -F'|' '{o="OUTPUT_" $1 ".TXT";print $0 >o;close(o)}' INFILE.TXT

sagar.cumar · December 10, 2012, 6:59pm

Hi vgersh99,

Thanks for your quick reply. But, I am getting only one record in each output file.

Could you please advise further to get all records in the output file.

Regards,
pavan

vgersh99 · December 10, 2012, 7:07pm

sorry...

awk -F'|' '{o="OUTPUT_" $1 ".TXT";print $0 >>o;close(o)}' INFILE.TXT

sagar.cumar · December 10, 2012, 7:48pm

Hi vgersh99,

Thanks a lot for your quick reply. I shall further test and come back if I see any issues.

Regards,
Sagar.

---------- Post updated at 07:48 PM ---------- Previous update was at 07:45 PM ----------

Hi vgersh99,

One more thing is that my file is 4GB of size and I need to split this file into 9 different files based on one distinguishing code.

Could you please advise whether I see any performance issue.

If in that case, is there any other way to achieve the same result.

Regards,
Sagar.

vgersh99 · December 10, 2012, 7:51pm

if you can guarantee that the number of unique values in the first column is less than 10, than you optimize the code a bit more saving on the 'close':

awk -F'|' '{o="OUTPUT_" $1 ".TXT";print $0 >>o}' INFILE.TXT

test and see...

sagar.cumar · December 10, 2012, 8:00pm

Hi vgersh99,

Below is the expectation from the source system:
"Code Type" is the distinguishing field in the input file from source.
We are going to receive 9 different Code Types from the source.
Total Size of file is expected to be 4GB to 5GB on daily basis.

In the said scenario, I will see any performance issues?

Thanks a lot for all your inputs.

Regards,
Sagar.

sagar.cumar · December 12, 2012, 5:16pm

Hi vgersh99,

I have test regarding the performance by taking 9.9 GB file with a split of 9 files. Its excellent, you made my day.

Could you please advise on below:

eliminating header and trailer record.
distinguishing type is not in first column but in 5th column.

Input File looks like below:

HEADER|DC_MUR_INFILE_TEST_13DEC2012.DAT
0422222222|0422222222|yyyyyyyyyyyy|13/12/2012 9:08:39|M|yyyyyyyyyyyy|01/12/2012|0422222222|25.5
0388888888|0388888888|zzzzzzzzzzz|13/12/2012 9:08:39|M|zzzzzzzzzzz|01/12/2012|0388888888|30.2
0388888888|0388888888|aaaaaaaaaaaa|13/12/2012 9:08:39|M|aaaaaaaaaaaa|01/12/2012|0388888888|20.2
0299999999|0299999999|bbbbbbbbbb|13/12/2012 9:08:39|M|bbbbbbbbbb|01/12/2012|0299999999|50.5
0411111111|0411111111|xxxxxxxxxxxxx|13/12/2012 9:08:39|D|xxxxxxxxxxxxx|0411111111
0422222222|0422222222|yyyyyyyyyyyy|13/12/2012 9:08:39|D|yyyyyyyyyyyy|0422222222
0388888888|0388888888|zzzzzzzzzzz|13/12/2012 9:08:39|D|zzzzzzzzzzz|0388888888
0388888888|0388888888|aaaaaaaaaaaa|13/12/2012 9:08:39|D|aaaaaaaaaaaa|0388888888
0299999999|0299999999|bbbbbbbbbb|13/12/2012 9:08:39|D|bbbbbbbbbb|0299999999
0299999999|0299999999|yasdfasdfasdfasd|13/12/2012 9:08:39|D|yasdfasdfasdfasd|0299999999
0411111111|0411111111|xxxxxxxxxxxxx|13/12/2012 9:08:39|O|xxxxxxxxxxxxx|0411111111|canada
0422222222|0422222222|yyyyyyyyyyyy|13/12/2012 9:08:39|O|yyyyyyyyyyyy|0422222222|canada
0388888888|0388888888|zzzzzzzzzzz|13/12/2012 9:08:39|O|zzzzzzzzzzz|0388888888|USA
0388888888|0388888888|aaaaaaaaaaaa|13/12/2012 9:08:39|O|aaaaaaaaaaaa|0388888888|UK
0299999999|0299999999|bbbbbbbbbb|13/12/2012 9:08:39|O|bbbbbbbbbb|0299999999|Behrain
TRAILER|0000000015

Thanks a lot in advance.
Regards,
Sagar.

vgersh99 · December 12, 2012, 6:43pm

assuming header/footer don't start with numbers...

awk -F'|' '/^[0-9]/{o="OUTPUT_" $5 ".TXT";print $0 >>o}' INFILE.TXT