Hi,
I have a requirement in which I am going to receive one file and should be splitted to 9 different files based on one distinguisher called TYPE.
I heard that this can be done using awk or sed.
Can any one advise regardint the logic and simpler way other than using awk or sed is also fine.
The volume of data expecting is 5GB daily file.
Thank you in advance.
Regards,
Sagar.
cat file1
Type value
A 1
A 2
C 3
D 5
A 9
D 8
Z 5
awk 'NR>1{print $0 >$1".txt"}' file1
Hi Pravin,
Thanks for quick response. What is NR in awk command. Could you please bit more clear on the command.
Also, is there any performance issue as daily we are going to get 5GB of data.
Thank you in advance.
Regards,
Sagar
bmk
December 10, 2012, 2:16am
4
Hi, if you use SPLIT command ,it will be easy.
Awk NR
gives you the total number of records being processed or line number.
You can check this link to understand it more clearly
8 Powerful Awk Built-in Variables � FS, OFS, RS, ORS, NR, NF, FILENAME, FNR (link removed)
Hi bmk,
Any examples please for split command.
Hi Vikram,
Thanks for the link and it is very much useful. But, the said variables at field level.
Regards,
Sagar.
bmk
December 10, 2012, 6:36am
7
Try like...
----Split the file into 2 files
maxline=17 ---File count/2
split -l $maxline -d test1.txt L
for i in L??
do
echo "1"
mv "$i" "$i"test
done
Hi All,
To make things clear, below is my actual requirement. Sorry for giving more details lately.
Could you please help me in getting this quickly.
Thanks a lot in advance.
Requirement Details:-
Input File Details
File Name ---> INFILE.TXT
File Content --->
V|xxxxxxxxxxxxx|01/12/2012|0411111111
V|yyyyyyyyyyyy|01/12/2012|0422222222
V|zzzzzzzzzzz|01/12/2012|0388888888
V|aaaaaaaaaaaa|01/12/2012|0388888888
M|xxxxxxxxxxxxx|01/12/2012|0411111111|20
M|yyyyyyyyyyyy|01/12/2012|0422222222|25.5
M|zzzzzzzzzzz|01/12/2012|0388888888|30.2
M|aaaaaaaaaaaa|01/12/2012|0388888888|20.2
M|bbbbbbbbbb|01/12/2012|0299999999|50.5
D|xxxxxxxxxxxxx|0411111111
D|yyyyyyyyyyyy|0422222222
D|zzzzzzzzzzz|0388888888
D|aaaaaaaaaaaa|0388888888
D|bbbbbbbbbb|0299999999
D|yasdfasdfasdfasd|0299999999
O|xxxxxxxxxxxxx|0411111111|canada
O|yyyyyyyyyyyy|0422222222|canada
O|zzzzzzzzzzz|0388888888|USA
O|aaaaaaaaaaaa|0388888888|UK
O|bbbbbbbbbb|0299999999|Behrain
Requirement is to split this input file based on first character as all Vs should go one out file, all Ms should go to second out file and so on.
The output files should like below:
OUTFILE_V.TXT:-
V|xxxxxxxxxxxxx|01/12/2012|0411111111
V|yyyyyyyyyyyy|01/12/2012|0422222222
V|zzzzzzzzzzz|01/12/2012|0388888888
V|aaaaaaaaaaaa|01/12/2012|0388888888
OUTFILE_M.TXT:-
M|xxxxxxxxxxxxx|01/12/2012|0411111111|20
M|yyyyyyyyyyyy|01/12/2012|0422222222|25.5
M|zzzzzzzzzzz|01/12/2012|0388888888|30.2
M|aaaaaaaaaaaa|01/12/2012|0388888888|20.2
M|bbbbbbbbbb|01/12/2012|0299999999|50.5
OUTFILE_D.TXT:-
D|xxxxxxxxxxxxx|0411111111
D|yyyyyyyyyyyy|0422222222
D|zzzzzzzzzzz|0388888888
D|aaaaaaaaaaaa|0388888888
D|bbbbbbbbbb|0299999999
D|yasdfasdfasdfasd|0299999999
OUTFILE_O.TXT:-
O|xxxxxxxxxxxxx|0411111111|canada
O|yyyyyyyyyyyy|0422222222|canada
O|zzzzzzzzzzz|0388888888|USA
O|aaaaaaaaaaaa|0388888888|UK
O|bbbbbbbbbb|0299999999|Behrain
Regards,
Sagar
awk -F'|' '{o="OUTPUT_" $1 ".TXT";print $0 >o;close(o)}' INFILE.TXT
Hi vgersh99,
Thanks for your quick reply. But, I am getting only one record in each output file.
Could you please advise further to get all records in the output file.
Regards,
pavan
sorry...
awk -F'|' '{o="OUTPUT_" $1 ".TXT";print $0 >>o;close(o)}' INFILE.TXT
Hi vgersh99,
Thanks a lot for your quick reply. I shall further test and come back if I see any issues.
Regards,
Sagar.
---------- Post updated at 07:48 PM ---------- Previous update was at 07:45 PM ----------
Hi vgersh99,
One more thing is that my file is 4GB of size and I need to split this file into 9 different files based on one distinguishing code.
Could you please advise whether I see any performance issue.
If in that case, is there any other way to achieve the same result.
Regards,
Sagar.
if you can guarantee that the number of unique values in the first column is less than 10, than you optimize the code a bit more saving on the 'close':
awk -F'|' '{o="OUTPUT_" $1 ".TXT";print $0 >>o}' INFILE.TXT
test and see...
Hi vgersh99,
Below is the expectation from the source system:
"Code Type" is the distinguishing field in the input file from source.
We are going to receive 9 different Code Types from the source.
Total Size of file is expected to be 4GB to 5GB on daily basis.
In the said scenario, I will see any performance issues?
Thanks a lot for all your inputs.
Regards,
Sagar.
Hi vgersh99,
I have test regarding the performance by taking 9.9 GB file with a split of 9 files. Its excellent, you made my day.
Could you please advise on below:
eliminating header and trailer record.
distinguishing type is not in first column but in 5th column.
Input File looks like below:
HEADER|DC_MUR_INFILE_TEST_13DEC2012.DAT
0422222222|0422222222|yyyyyyyyyyyy|13/12/2012 9:08:39|M|yyyyyyyyyyyy|01/12/2012|0422222222|25.5
0388888888|0388888888|zzzzzzzzzzz|13/12/2012 9:08:39|M|zzzzzzzzzzz|01/12/2012|0388888888|30.2
0388888888|0388888888|aaaaaaaaaaaa|13/12/2012 9:08:39|M|aaaaaaaaaaaa|01/12/2012|0388888888|20.2
0299999999|0299999999|bbbbbbbbbb|13/12/2012 9:08:39|M|bbbbbbbbbb|01/12/2012|0299999999|50.5
0411111111|0411111111|xxxxxxxxxxxxx|13/12/2012 9:08:39|D|xxxxxxxxxxxxx|0411111111
0422222222|0422222222|yyyyyyyyyyyy|13/12/2012 9:08:39|D|yyyyyyyyyyyy|0422222222
0388888888|0388888888|zzzzzzzzzzz|13/12/2012 9:08:39|D|zzzzzzzzzzz|0388888888
0388888888|0388888888|aaaaaaaaaaaa|13/12/2012 9:08:39|D|aaaaaaaaaaaa|0388888888
0299999999|0299999999|bbbbbbbbbb|13/12/2012 9:08:39|D|bbbbbbbbbb|0299999999
0299999999|0299999999|yasdfasdfasdfasd|13/12/2012 9:08:39|D|yasdfasdfasdfasd|0299999999
0411111111|0411111111|xxxxxxxxxxxxx|13/12/2012 9:08:39|O|xxxxxxxxxxxxx|0411111111|canada
0422222222|0422222222|yyyyyyyyyyyy|13/12/2012 9:08:39|O|yyyyyyyyyyyy|0422222222|canada
0388888888|0388888888|zzzzzzzzzzz|13/12/2012 9:08:39|O|zzzzzzzzzzz|0388888888|USA
0388888888|0388888888|aaaaaaaaaaaa|13/12/2012 9:08:39|O|aaaaaaaaaaaa|0388888888|UK
0299999999|0299999999|bbbbbbbbbb|13/12/2012 9:08:39|O|bbbbbbbbbb|0299999999|Behrain
TRAILER|0000000015
Thanks a lot in advance.
Regards,
Sagar.
assuming header/footer don't start with numbers...
awk -F'|' '/^[0-9]/{o="OUTPUT_" $5 ".TXT";print $0 >>o}' INFILE.TXT