Split file based on records

Ajay_Venkatesan · July 11, 2013, 5:58am

I have to split a file based on number of lines and the below command works fine:

split -l 2 Inputfile -d Outputfile

My input file contains header, detail and trailer info as below:

H
D
D
D
D
T

The split files produced by the above command contain:
First File:

H
D

Second File:

D
D

Third File:

D
T

But I do not want the H and T records. I want only the detail records.

Can anyone help me achieve this?

Thanks,
Ajay

vidyadhar85 · July 11, 2013, 6:10am

You can use the awk command below to get rid of H and T:

awk -v nl=$(wc -l < filename) 'NR>1 && NR<nl{print $0}' filename

However, you may not be able to pass this to split when it is using the -d option, as it might throw an error, so I suggest writing to a temp file or using a method other than split to create the files.

Ajay_Venkatesan · July 12, 2013, 2:24am

Hi Vidyadhar,

Thanks for the reply. This works, but when the input file has no header or trailer record, it still strips the first and the last line. I want to split the file based on the TAGS: only lines that start with a given substring, say "D", should be split.

I thought of doing it with a temp file, but since I am dealing with huge files of up to 30GB, writing to a temp file will take more time.

Is there any other solution for this?

Thanks,
Ajay

vidyadhar85 · July 12, 2013, 2:30am

Can you give your exact requirements?

What's the exact string?
How will you decide whether a record is a header or a trailer?
How many records per file? If you go for 2 records per file on a 30GB file, you will end up consuming all the inodes.
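
To make the records-per-file question concrete, here is a rough sketch of how one might estimate the number of output files before splitting. The sample input, the `per_file` value and the variable names are all assumptions for illustration, not part of the original requirement.

```shell
# Hypothetical sample input; in practice this would be the real data file.
workdir=$(mktemp -d)
printf 'H\nD\nD\nD\nD\nD\nT\n' > "$workdir/input.txt"

per_file=2   # assumed number of detail records per output file

# Count only the lines that start with the detail tag "D".
detail_count=$(grep -c '^D' "$workdir/input.txt")

# Ceiling division gives the number of files a line-based split would create.
file_count=$(( (detail_count + per_file - 1) / per_file ))

echo "detail records: $detail_count, output files: $file_count"
```

Each output file uses one inode, so comparing `file_count` against the free inodes reported by `df -i` gives a quick sanity check.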

Ajay_Venkatesan · July 12, 2013, 2:40am

OK. I will have Header, Detail and Trailer records, of which Detail will repeat (any number of times) and Header/Trailer will not repeat, and they might or might not occur in the input (in other words, they are not mandatory). I want to split the file based on a particular number of lines for the Detail records only, omitting the Header and Trailer records.

I could just delete the header and trailer records and then split the file, but I need them for another purpose.

If you require any clarification, please do ask me.

Thanks,
Ajay

---------- Post updated at 12:10 PM ---------- Previous update was at 12:09 PM ----------

The records will be identified based on the TAGS.

rajamadhavan · July 12, 2013, 3:07am

How about this? It assumes the header is the first line and the trailer is the last line:

sed -e '1d' -e '$d' file  | split -l2

Ajay_Venkatesan · July 12, 2013, 3:13am

Hi Rajamadhavan,

This works. But when there is no header and trailer in the file, it will strip off the first record, which would actually have been a detail record.

Instead I tried something like this:

awk '{if (substr($0,1,1) == "D") {print $0;}}' Filename

This gets you the list of detail records. But I am not sure how to pass the output of this command to the split command.
Any suggestions on that?

Thanks,
Ajay
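
One possible way to connect the two commands is a plain pipe, since split reads standard input when no input file is named. The sketch below assumes a small made-up input (the `D1`..`D4` records are illustrative) and uses split's default output names.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'H\nD1\nD2\nD3\nD4\nT\n' > Filename

# Keep only lines whose first character is "D" (awk prints a line when the
# pattern is true), then let split read them from standard input.
awk 'substr($0, 1, 1) == "D"' Filename | split -l 2

# With no prefix given, split names its output xaa, xab, and so on.
ls x*
```

Because the data is streamed through the pipe, no intermediate temp file is written.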

rajamadhavan · July 12, 2013, 3:16am

Looking at your example, you could use sed to do the pattern match and pipe it to split:

sed -n /^D/p file | split -l2

Ajay_Venkatesan · July 12, 2013, 5:02am

Hi Rajamadhavan,

This works great. I just need a few confirmations.

I guess the ^ is an escape character, am I right?
Secondly, the split command is creating the files with names xaa, xab and so on.
I want to give a specific name, and I tried the command below:

sed -n /^D/p file | split -l 2 -d test

This fails saying that "test" does not exist, whereas I want to create file names such as test01, test02 and so on.

Can you suggest where to specify the split file name?

Thanks,
Ajay

rajamadhavan · July 12, 2013, 5:12am

^ is the regex anchor that matches the beginning of the line.

The command below will change the prefix from 'x' to 'test'. As far as I know, you can only change the prefix and not the suffix, so the split files will be testaa, testab and so on:

sed -n /^D/p file | split -l2 - test
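
The reason the lone "-" matters is that split takes its operands in the order input file, then prefix. The following sketch, using a made-up sample file, contrasts the two forms; the `first_status` variable is only there for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'H\nD1\nD2\nD3\nT\n' > file

# Without "-", split treats "test" as the input file name, which does not exist here.
if sed -n '/^D/p' file | split -l 2 test 2>/dev/null; then
    first_status=ok
else
    first_status=failed
fi

# With "-", standard input is the input and "test" becomes the output prefix.
sed -n '/^D/p' file | split -l 2 - test

echo "without dash: $first_status"
ls test*
```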

Ajay_Venkatesan · July 12, 2013, 5:26am

Thanks a ton, Rajamadhavan.

It worked. And we are able to change the suffix as well:

sed -n /^$a/p file | split -l 2 -d - test

And my output is test00, test01 and so on.

Thanks again for your help.

Thanks,
Ajay
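
A slightly more defensive variant of that command might look like the sketch below. It assumes GNU split (for -d), a made-up sample file, and a hypothetical `tag` variable standing in for `$a`; the -a option sets the suffix width, which can help when many output files are expected.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'H\nD1\nD2\nD3\nD4\nD5\nT\n' > file

tag=D   # assumed record tag; stands in for the $a variable used above

# Double quotes let the shell expand $tag while protecting the rest of the sed script.
# -d requests numeric suffixes and -a 3 makes them three digits wide (GNU split).
sed -n "/^${tag}/p" file | split -l 2 -d -a 3 - test

ls test*
```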

rajamadhavan · July 12, 2013, 6:29am

OK. The -d option is not available in split on my system.
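
On systems where split lacks -d, one alternative is to let awk do both the filtering and the numbered file naming in a single pass. This is only a sketch under assumptions: the sample file, the `per` and `prefix` variables, and the two-digit suffix format are illustrative choices.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'H\nD1\nD2\nD3\nD4\nD5\nT\n' > file

# For each detail line, start a new numbered output file every "per" lines.
awk -v per=2 -v prefix=test '
    /^D/ {
        if (count % per == 0) {
            if (out != "") close(out)      # release the previous file handle
            out = sprintf("%s%02d", prefix, count / per)
        }
        print > out
        count++
    }
' file

ls test*
```

Closing each finished file keeps the number of open file handles constant, which matters when the input produces a large number of pieces.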