Splitting the file

ginrkf · November 18, 2013, 1:46am

I have a file with around 10 million records.

Please find the sample data below

123456|ASDF|WORD|MIND|456890|40050|RTS
123456|9UIL|WORD|BLINK|15G26|43215|GTS
123456|9UIL|WORD|BLINK|15G26|43215|BTS
125828|9UIH|WIRD|BLANK|15G26|45215|NTS
125828|9UIH|WIRD|BLANK|15G26|47215|PTS
145679|8UIH|BIRD|BLINK|15T26|90807|ZTS

I need to split the file based on the first column.
All records that share the same first-column value should go into the same file.
So in the above data, the first three records will go to file 1:

123456|ASDF|WORD|MIND|456890|40050|RTS
123456|9UIL|WORD|BLINK|15G26|43215|GTS
123456|9UIL|WORD|BLINK|15G26|43215|BTS

These will go to file 2:

125828|9UIH|WIRD|BLANK|15G26|45215|NTS
125828|9UIH|WIRD|BLANK|15G26|47215|PTS

This will go to file 3:

145679|8UIH|BIRD|BLINK|15T26|90807|ZTS

But the problem is that the number of records with the same first-column value can vary.
For example, the sample data above has three records with the same value.
It could be 3, 4, 100, or any number. The same applies to the other sets of records.

Jotne · November 18, 2013, 1:58am

Simple approach:

awk -F'|' '{print > ($1 ".txt")}' file

Does the number in column #1 always come in groups, or can a value such as 123456 appear again further down in the file after other data?
If there are many distinct values, the output files should be closed as you go, otherwise awk can run out of open file descriptors.

EDIT:
This closes the current output file when field #1 changes:

awk -F'|' 'f!=$1 {close(f ".txt")} {print > ($1 ".txt"); f=$1} END {close(f ".txt")}' file
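If the same value could show up again further down the file, the `>` redirect would truncate the file when it is reopened after close(). A minimal sketch of a variant that appends instead, assuming a sample file named input_file (the name and the sample rows are only for illustration):

```shell
# Work in a scratch directory so the demo does not touch real data.
cd "$(mktemp -d)" || exit 1

# Sample input: key 123456 reappears after other keys.
cat > input_file <<'EOF'
123456|ASDF|WORD|MIND|456890|40050|RTS
125828|9UIH|WIRD|BLANK|15G26|45215|NTS
123456|9UIL|WORD|BLINK|15G26|43215|GTS
145679|8UIH|BIRD|BLINK|15T26|90807|ZTS
EOF

# Close the previous output whenever the key changes, and append (>>)
# so that reopening a file later does not wipe what was already written.
awk -F'|' '
    $1 != prev { if (prev != "") close(prev ".txt"); prev = $1 }
    { print >> (prev ".txt") }
' input_file
```

Because of the `>>`, running it a second time in the same directory appends the records again, so old output files should be removed first.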

ginrkf · November 18, 2013, 2:35am

Thanks, it's working fine. The records will always come in groups.

But there is another issue. If we have around 77k distinct sets of records, it will create 77k files. I don't want to create that many files. I would like to combine the sets into three or four files at most, but a set of records with the same value must not be split across two files.

Jotne · November 18, 2013, 2:46am

We can use only part of the first field to create larger groups. If you show an example of how you want to group, we can show you how it can be done, e.g. by the first two digits.
Here is an example using the first two digits:

awk -F'|' '{k=substr($1,1,2)} f!=k {close(f ".txt")} {print > (k ".txt"); f=k} END {close(f ".txt")}' file

ginrkf · November 18, 2013, 2:57am

For example:

123456|ASDF|WORD|MIND|456890|40050|RTS
123456|9UIL|WORD|BLINK|15G26|43215|GTS
123456|9UIL|WORD|BLINK|15G26|43215|BTS
125828|9UIH|WIRD|BLANK|15G26|45215|NTS
125828|9UIH|WIRD|BLANK|15G26|47215|PTS
145679|8UIH|BIRD|BLINK|15T26|90807|ZTS
123456|ASDF|WORD|MIND|456890|40050|RTS
123456|9UIL|WORD|BLINK|15G26|43215|GTS
123456|9UIL|WORD|BLINK|15G26|43215|BTS

I can combine the first set and the second set into one file. We can keep adding records to a file until it reaches 500000 records.

But we must take care of one thing: a set of records with the same value must not be split into two files.

For example:

123456|ASDF|WORD|MIND|456890|40050|RTS
123456|9UIL|WORD|BLINK|15G26|43215|GTS
123456|9UIL|WORD|BLINK|15G26|43215|BTS

In the above sample data, putting the first two records into one file and the third into a different file must not happen anywhere.
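One way to check that rule after the fact is to scan the output files and report any first-column value that appears in more than one of them. A rough sketch, assuming two hypothetical output files named part_001.txt and part_002.txt in which 123456 has deliberately been split:

```shell
cd "$(mktemp -d)" || exit 1

# Two hypothetical output files; key 123456 ends up in both.
printf '%s\n' '123456|ASDF|WORD|MIND|456890|40050|RTS' \
              '123456|9UIL|WORD|BLINK|15G26|43215|GTS' > part_001.txt
printf '%s\n' '123456|9UIL|WORD|BLINK|15G26|43215|BTS' \
              '125828|9UIH|WIRD|BLANK|15G26|45215|NTS' > part_002.txt

# Remember the first file each key was seen in, and print a key once
# if it later turns up in a different file.
split_keys=$(awk -F'|' '
    !($1 in first) { first[$1] = FILENAME; next }
    first[$1] != FILENAME && !reported[$1]++ { print $1 }
' part_001.txt part_002.txt)

echo "keys split across files: ${split_keys:-none}"
```

An empty result means no set of records was split.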

gpk_newbie · November 18, 2013, 3:15am

The code below first saves the unique values of the first column to a file. It then reads those values and, for each one, stores all records from the input file that match it in a file named after the value.

cut -d'|' -f1 input_file | sort -u > final

while IFS= read -r line
do
    # anchor the match to the first field so the value is not matched in other columns
    grep "^${line}|" input_file > "${line}.txt"
done < final

Once the above code is executed, it results in 3 files (for the example in the question):

123456.txt
125828.txt
145679.txt

The contents of the files are as below.

more 123456.txt
123456|ASDF|WORD|MIND|456890|40050|RTS
123456|9UIL|WORD|BLINK|15G26|43215|GTS
123456|9UIL|WORD|BLINK|15G26|43215|BTS
 
more 125828.txt
125828|9UIH|WIRD|BLANK|15G26|45215|NTS
125828|9UIH|WIRD|BLANK|15G26|47215|PTS
 
more 145679.txt
145679|8UIH|BIRD|BLINK|15T26|90807|ZTS

Jotne · November 18, 2013, 3:16am

Not sure what you want, but this splits the file into files with 500000 records each:

awk 'NR%500000==1 {if (out != "") close(out); out=sprintf("%06d.txt", ++a)} {print > out}' file
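A split purely by record count can cut a set of records in the middle. Below is a sketch of a variant that only switches to a new file at a group boundary, assuming the records come in groups as stated earlier. The part_NNN.txt names and the max variable are made up for the example, and max is set to 4 so the effect is visible on the sample data (it would be 500000 for the real file). Note the limit is soft: a file is closed at the first group boundary after it has reached max records, so it can end up somewhat larger than max.

```shell
cd "$(mktemp -d)" || exit 1

cat > input_file <<'EOF'
123456|ASDF|WORD|MIND|456890|40050|RTS
123456|9UIL|WORD|BLINK|15G26|43215|GTS
123456|9UIL|WORD|BLINK|15G26|43215|BTS
125828|9UIH|WIRD|BLANK|15G26|45215|NTS
125828|9UIH|WIRD|BLANK|15G26|47215|PTS
145679|8UIH|BIRD|BLINK|15T26|90807|ZTS
EOF

awk -F'|' -v max=4 '
    # Only consider switching files when the first field changes.
    NR == 1 || $1 != prev {
        if (out == "" || count >= max) {
            if (out != "") close(out)
            out = sprintf("part_%03d.txt", ++n)
            count = 0
        }
        prev = $1
    }
    { print > out; count++ }
' input_file
```

With the sample data this should leave the 123456 and 125828 sets together in part_001.txt and the 145679 record in part_002.txt.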

disedorgue · November 18, 2013, 12:34pm

Hi,
Another while loop:

$ ls
input_file
$ cat input_file
123456|ASDF|WORD|MIND|456890|40050|RTS
123456|9UIL|WORD|BLINK|15G26|43215|GTS
123456|9UIL|WORD|BLINK|15G26|43215|BTS
125828|9UIH|WIRD|BLANK|15G26|45215|NTS
125828|9UIH|WIRD|BLANK|15G26|47215|PTS
145679|8UIH|BIRD|BLINK|15T26|90807|ZTS
123456|ASDF|WORD|MIND|456890|40050|RTS
123456|9UIL|WORD|BLINK|15G26|43215|GTS
123456|9UIL|WORD|BLINK|15G26|43215|BTS

$ while IFS= read -r a ; do echo "$a" >>"${a%%|*}.txt" ; done <input_file

$ ls
123456.txt  125828.txt  145679.txt  input_file
$ cat 123456.txt
123456|ASDF|WORD|MIND|456890|40050|RTS
123456|9UIL|WORD|BLINK|15G26|43215|GTS
123456|9UIL|WORD|BLINK|15G26|43215|BTS
123456|ASDF|WORD|MIND|456890|40050|RTS
123456|9UIL|WORD|BLINK|15G26|43215|GTS
123456|9UIL|WORD|BLINK|15G26|43215|BTS
$ cat 125828.txt
125828|9UIH|WIRD|BLANK|15G26|45215|NTS
125828|9UIH|WIRD|BLANK|15G26|47215|PTS
$ cat 145679.txt
145679|8UIH|BIRD|BLINK|15T26|90807|ZTS

Regards.