Split a file based on a column

Hi,

I have a file sample_1.txt (300k rows) which has data like below; each record is around 64k bytes:
11|1|abc|102553|125589|64k bytes of data
10|2|def|123452|123356|......
13|2|geh|144351|121123|...
25|4|fgh|165250|118890|..
14|1|abc|186149|116657|......
21|7|def|207048|114424|......
23|7|geh|227947|112191|......
26|32|fgh|248846|109958|......
27|23|abc|269745|107725|......
29|34|def|290644|105492|......
30|32|geh|311543|103259|......
33|23|fgh|332442|101026|......
35|34|abc|353341|98793|......
37|7|def|374240|96560|......
39|4|geh|395139|94327|......
41|2|fgh|416038|92094|......
44|23|abc|436937|89861|......
46|1|def|457836|87628|......
48|3|geh|478735|85395|......
50|23|fgh|499634|83162|......

I am trying to split the file based on the 2nd column, like below:
sample_1_1.txt

11|1|abc|102553|125589|......
14|1|abc|186149|116657|......
46|1|def|457836|87628|......

sample_1_2.txt

10|2|def|123452|123356|......
13|2|geh|144351|121123|......
41|2|fgh|416038|92094|......

sample_1_3.txt

48|3|geh|478735|85395|......

and so on

Could someone help me with this?

Thanks in advance

awk -F\| '{ filename="sample_1_"$2".txt"; print $0 > filename } ' sample_1.txt
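One caveat: if the second column takes many distinct values, some awks can run out of open file descriptors, since each output file stays open. A variant (just a sketch) that closes each file after writing, trading speed for safety:

awk -F'|' '{ f = "sample_1_" $2 ".txt"; print $0 >> f; close(f) }' sample_1.txt

Note that >> appends, so remove any old sample_1_*.txt files before rerunning.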

Please check this way:

grep "|1|" sample_1.txt >> sample_1_1.txt --> change |1| for each value and redirect to the matching output file


Thanks, but my records are around 64,000 bytes each, so it is failing with a "line too long" error.

Can this be done using Perl?
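For reference, Perl reads lines of any length (memory permitting), so a one-liner along these lines might work (a minimal sketch; it reopens the output file for every record, which is slow but simple):

perl -F'\|' -alne 'open my $fh, ">>", "sample_1_$F[1].txt" or die $!; print $fh $_; close $fh' sample_1.txt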

How about a KSH script?

#!/bin/ksh

# f3 soaks up the rest of each line, embedded "|" separators and all
while IFS="|" read -r f1 f2 f3
do
        printf '%s\n' "$f1|$f2|$f3" >> "sample_1_$f2.txt"
done < sample_1.txt

An awk version that might circumvent the line limit gets a bit complicated, but you could try this:

awk '
  # RS="|" makes every field its own record, so awk never holds a full
  # 64k line in memory; FS="\n" means a record containing a newline
  # straddles a line boundary of the original file.
  NR==1 && getline n {
    # very first record is field 1; n (read ahead) is field 2, the key
    f="sample_1_" n ".txt"
    print $1 >> f
    print RS n >> f
    next
  }
  NF==2 && getline n {
    # record straddles a newline: $1 ends the old line, $2 starts the
    # new one, and n is the new line'\''s key
    print RS $1 FS >> f
    close(f)
    f="sample_1_" n ".txt"
    print $2 >> f
    print RS n >> f
    next
  }
  {
    # ordinary field in the middle of a line
    print RS $1 >> f
  }
  END{
    # terminate the last output line
    print FS >> f
  }
' RS=\| ORS= FS='\n' infile

Any utility specified to read text files (including awk, grep, read, and sed) may fail on any line longer than LINE_MAX bytes. The value of LINE_MAX on your system can be found by running the command: getconf LINE_MAX . The cut, paste, and fold utilities, however, are required to work with text files with unlimited line lengths. So, a way to do this (sketched after the list below) is to:

  1. Use cut to create a file just containing field 2 from your input file (e.g., name_list).
  2. Use cut to create a file with the first LINE_MAX-5 bytes of your input file (e.g., part001).
  3. Use cut to create other files with sequential sets of LINE_MAX-5 bytes from your input file (e.g., part002 ... partXXX) such that every part of your input file has been split into a file with lines less than LINE_MAX bytes long.
  4. Read name_list and calculate the name of the file to contain the reassembled input line.
  5. Read a line from each of the partXXX files and write it to the appropriate output file. (Note that the writes may have to be done as a separate write for each partXXX file line, adding a trailing newline character to the write of the last partXXX file.) You could also create separate output_field2_partXXX files, and use paste to create the final output files from these intermediate files.
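A rough shell sketch of those five steps, using the intermediate-file variant of step 5. It assumes 64 chunks cover the longest record and that the key plus a delimiter fits in the 5 spare bytes; the names name_list, partXXX, and out_KEY_partXXX are illustrative. Adjust all of these for your system:

lm=$(getconf LINE_MAX)
w=$((lm - 5))                     # chunk width; the slack leaves room for
                                  # the key and a delimiter when pasting

# Step 1: field 2 of every line, in input order
cut -d'|' -f2 sample_1.txt > name_list

# Steps 2 and 3: slice the input into part001 ... part064
i=1
start=1
while [ "$i" -le 64 ]             # assumption: 64 chunks suffice
do
    cut -b "$start-$((start + w - 1))" sample_1.txt > "$(printf 'part%03d' "$i")"
    start=$((start + w))
    i=$((i + 1))
done

# Steps 4 and 5: split each partXXX by key ...
for p in part[0-9][0-9][0-9]
do
    paste -d'|' name_list "$p" |
      awk -F'|' -v p="$p" '
        { key = $1
          sub(/^[^|]*\|/, "")     # drop the prepended key, keep the raw chunk
          print > ("out_" key "_" p)   # with very many keys, add close() here
        }'
done

# ... then paste each key's chunks back together into the final output files
sort -u name_list | while IFS= read -r key
do
    paste -d '\0' "out_${key}_part"[0-9][0-9][0-9] > "sample_1_${key}.txt"
done

Only cut and paste ever see the full-length lines; awk and read only see name_list lines and per-chunk lines, all shorter than LINE_MAX.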