Split file by column value, each with header

Hello all,

I have a csv with different testcase values in column 5.

year,min,max,Instrument,Testcase
2016,201,1003,GEOTROPH-02116,TATA7980
2016,53,1011,GEOTROPH-01963,TATA7980
2016,3,1024,GEOTROPH-02067,TATA7980
2016,203,1027,GEOTROPH-02011,TATA7980
2016,135,1016,GEOTROPH-01932,TATA3405
2016,71,1036,GEOTROPH-02030,TATA7980
2016,173,1046,GEOTROPH-02033,TATA7980
2016,183,1015,GEOTROPH-02120,TATA7980
2016,47,1037,GEOTROPH-02154,TATA7980
2016,63,1017,GEOTROPH-01959,TATA7980
2016,73,1019,GEOTROPH-01966,TATA7980
2016,65,1023,GEOTROPH-01931,TATA3405
2016,317,1012,GEOTROPH-02077,TATA7980
2016,9,1027,GEOTROPH-02089,TATA7980
2016,61,1018,GEOTROPH-01947,TATA7980
2016,341,1029,GEOTROPH-02087,TATA7980
2016,341,1015,GEOTROPH-02004,TATA7980
2016,93,1027,GEOTROPH-02066,TATA7980
2016,205,1007,GEOTROPH-02021,TATA7980
2016,79,1013,GEOTROPH-02143,TATA7980
2016,51,1050,GEOTROPH-02000,TATA7980
2016,85,1010,GEOTROPH-02048,TATA7980
2016,47,1036,GEOTROPH-02048,TATA7980
2016,259,1079,GEOTROPH-01909,TATA3405
2016,55,1003,GEOTROPH-02147,TATA7980
2016,205,1031,GEOTROPH-02037,TATA7980
2016,21,1036,GEOTROPH-02060,TATA7980
2016,207,1032,GEOTROPH-01984,TATA7980
2016,51,1037,GEOTROPH-02154,TATA7980
2016,55,1021,GEOTROPH-02011,TATA7980
2016,341,1018,GEOTROPH-01968,TATA7980
2016,43,1020,GEOTROPH-02089,TATA7980
2016,181,1016,GEOTROPH-02047,TATA7980
2016,89,1032,GEOTROPH-02158,TATA7980
2016,71,1032,GEOTROPH-01923,TATA3405
2016,175,1025,GEOTROPH-01906,TATA3405
2016,43,1066,GEOTROPH-01925,TATA3405
2016,201,1028,GEOTROPH-01961,TATA7980
2016,201,1042,GEOTROPH-01916,TATA3405
2016,31,1021,GEOTROPH-02184,TATA7980
2016,39,1017,GEOTROPH-02141,TATA7980
2016,223,1014,GEOTROPH-02119,TATA7980
2016,41,1040,GEOTROPH-02052,TATA7980
2016,57,1034,GEOTROPH-02118,TATA7980
2016,79,1061,GEOTROPH-01909,TATA3405
2016,35,1006,GEOTROPH-02147,TATA7980
2016,341,1032,GEOTROPH-02176,TATA7980
2016,33,1020,GEOTROPH-02116,TATA7980
2016,167,1043,GEOTROPH-01929,TATA3405

I want to split this file into as many files as there are testcases in column 5 (2 files in this case: TATA34052016__testcase.csv and TATA79802016__testcase.csv), each containing the header line.

Here is my code

awk -F, -v fl="testcase"  'NR == 1 {hdr = $0
       next
}
f != $5"2016__"fl".csv" {
        if(f) close(f)
        f = $5"2016__"fl".csv"
        print hdr > f
}
{       print >> f
}' tmp

which produces the 2 files but deletes some records.

My desired output is


TATA79802016__testcase.csv

year,min,max,Instrument,Testcase
2016,201,1003,GEOTROPH-02116,TATA7980
2016,53,1011,GEOTROPH-01963,TATA7980
2016,3,1024,GEOTROPH-02067,TATA7980
2016,203,1027,GEOTROPH-02011,TATA7980
2016,71,1036,GEOTROPH-02030,TATA7980
2016,173,1046,GEOTROPH-02033,TATA7980
2016,183,1015,GEOTROPH-02120,TATA7980
2016,47,1037,GEOTROPH-02154,TATA7980
2016,63,1017,GEOTROPH-01959,TATA7980
2016,73,1019,GEOTROPH-01966,TATA7980
2016,317,1012,GEOTROPH-02077,TATA7980
2016,9,1027,GEOTROPH-02089,TATA7980
2016,61,1018,GEOTROPH-01947,TATA7980
2016,341,1029,GEOTROPH-02087,TATA7980
2016,341,1015,GEOTROPH-02004,TATA7980
2016,93,1027,GEOTROPH-02066,TATA7980
2016,205,1007,GEOTROPH-02021,TATA7980
2016,79,1013,GEOTROPH-02143,TATA7980
2016,51,1050,GEOTROPH-02000,TATA7980
2016,85,1010,GEOTROPH-02048,TATA7980
2016,47,1036,GEOTROPH-02048,TATA7980
2016,55,1003,GEOTROPH-02147,TATA7980
2016,205,1031,GEOTROPH-02037,TATA7980
2016,21,1036,GEOTROPH-02060,TATA7980
2016,207,1032,GEOTROPH-01984,TATA7980
2016,51,1037,GEOTROPH-02154,TATA7980
2016,55,1021,GEOTROPH-02011,TATA7980
2016,341,1018,GEOTROPH-01968,TATA7980
2016,43,1020,GEOTROPH-02089,TATA7980
2016,181,1016,GEOTROPH-02047,TATA7980
2016,89,1032,GEOTROPH-02158,TATA7980
2016,201,1028,GEOTROPH-01961,TATA7980
2016,31,1021,GEOTROPH-02184,TATA7980
2016,39,1017,GEOTROPH-02141,TATA7980
2016,223,1014,GEOTROPH-02119,TATA7980
2016,41,1040,GEOTROPH-02052,TATA7980
2016,57,1034,GEOTROPH-02118,TATA7980
2016,35,1006,GEOTROPH-02147,TATA7980
2016,341,1032,GEOTROPH-02176,TATA7980
2016,33,1020,GEOTROPH-02116,TATA7980

and TATA34052016__testcase.csv

year,min,max,Instrument,Testcase
2016,135,1016,GEOTROPH-01932,TATA3405
2016,65,1023,GEOTROPH-01931,TATA3405
2016,259,1079,GEOTROPH-01909,TATA3405
2016,71,1032,GEOTROPH-01923,TATA3405
2016,175,1025,GEOTROPH-01906,TATA3405
2016,43,1066,GEOTROPH-01925,TATA3405
2016,201,1042,GEOTROPH-01916,TATA3405
2016,79,1061,GEOTROPH-01909,TATA3405
2016,167,1043,GEOTROPH-01929,TATA3405

Please assist with correcting my code.
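For what it's worth, the record loss most likely comes from re-opening a previously closed file with ">", which truncates it each time $5 switches back to an earlier testcase. A minimal sketch of one way around this, remembering files already started in an array (the seen array name and the inline sample standing in for tmp are mine):

```shell
# Small inline sample standing in for the "tmp" file from the post.
cat > tmp <<'EOF'
year,min,max,Instrument,Testcase
2016,201,1003,GEOTROPH-02116,TATA7980
2016,135,1016,GEOTROPH-01932,TATA3405
2016,53,1011,GEOTROPH-01963,TATA7980
EOF

# Truncate and write the header only the first time a testcase is seen;
# afterwards the still-open file handle appends.
awk -F, -v fl="testcase" '
NR == 1 { hdr = $0; next }
{
    f = $5 "2016__" fl ".csv"
    if (!(f in seen)) {     # first record for this testcase
        seen[f] = 1
        print hdr > f       # ">" truncates only at this first open
    }
    print > f               # later prints append while f stays open
}' tmp
```

With this, switching back and forth between TATA7980 and TATA3405 no longer re-truncates an already-written file.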

Hi,

Can you try this one?

awk -F, 'NR==1{a=$0;next} $5 ~ /TATA7980/ { if (NR==2) print a > $5".csv";else print > $5".csv"} $5 ~ /TATA3405/ { print > $5"_f1.csv"} ' OFS="," file

Modify the filenames a bit as needed.

yr=$(date +%Y)
rm -f *_testcase.csv
while read line
do
   [[ -z "$line1" ]] && { line1=$line ; continue ; }
   fl=${line##*,}${yr}__testcase.csv
   [[ -f "$fl" ]] || echo "$line1" > "$fl"
   echo "$line" >> "$fl"
done < infile.csv

Can you do Perl?

This one has the advantage of taking into consideration the year of the data.

Save as process_input.pl
Run as perl process_input.pl testcase senhia83.input
Make sure you enter two arguments:
The suffix for the filename
The input filename

#!/usr/bin/perl
use strict;
use warnings;

my %data;
my $file_concat = shift;

my $header = <>;
while(<>) {
    chomp;
    my @fields = split ',';
    push @{$data{$fields[-1]}{$fields[0]}}, $_;
}

for my $fname (keys %data) {
    for my $year (keys $data{$fname}) {
        open my $fw, '>', "$fname${year}__$file_concat.csv" or die "Cannot open file: $!";
        print $fw $header;
        for my $value (@{$data{$fname}{$year}}) {
            print $fw "$value\n";
        }
        close $fw;
    }
}

This is great. Is it too much trouble to modify Aia's and rdrtx1's code if I want to specify the column number to key on? The testcase column may not be the last column in the actual dataset.

So how do I change the code if the key column is col62 among 128 columns? Thanks in advance.

Instead of:

push @{$data{$fields[-1]}{$fields[0]}}, $_;

Change to:

push @{$data{$fields[61]}{$fields[0]}}, $_;

Always one less than the column number you count, since Perl array indices start at 0. The total number of columns is not consequential; it can be an arbitrary number of columns.

Aia,

This is the error I get. Am I running it wrong?

$ perl spltwheader.pl testcase tmp6

Type of arg 1 to keys must be hash (not hash element) at spltwheader.pl line 16, near "}) "
Execution of spltwheader.pl aborted due to compilation errors.

Hello senhia83,

Could you please try the following and let me know if this helps you (though, honestly speaking, I am not completely sure about your requirement).

awk -F, 'NR==1{print > "TATA34052016__testcase.csv"; print > "TATA79802016__testcase.csv"} ($NF ~ /TATA3405/){print >> "TATA34052016__testcase.csv"} ($NF ~ /TATA7980/){print >> "TATA79802016__testcase.csv"}'  Input_file

As output it will create 2 files named TATA34052016__testcase.csv and TATA79802016__testcase.csv.

Thanks,
R. Singh


Oh, my...my! Picky version of Perl you have!
Please, replace

 for my $year (keys $data{$fname}) { 

with

 for my $year (keys %{$data{$fname}}) { 

@Ravinder, your solution is too specific and hardcoded; I'm looking for something generic and flexible like Aia's and rdrtx1's. Thanks a lot for your help.

@Aia, the edits worked! Running now.. many thanks..

Try also

awk '
NR==1   {HD = $0
         next
        }

!FN[$5] {FN[$5] = $5 "2016__testcase.csv"
         print HD  >  FN[$5]
        }

        {print $0  >  FN[$5]
        }
' FS="," file
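Following up on the generic-column question above: the same structure can be parameterized with awk variables, e.g. -v col=62 for column 62 of 128 (the col and yr variable names are my own; shown here with col=5 and an inline sample):

```shell
# Inline sample standing in for "file".
cat > file <<'EOF'
year,min,max,Instrument,Testcase
2016,201,1003,GEOTROPH-02116,TATA7980
2016,135,1016,GEOTROPH-01932,TATA3405
EOF

awk -v col=5 -v yr=2016 '
NR==1   {HD = $0
         next
        }

!FN[$col] {FN[$col] = $col yr "__testcase.csv"
         print HD  >  FN[$col]
        }

        {print $0  >  FN[$col]
        }
' FS="," file
```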

@ RudiC:

Can you please explain the last part?

awk '
NR==1   {HD = $0 # Save the first line, i.e. the header
         next    # Do not run any other commands; read the next line.
        }

!FN[$5] {FN[$5] = $5 "2016__testcase.csv" # Set array FN with the value of $5 as index; true twice for the given input sample.
         print HD  >  FN[$5] # Redirect the header from HD to the file
        }

        {print $0  >  FN[$5] # Print the current line to the file. How does this differentiate lines with different $5 and print them to their respective files?
        }
' FS="," file

The output is redirected to the respective file, controlled by $5, as requested in post#1. If a new (i.e. not yet registered) $5 shows up in a line, the new file name is constructed in FN[$5], and then all lines with the same $5 are assembled in the same file. The number of files (= different unique $5 values) is "unlimited" (except for system factors, e.g. memory).


Hello Rudi,

Thank you for the nice explanation. I would like to mention here: wouldn't it be good to close the opened files? If there are too many files open, it may become an issue. Could you please put some light on this too? Will be grateful to you, sir.

Thanks,
R. Singh

column=5
yr=$(date +%Y)

rm -f *_testcase.csv
(( acol = column - 1 ))
while read line
do
   [[ -z "$line1" ]] && { line1=$line ; continue ; }
   IFS=, read -r -a cols <<< "$line"
   fl=${cols[$acol]}${yr}__testcase.csv
   [[ -f "$fl" ]] || echo "$line1" > "$fl"
   echo "$line" >> "$fl"
done < infile.csv

Yes, if the number of files threatens to exceed the system configuration parameter OPEN_MAX (my linux: 1024) or awk's internal limit (here: 1022), a close(FN[$5]) after each print >> FN[$5] might help. Please note the >> redirection operator to append lines to the file.
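To make the close-after-write variant concrete, a sketch (note >> throughout, since each file is re-opened on every record; the rm -f guard matters because >> would otherwise append to leftovers from a previous run; the inline sample is mine):

```shell
# Inline sample standing in for "file".
cat > file <<'EOF'
year,min,max,Instrument,Testcase
2016,201,1003,GEOTROPH-02116,TATA7980
2016,135,1016,GEOTROPH-01932,TATA3405
2016,53,1011,GEOTROPH-01963,TATA7980
EOF

rm -f TATA*2016__testcase.csv

awk -F, '
NR==1   {HD = $0; next}
!FN[$5] {FN[$5] = $5 "2016__testcase.csv"
         print HD >> FN[$5]
         close(FN[$5])
        }
        {print $0 >> FN[$5]   # ">>" appends; the file was closed meanwhile
         close(FN[$5])        # keep at most one output file open
        }
' file
```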

Hi.

At our shop, we often try to generalize problem solutions, rather than having a number of separate solutions for similar problems.

For cases like this we created a utility, gate, to handle these kinds of situations. The center of this would be the single command:

gate -f=5 $FILE

Here is how we would start to solve this:

#!/usr/bin/env bash

# @(#) s1       Demonstrate collection of matching lines to separate files, gate.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C dixf gate
pe
dixf gate

FILE=${1-data1}
E=expected-output.txt

pl " Input data file $FILE, lines = $( wc -l < $FILE):"
head $FILE

pl " Expected output:"
grep csv $E

pl " Results, dry run:"
gate --dryrun --field=5 $FILE

pl " Results, real run:"
gate -f=5 $FILE
wc -l Testcase TATA*

pl " Documentation for gate:"
gate -h

exit 0

producing:

$ ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.6 (jessie) 
bash GNU bash 4.3.30
dixf (local) 1.12
gate (local) 1.10

gate    Group, collect, sift lines, separate lines into buckets. (what)
Path    : ~/bin/gate
Length  : 278 lines
Type    : Perl script, ASCII text executable
Shebang : #!/usr/bin/perl
Help    : probably available with [     ]-h

-----
 Input data file data1, lines = 50:
year,min,max,Instrument,Testcase
2016,201,1003,GEOTROPH-02116,TATA7980
2016,53,1011,GEOTROPH-01963,TATA7980
2016,3,1024,GEOTROPH-02067,TATA7980
2016,203,1027,GEOTROPH-02011,TATA7980
2016,135,1016,GEOTROPH-01932,TATA3405
2016,71,1036,GEOTROPH-02030,TATA7980
2016,173,1046,GEOTROPH-02033,TATA7980
2016,183,1015,GEOTROPH-02120,TATA7980
2016,47,1037,GEOTROPH-02154,TATA7980

-----
 Expected output:
TATA79802016__testcase.csv
and TATA34052016__testcase.csv

-----
 Results, dry run:
     9 entries in group (good) TATA3405
    40 entries in group (good) TATA7980
     1 entries in group (good) Testcase

-----
 Results, real run:
   1 Testcase
   9 TATA3405
  40 TATA7980
  50 total

-----
 Documentation for gate:
NAME
    gate - Group And Transfer Entries

VERSION
    1.0

USAGE
    gate [-d] [-c m,n] [-f n] [-h] [-s c] [files]

REQUIRED ARGUMENTS
    None.

OPTIONS
    -d <off>
        dry-run, determine filenames and counts. A basic sanity check on
        filenames is performed. This is recommended for a first run.

    -c m,n <0>
        Select column mode: define columns m through n as the filenames to
        which lines will be written. The columns are numbered beginning with
        1 on the left.

    -f number <0>
        Choose field number as the group. Data will be written to filenames
        corresponding to the contents of this field. The field numbering
        begins with 1 on the left.

    -h <off>
        Print this man page.

    -s character string <,>
        Select one or more characters as the separator character.

    files <STDIN>
        Specify files to process.

DESCRIPTION
    For each line read, the id value from the field or columns specified is
    isolated from each and used as a filename, to which that line is
    written. As many files are opened and written as are needed.

This may not be the final approach, but it would be a start for us.

Best wishes ... cheers, drl

---------- Post updated at 10:18 ---------- Previous update was at 09:46 ----------

Hi.

I seem to recall testing the timing between closing and not closing files in this kind of operation. My recollection is that it added a large amount of time to do the close-open-close... operation.

While there may be situations where this might be needed, I'd be tempted to consider splitting the input file and doing separate operations on the split pieces (for example). Multiple cores/CPUs can help with parallelization in such cases, say with the command parallel .
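One way to sidestep the open-files limit entirely, if the row order within each output file does not have to be preserved: pre-sort on the key column so every output file is written in one sequential stretch, with only a single file open at a time. A sketch (the inline sample is illustrative):

```shell
# Inline sample standing in for "infile.csv".
cat > infile.csv <<'EOF'
year,min,max,Instrument,Testcase
2016,201,1003,GEOTROPH-02116,TATA7980
2016,135,1016,GEOTROPH-01932,TATA3405
2016,53,1011,GEOTROPH-01963,TATA7980
EOF

hdr=$(head -n 1 infile.csv)

# Sort the body on column 5, then start a new file whenever the key changes.
tail -n +2 infile.csv | sort -t, -k5,5 |
awk -F, -v hdr="$hdr" '
$5 != prev {if (prev) close(f)          # previous group is complete
            f = $5 "2016__testcase.csv"
            print hdr > f
            prev = $5
           }
           {print > f}
'
```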

Best wishes ... cheers, drl
