Split variable length and variable format CSV file

chechun · February 5, 2010, 11:40am

Dear all,

I have basic knowledge of Unix script and her I am trying to process variable length and variable format CSV file.

The file length will depend on the numbers of Earnings/Deductions/Direct Deposits.
And
The format will depend on whether it is Earnings/Deductions or Direct Deposits

For example, if employee has 2 earning, 2 deductions, and 2 direct deposits accounts, The record after EmployeeID will repeat twice for Earnings, repeat twice for Deductions, and repeat twice for direct deposits.
For earnings and deposits, the format are the same, they need to have earnings/deposit numbers then earnings/deposit amounts.
For direct deposits, the format is more complicated, it will require Bank Number, Account Type, Account Number, Amount,

Here is the sample data. The first line is header.

EmployeeID,Total numbers of Earnings (E), Total numbers of Deductions (D), Total numbers of Direct Deposit (DD), E/D #1 Number, E/D #1 Amount, E/D #2 Number, E/D #2 Amount,....,E/D #n Number, E/D #n Amount, DD #1 Bank Number, DD #1 Account Type, DD #1 Account Number, DD #1 Amount, DD #2 Bank Number, DD #2 Account Type, DD #2 Account Number, DD #2 Amount,......,DD #n Bank Number, DD #n Account Type, DD #n Account Number, DD #n Amount, 
1,1,2,1,01,12000,13,1000,14,1000,010001001,ABC,123456,10000
2,2,2,2,01,10000,02,2000,11,500,12, 500,010001002,ABC,1212121,5000,010001003,CDE,123123,6000

For Employee #1, he/she has 1 earning, 2 deductions, and 1 direct deposit, so his/her record will have one earning (01), two deductions (13 and 14), and one direct deposit account (010001001,ABC,123456)
For Employee #2, he/she has 2 earnings, 2 deduction,s and 2 direct deposits, so his/her record will have one earning (01 and 02), two deductions (11 and 12), and two direct deposit accounts (010001002,ABC,1212121 and 010001003,CDE,123123)

Since the numbers of Earnings/Deductions or Direct Deposits could vary, the length and format of the records in CSV file are different.

I have several options.

I can seperate the file into three files. One for Earning, one for deduction, and one for direct deposit. Each file will have employee id in the beginning of the record.
so for sample data will look like

Earning File

EmployeeID,Earning #1 Number, Earning  #1 Amount,....,Earning #n Number, Earning  #n Amount
1,01,12000
2,01,10000,02,2000

Deduction File

EmployeeID,Deduction #1 Number, Deduction  #1 Amount,....,Deduction #n Number, Deduction  #n Amount
1, 13,1000,14,1000
2,11,500,12, 500

Direct Deposit File

EmployeeID,#1 Bank Number, #1 Account Type, #1 Account Number, #1 Amount,..... #n Bank Number, #n Account Type, #n Account Number, #n Amount
1,010001001,ABC,123456,10000
2,010001002,ABC,1212121,5000,010001003,CDE,123123,6000

or
2. I can further "normalize" each record into sepeate records in each files.

Earning File

EmployeeID,Earning Number, Earning Amount
1,01,12000
2,01,10000
2,02,2000

Deduction File

EmployeeID,Deduction Number, Deduction  Amount
1,13,1000
1,14,1000
2,11,500
2,12,500

Direct Deposit File

EmployeeID,Bank Number, Account Type, Account Number, Amount
1,010001001,ABC,123456,10000
2,010001002,ABC,1212121,5000,010001003,CDE,123123,6000

Thank you very much for your time to help me to resolve this issue.

chechun

chechun · February 18, 2010, 4:34pm

Unfortunately, I couldn't find a good solution in Unix shell script. I have resolved this issue using PL/SQL. I used several for loop and combination of substr and instr to resolve it. Thanks!

radoulov · February 19, 2010, 9:07am

Hm,
if it's not too late, you can use something like this:

earnings:

perl -F, -lane'
next if $. == 1; print join ",", $F[0], @F[4 .. (3 + $F[1]) * 2];
' infile.csv

deductions:

perl -F, -lane'
next if $. == 1; print join ",", $F[0], @F[(3 + $F[1] * 2) + 1 .. (3 + $F[1] * 2) + $F[2] * 2];
' infile.csv

etc.

For example, given your sample file:

% perl -F, -lane'
next if $. == 1; print join ",", $F[0], @F[4 .. (3 + $F[1]) * 2];
' infile.csv
1,01,12000,13,1000,14
2,01,10000,02,2000,11,500,12

% perl -F, -lane'
next if $. == 1; print join ",", $F[0], @F[(3 + $F[1] * 2) + 1 .. (3 + $F[1] * 2) + $F[2] * 2];
' infile.csv
1,13,1000,14,1000
2,11,500,12, 500

Let me know if you still need it, I can help you to incorporate the code in a single program.