CSV formatting with prefixing, appending and padding field

Hi I have a very large csv file with some hundreds of thousands of rows of data.

The data is in the following format:
Up to four alpha numeric characters for the first word. This is either set as 2 characters followed by 2 spaces, or as a single 4character word. This is then followed by an 8 character alphanumeric word making a total of 12 characters in total. The number of characters and positioning is the same for every line. ie, 4 and 8.

"AB  324578DE"
"ABCDAAA87958"
"ZF  4R85T79B"

I need a way to reformat the code by taking the first 4 characters and then padding them to the end of the field as an 18 character field. So there will be 6 spaces in between the 8 character word and the original first 4 characters. ie. 8, 6, and 4.

For example the data above will become:

"324578DE      AB  "
"AAA87958      ABCD"
"4R85T79B      ZF  "

This will then be prefixed with a static text of "CODE," and then appended with 18 commas to make it a proper csv file ready for database import.

See final result below(minus the double quotes):

"CODE,324578DE      AB  ,,,,,,,,,,,,,,,,,,"
"CODE,AAA87958      ABCD,,,,,,,,,,,,,,,,,,"
"CODE,4R85T79B      ZF  ,,,,,,,,,,,,,,,,,,"

I can do the last step of prefixing and appending by using the following code where input.csv will be the results of the first text swap and padding:

sed 's/^.*/CODE,&/'  input.csv | sed -e 's/$/,,,,,,,,,,,,,,,,,,/' > output.csv

however I am stuggling to perform the first step of the text swapping and padding. I was thinking of the following logic to get this to work:

  1. Append 6 blank spaces to end of field.
  2. Copy first 4 characters and append to end of field.
  3. Delete first 4 characters.

I can't seem to get step 2 though.

I would appreciate any hints on getting this whole thing working.

No need to run sed multiple times; you can have multiple commands in a sed script. Try this:

sed '
  s/^\(....\)\(.*\)/\2      \1/
  s/^.*/CODE,&/
  s/$/,,,,,,,,,,,,,,,,,,/
' input.csv > output.csv

The second command there identifies a 4 character substring and a rest-of-string substring, then reinserts them into the replacement with a six-character separator (much like & does, but it inserts the entire match).

However you could simplify that further and do it in one step, e.g.:

sed 's/^\(....\)\(.*\)/CODE,\2      \1,,,,,,,,,,,,,,,,,,/' input.csv > output.csv

Thank you Annihilannic your solution worked really well.

:b:

sed 's/\(....\)\(.*\)/\2      \1/' a.txt

I didn't realise it before, but the requirement is not as simple as I had first thought.

The second field is not always 8 characters as I had assumed.

I've since found that some are 3, 4, 7 characters etc.

I guess what I should have said, is that the first four characters must be appended to the end, where this appending must start after 14 characters to make a total field length of 18 characters.

So given the following:

"AB  12345"
"ABCDAA123456"
"ZF  4R85T7"
"BFR ABC"

will become:

"12345         AB  "
"AA123456      ABCD"
"4R85T7        ZF  "
"ABC           BFR "

This sounds overly complicated to me. Any help would be very appreciated.

In that case I would use awk rather than sed to benefit from availability of printf() with appropriate formatting specifiers:

awk '{ printf("CODE,%-14s%-4s,,,,,,,,,,,,,,,,,,\n",substr($0,5),substr($0,1,4)) }' input.csv > output.csv

For example %-14s means a left-justified 14 character field.

Once again Annihilannic, thank you very much.

Your awk solution worked brilliantly.

Is the following understanding of your correct?

%-14s%-4s

This means a 14character field followed by a 4 character field.

 
substr($0,5),substr($0,1,4))

This means, the 14 character field of the output file is made up of the substring from position 5 onwards from the input file and the 4 character field of the output file consists of the substring from position 1-4 from the original input file.

Sorry if it's a silly question but I'd much rather learn as I go instead of just blindly copying code without learning anything.

Spot on.

The first parameter of printf() is the "format specification" which contains placeholders (prefixed by %) for the variable parts. The remaining parameters are the variables which will fill those placeholders in the output string.

It's very useful to know because you'll find it in many languages now... originally from C.