Splitting delimited string into rows

techmoris · May 18, 2015, 3:22pm

Hi,

I have a requirement involving 50-60 million records in which we need to split a delimited string (the delimiter is a newline) into rows.

Source Data:

SerialID UnidID GENRE
100 A11 AAAchar(10)BBB
200 B11 CCCchar(10)DDDchar(10)ZZZZ

The field 'GENRE' is a string with newline as the delimiter, and I am not sure how many values it may contain.

Please advise!

Thanks

RudiC · May 18, 2015, 3:42pm

Please use code tags as required by forum rules!

I guess this is from MS EXCEL where <NL> (0x0A, \n) is used as a marker to split strings into rows within a cell?

Taking your sample into *nix makes it look like

SerialID UnidID GENRE
100 A11 AAA
BBB
200 B11 CCC
DDD
ZZZZ

WHAT exactly do you want to split into rows?

techmoris · May 18, 2015, 3:47pm

Hi,

I have a requirement involving 50-60 million records in which we need to split a delimited string (the delimiter is a newline) into rows.

Source Data

SerialID UnidID GENRE
100 A11 AAAchar(10)BBB
200 B11 CCCchar(10)DDDchar(10)ZZZZ

Expected Output

SerialID UnidID GENRE
100 A11 AAA
100 A11 BBB
200 B11 CCC
200 B11 DDD
200 B11 ZZZZ

The field 'GENRE' is a string with newline as the delimiter, and I am not sure how many values it may contain.

Please advise!

Thanks

RudiC · May 18, 2015, 3:51pm

Are you sure the input data looks like you posted, and, if yes, are you sure you're on *nix?

techmoris · May 18, 2015, 3:53pm

The field GENRE can have any number of values separated by a newline delimiter.

RudiC · May 18, 2015, 3:55pm

As I said, newline has a special meaning on *nix.

Assuming my suspicion (see post #2) is true, try:

awk 'NR==1 {print; next} NF==3 {TMP=$1 OFS $2} {print TMP OFS $NF}' file3
SerialID UnidID GENRE
100 A11 AAA
100 A11 BBB
200 B11 CCC
200 B11 DDD
200 B11 ZZZZ
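
For anyone reading along, here is a commented sketch of how the one-liner above appears to work. It assumes the embedded newlines have already broken each record across several physical lines (as in post #2), that GENRE values contain no whitespace, and that a continuation line never happens to have exactly three fields. The temporary file and the variable names `tmp`, `out`, and `key` are illustrative only, not part of the original command.

```shell
# Recreate the sample input as it looks on *nix: each embedded newline
# has split the record, so continuation lines carry only the GENRE value.
tmp=$(mktemp)
printf '%s\n' \
    'SerialID UnidID GENRE' \
    '100 A11 AAA' \
    'BBB' \
    '200 B11 CCC' \
    'DDD' \
    'ZZZZ' > "$tmp"

out=$(awk '
NR == 1 { print; next }        # pass the header line through unchanged
NF == 3 { key = $1 OFS $2 }    # a full record: remember SerialID and UnidID
{ print key OFS $NF }          # every data line: remembered key + last field
' "$tmp")

printf '%s\n' "$out"
rm -f "$tmp"
```

Because awk reads one line at a time and only keeps the current key in memory, this approach should scale to the 50-60 million records mentioned, though that has not been measured here. If GENRE values can contain spaces, the `NF == 3` test would no longer be a reliable way to spot the start of a record and a different rule would be needed.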