Removing \n within a record (awk/gawk)

CKT_newbie88 · May 12, 2009, 1:28pm

I am using a solution that was provided by a member:

awk '{s=$0;if(length(s) < 700){getline; s=s " " $0}printf("%s\n",s)}'

This scans through a file and removes '\n' within a record but not the record delimiter.

However, some records contain MULTIPLE instances of '\n'. How do I modify this code to handle multiple instances of \n without removing the record delimiter?

It is a fixed-width file; each record is 700 characters long.

Please help.
Thanks
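
One way to extend the one-liner above to any number of embedded newlines is to keep appending lines until the record reaches the expected width, instead of calling getline once. The following is only a sketch under assumptions: the function name join_fixed_width is hypothetical, a POSIX awk (nawk, gawk) is assumed because of -v, and broken lines are joined with a single space as in the original one-liner.

```shell
# Hypothetical helper: join physical lines until the record reaches the
# expected width. The width defaults to 700, the record length from the post.
join_fixed_width() {
  awk -v width="${1:-700}" '
    {
      # Append to the unfinished record (space-separated), or start a new one.
      if (have) rec = rec " " $0
      else { rec = $0; have = 1 }
      # The record is complete once it has reached the expected width.
      if (length(rec) >= width) { print rec; rec = ""; have = 0 }
    }
    # Flush a trailing record that never reached the full width.
    END { if (have) print rec }
  '
}
```

It reads standard input, for example: join_fixed_width 700 < file. If the embedded newlines should simply be dropped rather than replaced, change " " to "" in the append step.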

ghostdog74 · May 12, 2009, 8:03pm

Show a sample of the file.

CKT_newbie88 · May 13, 2009, 9:13am

There are actually 2 issues:

A fixed-width record with multiple embedded \n instances, with \n as its record delimiter
A comma-delimited file with \n embedded in the record, with \n as its record delimiter

Expected Result
1234abcd4569My Name is JackSmithJS1231900-01-01

Example (Fixed Width - with multiple \n within record)
1234abcd4569My Name is Jack
SmithJS123
1900-01-01

Example (Comma Delimited - with multiple \n within record)
1234,"abcd",4569,"My Name is Jack
Smith","JS123
",1900-01-01

In both cases, I do not want to remove the \n that serves as the record delimiter.

devtakh · May 13, 2009, 12:01pm

Can you provide the expected output for the above 2 inputs?

Franklin52 · May 13, 2009, 12:06pm

Give an exact format of your file, not only lines you want to combine. Post an example within code tags.

Regards

CKT_newbie88 · May 13, 2009, 1:36pm

Due to sensitive data, the following is a sample of 4 lines of data (in reality, the file contains 18 fields, with the 11th field being the problematic one, since it is a Varchar(256) free-form field). All records have \n as their record delimiter.

RecordID Integer
NameID Char
SubRecID Integer
Desc Char100
UserCD Char
Date Date

Sample Rows (FTP'd from Mainframe) - Comma Delimited
2345,"wxyz",2345,"Her Name is Nancy Drew","ND001","1900-01-01"
1234,"abcd",4569,"My Name is
Jack
Smith","JS123","1900-01-01"
5667,"gghd",9984,"His Name is
Joe Hardy","JH007","1900-01-01"
3333,"aaaa",9999,"Our Group is Excel Point","EP009","1900-01-01"

First Row shows standard format (no issue)
Second Row shows 2 embedded \n instances
Third Row shows 1 embedded \n instance
Fourth Row shows standard format (no issue)

Expected Output:
2345,"wxyz",2345,"Her Name is Nancy Drew","ND001","1900-01-01"
1234,"abcd",4569,"My Name is Jack Smith","JS123","1900-01-01"
5667,"gghd",9984,"His Name is Joe Hardy","JH007","1900-01-01"
3333,"aaaa",9999,"Our Group is Excel Point","EP009","1900-01-01"

Thanks

Franklin52 · May 13, 2009, 1:45pm

Try this:

awk -F, '{printf("%s%s", $0,$NF ~ /[0-9]-[0-9]/?RS:"")}' file

Regards

CKT_newbie88 · May 13, 2009, 2:08pm

Hi Franklin52,

I tried the code you supplied - it had an error:

awk: syntax error near line 1
awk: illegal statement near line 1

I also tried it with 'gawk' and it modified some rows, but not all, especially those records that have multiple instances of \n within them (which essentially spreads 1 row across 3 or 4 lines).
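
Those two error messages are typical of older awk implementations that lack the ?: operator (for example the default /usr/bin/awk on some Solaris systems). Whether that is the cause here is an assumption, since the platform is not stated. A sketch of the same logic written with a plain if statement, wrapped in a hypothetical function named join_on_date:

```shell
# Hypothetical helper: same logic as the one-liner above, without ?: .
join_on_date() {
  awk -F, '{
    # Print the line without its trailing newline.
    printf("%s", $0)
    # End the record only when the last comma-separated field looks like a
    # date (digit, hyphen, digit); otherwise the next line continues it.
    if ($NF ~ /[0-9]-[0-9]/) printf("\n")
  }' "$@"
}
```

Usage: join_on_date file. Like the original, it joins broken lines with no separator and relies on the last field of every complete record being a date.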

Franklin52 · May 13, 2009, 2:17pm

This is the output I get:

$ cat file
2345,"wxyz",2345,"Her Name is Nancy Drew","ND001","1900-01-01"
1234,"abcd",4569,"My Name is
Jack
Smith","JS123","1900-01-01"
5667,"gghd",9984,"His Name is
Joe Hardy","JH007","1900-01-01"
3333,"aaaa",9999,"Our Group is Excel Point","EP009","1900-01-01"
$
$ awk -F, '{printf("%s%s", $0,$NF ~ /[0-9]-[0-9]/?RS:"")}' file
2345,"wxyz",2345,"Her Name is Nancy Drew","ND001","1900-01-01"
1234,"abcd",4569,"My Name isJackSmith","JS123","1900-01-01"
5667,"gghd",9984,"His Name isJoe Hardy","JH007","1900-01-01"
3333,"aaaa",9999,"Our Group is Excel Point","EP009","1900-01-01"
$

CKT_newbie88 · May 13, 2009, 3:06pm

Hi Franklin52,

I tried the code again, and the same result occurs.

Is there a way to add a space when we combine the rows?

1234,"abcd",4569,"My Name is Jack Smith","JS123","1900-01-01"
5667,"gghd",9984,"His Name is Joe Hardy","JH007","1900-01-01"

The example given is what I am trying to replicate - perhaps the real file (18 fields with 3 Varchar(256) fields) is not properly depicted in the example.

Franklin52 · May 13, 2009, 3:12pm

That's why I asked for the exact format of your file. To add a space between the broken lines:

awk -F, '{printf("%s%s", $0,$NF ~ /[0-9]-[0-9]/?RS:" ")}' file
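
The test on the last field only works when every complete record ends in a date and no broken fragment happens to end the same way. If the real 18-field file does not fit that pattern, a possible alternative is to count double quotes: a record is complete when the running number of quotes is even. This is a sketch, not a full CSV parser; the function name join_quoted_csv is hypothetical, and it assumes a POSIX awk (for gsub) and that every embedded newline sits inside a double-quoted field.

```shell
# Hypothetical helper: join lines until the double quotes are balanced.
join_quoted_csv() {
  awk '
    {
      # Append this physical line to the pending record, space-separated.
      if (pending) rec = rec " " $0
      else rec = $0
      # gsub returns the number of matches; replacing each quote with itself
      # on a copy counts the quotes without altering the data.
      tmp = $0
      quotes += gsub(/"/, "&", tmp)
      # An even total means no quoted field is still open: emit the record.
      if (quotes % 2 == 0) { print rec; rec = ""; quotes = 0; pending = 0 }
      else pending = 1
    }
    # Flush an unterminated record at end of input.
    END { if (pending) print rec }
  ' "$@"
}
```

Usage: join_quoted_csv file. Quotes escaped by doubling ("") do not change the parity, so they should not confuse the count.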