Need to concatenate spuriously separated lines

Given the pattern below:

3113296571|NULL|NULL|NULL||N| 1| 0| 926667| 1001036| 0| 3076120438|NULL|NULL|NULL|NULL|DUE FOR NEW CONSENT!|N|NULL|
10198318|2011-07-25-12.34.02.786000|NULL|NULL|NULL| 0
3113336478|NULL|NULL|NULL||N| 1| 0| 931871| 1001315| 1000997| 3076140580|NULL|NULL|NULL|NULL|Cosmetic; Patient has no Defib
rillator or Pacemaker; Location: Eisenberg 7A; Dx/Indication: Cyst; Excision: 1-2 sites = 1 hour; Additional information (specify): Please put on my excisi
on calendar for excision of cyst on chest and back. Thanks!
|N|NULL| 11418290|2011-07-11-14.14.02.226000|NULL|NULL|NULL| 0
3113338127|NULL|NULL|NULL||N| 1| 0| 927074| 1001109| 0| 3076141370|NULL|NULL|NULL|NULL|Return Instructions (other): v
a only 1st available when patient returns from vacation - thanks!|C|1 month out, not 1 week| 15462510|2011-07-11-14.31.26.803000|NULL|NULL|NULL|
0

whereas the bold-font records are "normal" (starting with an integer field, having NO imbedded new lines within the record), non-bold-font entry is "abnormal" having an embedded "\n" (new line) character [in this example, it appears after "Thanks!"(most probably entered by a data-entry clerk while inputting a record). I need to concatenate the lines starting with "^|" with the lines immediately preceding them (using sed/awk/Perl)

Your help is greatly appreciated.

It must be easier to fix the data while it is still in the database and while you have the whole field available.
What software wrote this file?

Methyl,

Unfortunately, as the data is coming from the DB2 database (the database we don't own), it won't be easier to to fix the data at the source; therefore, I was seeking a programmatic solution from the outside. Could you recommend one?

lemele

Do you have control over the data extract program?

Referrring to your original post, I'm having trouble seeing the record structure or even deciding where one record stops and the next record starts. Some records seem to end with a pipe character and some with something else. The number of fields in each record also varies.
Is your example data an accurate representation of the record format. i.e. A random number of pipe-delimited fields? It would be so much easier if there were a fixed number of fields in each record.

Hard to see the record/line breaks withOUT the proper code tagging....

nawk -F'|' '$1~"^[0-9][0-9]*$" {printf("%c%s", (FNR==1)?"":ORS,$0);next}{printf $0}' myFile
1 Like

Methyl,

Here is a re-posted set of records; they do have a fixed format (fixed number of fields (25), "|" separated); the only problem is that some of these records get disjointed when imported by the AIX from DB2 (the data transport program is 'isql' utility by Sybase). Records in bold type are intact (without new line); record in non-bold has one embedded new line character that I need to get rid of thus concatenating the disjointed lines. The anchor pattern is "^|"

3113296571|NULL|NULL|NULL||N| 1| 0| 926667| 1001036| 0| 3076120438|NULL|NULL|NULL|NULL|DUE FOR NEW CONSENT!|N|NULL|
10198318|2011-07-25-12.34.02.786000|NULL|NULL|NULL| 0

3113336478|NULL|NULL|NULL||N| 1| 0| 931871| 1001315| 1000997|
3076140580|NULL|NULL|NULL|NULL|Cosmetic; Patient has no Defibrillator or Pacemaker; Location: Eisenberg 7A; Dx/Indication: Cyst; Excision: 1-2 sites = 1 hour; Additional information (specify): Please put on my excision calendar for excision of cyst on chest and back. Thanks![Line splits here]
|N|NULL| 11418290|2011-07-11-14.14.02.226000|NULL|NULL|NULL| 0

3113338127|NULL|NULL|NULL||N| 1| 0| 927074| 1001109| 0| 3076141370|NULL|NULL|NULL|NULL|Return Instructions (other): v
a only 1st available when patient returns from vacation - thanks!|C|1 month out, not 1week| 15462510|2011-07-11-14.31.26.803000|NULL|NULL|NULL|0

Anxiously awaiting a solution

Lemele

I agree with vgersh99. I believe that the way the data sample has been posted into the original post has introduced more extra line terminators!

The original data fault is the sort of effect you get if people cut/paste data from another Window (like from an email). It's all too common.

vgersh99,

Thanks very much for the reply. It almost worked; the nawk choked, however as it found some instances of other spurious new lines embedded in the fields. Now, the problem needs to be formulated slightly differently:

with the original set of data, find any spurious new line characters (not necessarily anchored to "^|"). The only legitimate new lines are the ones that immediately precede the primary key integer field (3110103087 and the like, having 10 digits)

Awaiting a slightly modified solution

A Shell method. Keep building the record until we have 24 pipe delimiters.
Allows for more than one extra line break in any one record.

#!/bin/ksh
saveline=""
cat filename.txt | while read line
do
        line="${saveline}${line}"
        count=`echo "${line}"|tr -cd '|'|wc -c` # Number of pipes
        if [ $count -ne 24 ]
        then
                saveline="${line}"
                continue
        else
                saveline=""
                echo "${line}"
        fi
done

Ps. This would be better in awk !

nawk -F'|' '$1~"^[0-9][0-9]*$" && length($1)==10 {printf("%c%s", (FNR==1)?"":ORS,$0);next}{printf $0}' myFile

gersh99,

upon further examination of the input file I have encountered spurious new lines that were not anchored to pattern "^|". In such cases, your "nawk" one-liner choked with the following message:

awk: 0602-565 There are not enough parameters in printf statement Both Eyes VA TA Dilate: 1%M Dilate: 2.5%N Contacts: previous Indication: Exam Eye [V72.0]|N||15589218|2011.11.1611:01:16:436||||0.
The input line number is 1742.
The source line number is 1.

The task at hand is to find (and replace) any spurious new lines NOT IMMEDIATELY preceding the integer 10-digit primary key

Please help.

lemele

Assuming that there are no embedded pipes and all the records have the same number of "|"s, you can fix the file much easier with awk or even sed.

binlib,

thanks for your comment; can you suggest a concrete solution to the problem at hand?

Thanks

Does my Shell version work?
If it does, this will give the "awk" programmers a program design.

Interesting - works just fine with AiX's awk.
Try:

nawk -F'|' '$1~"^[0-9][0-9]*$" && length($1)==10 {printf("%c%s", (FNR==1)?"":ORS,$0);next}{printf("%s", $0)}' myFile

gersh99,

I tried the modified version of your "nawk" one-liner and it worked. Interestingly, the two files that are needed for comparison come from 2 RDBMS's (DB2 & Sybase), the resulting data files have different "number" of lines in them (even though the number of primary keys in the original SQL files is the same); this signifies that the spuriousness of the new line characters is severe and unpredictable.

Thank you so much for your continued help.

Lemele

My thought:

 
#!/usr/bin/perl
my ($dataline, $record, $buildline);
$buildline = "";
$dataline = 0;
open(INFILE, "<", "datafile.txt") or die "Cannot open datafile.txt: $!";
while( $record = <INFILE>) {
       chomp($record);
       $dataline++;
       if ($record =~ m/^\d.*0$/) {
          # full line
          if ($buildline ne "") {         # Was there partial previous line?
             print "$buildline\n";
          }
          print "$record\n";
          $buildline = "";            # Reset line buffer for next fetch
       }
       else {
            $buildline = $buildline . $record;
       }
}
close(INFILE);

dc++