Need to concatenate spuriously separated lines

lemele · December 23, 2011, 10:35am

Given the pattern below:

3113296571|NULL|NULL|NULL||N| 1| 0| 926667| 1001036| 0| 3076120438|NULL|NULL|NULL|NULL|DUE FOR NEW CONSENT!|N|NULL|
10198318|2011-07-25-12.34.02.786000|NULL|NULL|NULL| 0
3113336478|NULL|NULL|NULL||N| 1| 0| 931871| 1001315| 1000997| 3076140580|NULL|NULL|NULL|NULL|Cosmetic; Patient has no Defib
rillator or Pacemaker; Location: Eisenberg 7A; Dx/Indication: Cyst; Excision: 1-2 sites = 1 hour; Additional information (specify): Please put on my excisi
on calendar for excision of cyst on chest and back. Thanks!
|N|NULL| 11418290|2011-07-11-14.14.02.226000|NULL|NULL|NULL| 0
3113338127|NULL|NULL|NULL||N| 1| 0| 927074| 1001109| 0| 3076141370|NULL|NULL|NULL|NULL|Return Instructions (other): v
a only 1st available when patient returns from vacation - thanks!|C|1 month out, not 1 week| 15462510|2011-07-11-14.31.26.803000|NULL|NULL|NULL|
0

whereas the bold-font records are "normal" (starting with an integer field, having NO imbedded new lines within the record), non-bold-font entry is "abnormal" having an embedded "\n" (new line) character [in this example, it appears after "Thanks!"(most probably entered by a data-entry clerk while inputting a record). I need to concatenate the lines starting with "^|" with the lines immediately preceding them (using sed/awk/Perl)

Your help is greatly appreciated.

methyl · December 23, 2011, 11:08am

It must be easier to fix the data while it is still in the database and while you have the whole field available.
What software wrote this file?

lemele · December 23, 2011, 11:36am

Methyl,

Unfortunately, as the data is coming from the DB2 database (the database we don't own), it won't be easier to to fix the data at the source; therefore, I was seeking a programmatic solution from the outside. Could you recommend one?

lemele

methyl · December 23, 2011, 12:06pm

Do you have control over the data extract program?

Referrring to your original post, I'm having trouble seeing the record structure or even deciding where one record stops and the next record starts. Some records seem to end with a pipe character and some with something else. The number of fields in each record also varies.
Is your example data an accurate representation of the record format. i.e. A random number of pipe-delimited fields? It would be so much easier if there were a fixed number of fields in each record.

vgersh99 · December 23, 2011, 12:28pm

Hard to see the record/line breaks withOUT the proper code tagging....

nawk -F'|' '$1~"^[0-9][0-9]*$" {printf("%c%s", (FNR==1)?"":ORS,$0);next}{printf $0}' myFile

lemele · December 23, 2011, 12:39pm

Methyl,

Here is a re-posted set of records; they do have a fixed format (fixed number of fields (25), "|" separated); the only problem is that some of these records get disjointed when imported by the AIX from DB2 (the data transport program is 'isql' utility by Sybase). Records in bold type are intact (without new line); record in non-bold has one embedded new line character that I need to get rid of thus concatenating the disjointed lines. The anchor pattern is "^|"

3113296571|NULL|NULL|NULL||N| 1| 0| 926667| 1001036| 0| 3076120438|NULL|NULL|NULL|NULL|DUE FOR NEW CONSENT!|N|NULL|
10198318|2011-07-25-12.34.02.786000|NULL|NULL|NULL| 0

3113336478|NULL|NULL|NULL||N| 1| 0| 931871| 1001315| 1000997|
3076140580|NULL|NULL|NULL|NULL|Cosmetic; Patient has no Defibrillator or Pacemaker; Location: Eisenberg 7A; Dx/Indication: Cyst; Excision: 1-2 sites = 1 hour; Additional information (specify): Please put on my excision calendar for excision of cyst on chest and back. Thanks![Line splits here]
|N|NULL| 11418290|2011-07-11-14.14.02.226000|NULL|NULL|NULL| 0

3113338127|NULL|NULL|NULL||N| 1| 0| 927074| 1001109| 0| 3076141370|NULL|NULL|NULL|NULL|Return Instructions (other): v
a only 1st available when patient returns from vacation - thanks!|C|1 month out, not 1week| 15462510|2011-07-11-14.31.26.803000|NULL|NULL|NULL|0

Anxiously awaiting a solution

Lemele

methyl · December 23, 2011, 12:41pm

I agree with vgersh99. I believe that the way the data sample has been posted into the original post has introduced more extra line terminators!

The original data fault is the sort of effect you get if people cut/paste data from another Window (like from an email). It's all too common.

lemele · December 23, 2011, 1:03pm

vgersh99,

Thanks very much for the reply. It almost worked; the nawk choked, however as it found some instances of other spurious new lines embedded in the fields. Now, the problem needs to be formulated slightly differently:

with the original set of data, find any spurious new line characters (not necessarily anchored to "^|"). The only legitimate new lines are the ones that immediately precede the primary key integer field (3110103087 and the like, having 10 digits)

Awaiting a slightly modified solution

methyl · December 23, 2011, 1:05pm

A Shell method. Keep building the record until we have 24 pipe delimiters.
Allows for more than one extra line break in any one record.

#!/bin/ksh
saveline=""
cat filename.txt | while read line
do
        line="${saveline}${line}"
        count=`echo "${line}"|tr -cd '|'|wc -c` # Number of pipes
        if [ $count -ne 24 ]
        then
                saveline="${line}"
                continue
        else
                saveline=""
                echo "${line}"
        fi
done

Ps. This would be better in awk !

vgersh99 · December 23, 2011, 1:47pm

lemele:

Methyl,

Here is a re-posted set of records; they do have a fixed format (fixed number of fields (25), "|" separated); the only problem is that some of these records get disjointed when imported by the AIX from DB2 (the data transport program is 'isql' utility by Sybase). Records in bold type are intact (without new line); record in non-bold has one embedded new line character that I need to get rid of thus concatenating the disjointed lines. The anchor pattern is "^|"

3113296571|NULL|NULL|NULL||N| 1| 0| 926667| 1001036| 0| 3076120438|NULL|NULL|NULL|NULL|DUE FOR NEW CONSENT!|N|NULL|
10198318|2011-07-25-12.34.02.786000|NULL|NULL|NULL| 0

3113336478|NULL|NULL|NULL||N| 1| 0| 931871| 1001315| 1000997|
3076140580|NULL|NULL|NULL|NULL|Cosmetic; Patient has no Defibrillator or Pacemaker; Location: Eisenberg 7A; Dx/Indication: Cyst; Excision: 1-2 sites = 1 hour; Additional information (specify): Please put on my excision calendar for excision of cyst on chest and back. Thanks![Line splits here]
|N|NULL| 11418290|2011-07-11-14.14.02.226000|NULL|NULL|NULL| 0

3113338127|NULL|NULL|NULL||N| 1| 0| 927074| 1001109| 0| 3076141370|NULL|NULL|NULL|NULL|Return Instructions (other): v
a only 1st available when patient returns from vacation - thanks!|C|1 month out, not 1week| 15462510|2011-07-11-14.31.26.803000|NULL|NULL|NULL|0

Anxiously awaiting a solution

Lemele

nawk -F'|' '$1~"^[0-9][0-9]*$" && length($1)==10 {printf("%c%s", (FNR==1)?"":ORS,$0);next}{printf $0}' myFile

lemele · December 23, 2011, 4:58pm

gersh99,

upon further examination of the input file I have encountered spurious new lines that were not anchored to pattern "^|". In such cases, your "nawk" one-liner choked with the following message:

awk: 0602-565 There are not enough parameters in printf statement Both Eyes VA TA Dilate: 1%M Dilate: 2.5%N Contacts: previous Indication: Exam Eye [V72.0]|N||15589218|2011.11.1611:01:16:436||||0.
The input line number is 1742.
The source line number is 1.

The task at hand is to find (and replace) any spurious new lines NOT IMMEDIATELY preceding the integer 10-digit primary key

Please help.

lemele

binlib · December 23, 2011, 8:56pm

Assuming that there are no embedded pipes and all the records have the same number of "|"s, you can fix the file much easier with awk or even sed.

lemele · December 27, 2011, 9:56am

binlib,

thanks for your comment; can you suggest a concrete solution to the problem at hand?

Thanks

methyl · December 27, 2011, 10:18am

Does my Shell version work?
If it does, this will give the "awk" programmers a program design.

vgersh99 · December 27, 2011, 12:03pm

Interesting - works just fine with AiX's awk.
Try:

nawk -F'|' '$1~"^[0-9][0-9]*$" && length($1)==10 {printf("%c%s", (FNR==1)?"":ORS,$0);next}{printf("%s", $0)}' myFile

lemele · December 27, 2011, 12:51pm

gersh99,

I tried the modified version of your "nawk" one-liner and it worked. Interestingly, the two files that are needed for comparison come from 2 RDBMS's (DB2 & Sybase), the resulting data files have different "number" of lines in them (even though the number of primary keys in the original SQL files is the same); this signifies that the spuriousness of the new line characters is severe and unpredictable.

Thank you so much for your continued help.

Lemele

doesntcompute2k · December 28, 2011, 1:21am

My thought:

 
#!/usr/bin/perl
my ($dataline, $record, $buildline);
$buildline = "";
$dataline = 0;
open(INFILE, "<", "datafile.txt") or die "Cannot open datafile.txt: $!";
while( $record = <INFILE>) {
       chomp($record);
       $dataline++;
       if ($record =~ m/^\d.*0$/) {
          # full line
          if ($buildline ne "") {         # Was there partial previous line?
             print "$buildline\n";
          }
          print "$record\n";
          $buildline = "";            # Reset line buffer for next fetch
       }
       else {
            $buildline = $buildline . $record;
       }
}
close(INFILE);

dc++