3113296571|NULL|NULL|NULL||N| 1| 0| 926667| 1001036| 0| 3076120438|NULL|NULL|NULL|NULL|DUE FOR NEW CONSENT!|N|NULL|
10198318|2011-07-25-12.34.02.786000|NULL|NULL|NULL| 0
3113336478|NULL|NULL|NULL||N| 1| 0| 931871| 1001315| 1000997| 3076140580|NULL|NULL|NULL|NULL|Cosmetic; Patient has no Defib
rillator or Pacemaker; Location: Eisenberg 7A; Dx/Indication: Cyst; Excision: 1-2 sites = 1 hour; Additional information (specify): Please put on my excisi
on calendar for excision of cyst on chest and back. Thanks!
|N|NULL| 11418290|2011-07-11-14.14.02.226000|NULL|NULL|NULL| 0
3113338127|NULL|NULL|NULL||N| 1| 0| 927074| 1001109| 0| 3076141370|NULL|NULL|NULL|NULL|Return Instructions (other): v
a only 1st available when patient returns from vacation - thanks!|C|1 month out, not 1 week| 15462510|2011-07-11-14.31.26.803000|NULL|NULL|NULL|
0
whereas the bold-font records are "normal" (starting with an integer field, having NO imbedded new lines within the record), non-bold-font entry is "abnormal" having an embedded "\n" (new line) character [in this example, it appears after "Thanks!"(most probably entered by a data-entry clerk while inputting a record). I need to concatenate the lines starting with "^|" with the lines immediately preceding them (using sed/awk/Perl)
Unfortunately, as the data is coming from the DB2 database (the database we don't own), it won't be easier to to fix the data at the source; therefore, I was seeking a programmatic solution from the outside. Could you recommend one?
Do you have control over the data extract program?
Referrring to your original post, I'm having trouble seeing the record structure or even deciding where one record stops and the next record starts. Some records seem to end with a pipe character and some with something else. The number of fields in each record also varies.
Is your example data an accurate representation of the record format. i.e. A random number of pipe-delimited fields? It would be so much easier if there were a fixed number of fields in each record.
Here is a re-posted set of records; they do have a fixed format (fixed number of fields (25), "|" separated); the only problem is that some of these records get disjointed when imported by the AIX from DB2 (the data transport program is 'isql' utility by Sybase). Records in bold type are intact (without new line); record in non-bold has one embedded new line character that I need to get rid of thus concatenating the disjointed lines. The anchor pattern is "^|"
3113296571|NULL|NULL|NULL||N| 1| 0| 926667| 1001036| 0| 3076120438|NULL|NULL|NULL|NULL|DUE FOR NEW CONSENT!|N|NULL|
10198318|2011-07-25-12.34.02.786000|NULL|NULL|NULL| 0
3113336478|NULL|NULL|NULL||N| 1| 0| 931871| 1001315| 1000997|
3076140580|NULL|NULL|NULL|NULL|Cosmetic; Patient has no Defibrillator or Pacemaker; Location: Eisenberg 7A; Dx/Indication: Cyst; Excision: 1-2 sites = 1 hour; Additional information (specify): Please put on my excision calendar for excision of cyst on chest and back. Thanks![Line splits here]
|N|NULL| 11418290|2011-07-11-14.14.02.226000|NULL|NULL|NULL| 0
3113338127|NULL|NULL|NULL||N| 1| 0| 927074| 1001109| 0| 3076141370|NULL|NULL|NULL|NULL|Return Instructions (other): v
a only 1st available when patient returns from vacation - thanks!|C|1 month out, not 1week| 15462510|2011-07-11-14.31.26.803000|NULL|NULL|NULL|0
Thanks very much for the reply. It almost worked; the nawk choked, however as it found some instances of other spurious new lines embedded in the fields. Now, the problem needs to be formulated slightly differently:
with the original set of data, find any spurious new line characters (not necessarily anchored to "^|"). The only legitimate new lines are the ones that immediately precede the primary key integer field (3110103087 and the like, having 10 digits)
A Shell method. Keep building the record until we have 24 pipe delimiters.
Allows for more than one extra line break in any one record.
#!/bin/ksh
saveline=""
cat filename.txt | while read line
do
line="${saveline}${line}"
count=`echo "${line}"|tr -cd '|'|wc -c` # Number of pipes
if [ $count -ne 24 ]
then
saveline="${line}"
continue
else
saveline=""
echo "${line}"
fi
done
upon further examination of the input file I have encountered spurious new lines that were not anchored to pattern "^|". In such cases, your "nawk" one-liner choked with the following message:
awk: 0602-565 There are not enough parameters in printf statement Both Eyes VA TA Dilate: 1%M Dilate: 2.5%N Contacts: previous Indication: Exam Eye [V72.0]|N||15589218|2011.11.1611:01:16:436||||0.
The input line number is 1742.
The source line number is 1.
The task at hand is to find (and replace) any spurious new lines NOT IMMEDIATELY preceding the integer 10-digit primary key
I tried the modified version of your "nawk" one-liner and it worked. Interestingly, the two files that are needed for comparison come from 2 RDBMS's (DB2 & Sybase), the resulting data files have different "number" of lines in them (even though the number of primary keys in the original SQL files is the same); this signifies that the spuriousness of the new line characters is severe and unpredictable.