How to add line to previous line if not a number?

carlr · January 16, 2016, 12:24pm

Hi,
I am trying to compare 2 lists. However, one of these lists has to be taken from a .pdf file. When I copy the text into a .txt document there are formatting errors which I need to correct. The document is long (~10,000 lines) so I need to script the re-formatting.

Currently my file looks like:

862.0
DIAPHRAGM, WITHOUT MENTION OF OPEN WOUND INTO CAVITY
862.1
DIAPHRAGM, WITH OPEN WOUND INTO CAVITY
862.3
OTHER SPECIFIED INTRATHORACIC ORGAN, WITH OPEN WOUND INTO CAVITY
862.8
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF
 OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT
863.0

But I want it to look like the below, i.e. alternating between a line of digits and a line of text. I believe that there should be a way of making a SED or AWK statement which looks at each line, and if there are 2 consecutive lines starting with a letter A-Z, moves the second of these to be after the first.

Can anyone help me to do this?

862.0
DIAPHRAGM, WITHOUT MENTION OF OPEN WOUND INTO CAVITY
862.1
DIAPHRAGM, WITH OPEN WOUND INTO CAVITY
862.3
OTHER SPECIFIED INTRATHORACIC ORGAN, WITH OPEN WOUND INTO CAVITY
862.8
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT
863.0

RavinderSingh13 · January 16, 2016, 12:56pm

Hello carlr,

Please use code tags for the commands/code/inputs you use in your posts, as per the forum rules. Could you please try the following and let me know if it helps.

awk '($0 ~ /^[[:alpha:]]/){A=$0;getline;if($0 ~ /^[[:alpha:]]/){if(A){print A OFS $0;A=""}} else {print A ORS $0;A=""};next}{print}'  Input_file

Output will be as follows.

862.0
DIAPHRAGM, WITHOUT MENTION OF OPEN WOUND INTO CAVITY
862.1
DIAPHRAGM, WITH OPEN WOUND INTO CAVITY
862.3
OTHER SPECIFIED INTRATHORACIC ORGAN, WITH OPEN WOUND INTO CAVITY
862.8
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT
863.0

EDIT: Let's say we have an Input_file where there may be two or more consecutive occurrences of lines starting with letters; then the following code may help.
The following is the Input_file:

862.0
DIAPHRAGM, WITHOUT MENTION OF OPEN WOUND INTO CAVITY
862.1
DIAPHRAGM, WITH OPEN WOUND INTO CAVITY
862.3
OTHER SPECIFIED INTRATHORACIC ORGAN, WITH OPEN WOUND INTO CAVITY
862.8
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF
OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT
R. Singh is a bad boy.
R. Singh likes Iron man.
863.0
R. Singh loves UNIX.com

Then the following is the code for it.

awk 'FNR==NR{MAX++;next} {if($0 ~ /^[[:alpha:]]/){while($0 ~ /^[[:alpha:]]/ && FNR<MAX){Q=Q?Q OFS $0:$0;getline};if(Q && $0 ~ /^[[:alpha:]]/){print Q OFS $0} else {print Q ORS $0};Q=""}}'  Input_file  Input_file

Output will be as follows.

DIAPHRAGM, WITHOUT MENTION OF OPEN WOUND INTO CAVITY
862.1
DIAPHRAGM, WITH OPEN WOUND INTO CAVITY
862.3
OTHER SPECIFIED INTRATHORACIC ORGAN, WITH OPEN WOUND INTO CAVITY
862.8
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT R. Singh is a bad boy. R. Singh likes Iron man.
863.0
R. Singh loves UNIX.com

Thanks,
R. Singh

Scrutinizer · January 16, 2016, 8:02pm

Try:

awk 'END{if(ORS==x)printf RS} {ORS=x} /^[0-9]/{if(NR>1)printf RS; ORS=RS}1' file

anbu23 · January 17, 2016, 12:53am

$ awk ' /^[^0-9]/{ getline a; $0=$0 (a ~ /^[^0-9]/ ? a: "\n"a) } 1' file
862.0
DIAPHRAGM, WITHOUT MENTION OF OPEN WOUND INTO CAVITY
862.1
DIAPHRAGM, WITH OPEN WOUND INTO CAVITY
862.3
OTHER SPECIFIED INTRATHORACIC ORGAN, WITH OPEN WOUND INTO CAVITY
862.8
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT
863.0

sed "/^[^0-9]/{N;s/\n\([^0-9]\)/\1/;}" file

Scrutinizer · January 17, 2016, 4:01am

Some of the solutions will fail if there is a single last line that does not start with a number:

862.0
DIAPHRAGM, WITHOUT MENTION OF OPEN WOUND INTO CAVITY
862.1
DIAPHRAGM, WITH OPEN WOUND INTO CAVITY
862.3
OTHER SPECIFIED INTRATHORACIC ORGAN, WITH OPEN WOUND INTO CAVITY
862.8
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF
 OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT
863.0
FOO

anbu23's awk command will produce a duplicate of the second-to-last line:

[..]
INJURY TO GASTROINTESTINAL TRACT
863.0
FOO
863.0

anbu23's sed command will leave out the last line:

[..]
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT
863.0

Using $!N instead of N should fix that...

--

RavinderSingh13's first awk command does not seem to work; I get:

[..]
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF
 OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT
863.0
FOO FOO

[..]

awk 'FNR==NR{MAX++;next} {if($0 ~ /^[[:alpha:]]/){while($0 ~ /^[[:alpha:]]/ && FNR<MAX){Q=Q?Q OFS $0:$0;getline};if(Q && $0 ~ /^[[:alpha:]]/){print Q OFS $0} else {print Q ORS $0};Q=""}}'  Input_file  Input_file

This does not seem to work either, I get:

[..]
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF
 OPEN WOUND INTO CAVITY
INJURY TO GASTROINTESTINAL TRACT
863.0

FOO

RavinderSingh13 · January 17, 2016, 6:40am

scrutinizer:

Some of the solutions will fail if there is a single last line that does not start with a number:

862.0
DIAPHRAGM, WITHOUT MENTION OF OPEN WOUND INTO CAVITY
862.1
DIAPHRAGM, WITH OPEN WOUND INTO CAVITY
862.3
OTHER SPECIFIED INTRATHORACIC ORGAN, WITH OPEN WOUND INTO CAVITY
862.8
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF
 OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT
863.0
FOO

This does not seem to work.. I get:

[..]
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF
 OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT
863.0
FOO FOO

This does not seem to work either, I get:

[..]
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF
 OPEN WOUND INTO CAVITY
INJURY TO GASTROINTESTINAL TRACT
863.0

FOO

Hello Scrutinizer,

Thank you for letting me know. I have fixed the 2nd code; I am working on fixing the 1st code too and will update my post.
Let's say we have the following Input_file:

862.0
DIAPHRAGM, WITHOUT MENTION OF OPEN WOUND INTO CAVITY
862.1
DIAPHRAGM, WITH OPEN WOUND INTO CAVITY
862.3
OTHER SPECIFIED INTRATHORACIC ORGAN, WITH OPEN WOUND INTO CAVITY
862.8
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF
 OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT
863.0
FOO

The following code should then give the requested output. I have made minor changes to the code: for my previous Input_file I didn't consider that a line may start with a space or similar, so now I treat any line containing alphabetic characters as text and the other lines as containing only digits.

awk 'FNR==NR{MAX++;next} {if($0 ~ /[[:alpha:]]/){while($0 ~ /[[:alpha:]]/ && FNR<MAX){Q=Q?Q OFS $0:$0;getline};if(Q && FNR==MAX){;getline;print Q OFS $0} else {if(Q){print Q ORS $0} else {print $0}}} else {print};Q=""}' Input_file  Input_file

Output will be as follows.

862.0
DIAPHRAGM, WITHOUT MENTION OF OPEN WOUND INTO CAVITY
862.1
DIAPHRAGM, WITH OPEN WOUND INTO CAVITY
862.3
OTHER SPECIFIED INTRATHORACIC ORGAN, WITH OPEN WOUND INTO CAVITY
862.8
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF  OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT
863.0
FOO

Now taking my previous Input_file as follows.

862.0
DIAPHRAGM, WITHOUT MENTION OF OPEN WOUND INTO CAVITY
862.1
DIAPHRAGM, WITH OPEN WOUND INTO CAVITY
862.3
OTHER SPECIFIED INTRATHORACIC ORGAN, WITH OPEN WOUND INTO CAVITY
862.8
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF
OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT
R. Singh is a bad boy.
R. Singh likes Iron man.
R. Singh lloves UNIX.com

Running the same code on this Input_file gives the output shown further down.

code:
awk 'FNR==NR{MAX++;next} {if($0 ~ /[[:alpha:]]/){while($0 ~ /[[:alpha:]]/ && FNR<MAX){Q=Q?Q OFS $0:$0;getline};if(Q && FNR==MAX){;getline;print Q OFS $0} else {if(Q){print Q ORS $0} else {print $0}}} else {print};Q=""}' Input_file  Input_file

Output will be as follows.

862.0
DIAPHRAGM, WITHOUT MENTION OF OPEN WOUND INTO CAVITY
862.1
DIAPHRAGM, WITH OPEN WOUND INTO CAVITY
862.3
OTHER SPECIFIED INTRATHORACIC ORGAN, WITH OPEN WOUND INTO CAVITY
862.8
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT R. Singh is a bad boy. R. Singh likes Iron man. R. Singh lloves UNIX.com

Thanks,
R. Singh

Don_Cragun · January 17, 2016, 6:40pm

Hi Ravinder,
Note that your code is adding a field separator when joining alphanumeric data lines. (I'm guessing this is because carlr didn't use CODE tags when presenting the sample input and output files and you copied the sample data before Scrutinizer edited that post to include the tags that made the <space> at the start of the line that needed to be joined visible.)

Hi Scrutinizer,
I like your suggested awk script and it works perfectly for the sample data given. Unfortunately, the sample data carlr provided doesn't agree with the description of the actions to be taken:

Currently my file looks like:
862.0
DIAPHRAGM, WITHOUT MENTION OF OPEN WOUND INTO CAVITY
862.1
DIAPHRAGM, WITH OPEN WOUND INTO CAVITY
862.3
OTHER SPECIFIED INTRATHORACIC ORGAN, WITH OPEN WOUND INTO CAVITY
862.8
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF
 OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT
863.0
But I want it to look like the below, i.e. alternating between a line of digits and a line of text. I believe that there should be a way of making a SED or AWK statement which looks at each line, and if there are 2 consecutive lines starting with a letter A-Z, moves the second of these to be after the first.

Note that the continuation line " OPEN WOUND INTO CAVITY" does not meet the stated requirement of "2 consecutive lines starting with a letter A-Z" (both were highlighted in red in the original post). The line to be combined starts with a <space>; not an uppercase alphabetic character. Your code compensated for that inconsistency by looking for a non-digit instead of looking for an uppercase alpha.

I'm guessing that what carlr really wants to do is join any lines that contain anything other than digits and a possible decimal point. This would allow input like:

862.8
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF
 OPEN WOUND INTO CAVITY
862.9
2 OR MORE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION O
F OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT

to be turned into:

862.8
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF OPEN WOUND INTO CAVITY
862.9
2 OR MORE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT

instead of the output your script produces:

862.8
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF OPEN WOUND INTO CAVITY
862.9

2 OR MORE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION O
F OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT

If input like this is a concern to the submitter, something like this (that I had created before I saw Scrutinizer's suggestion):

awk '
{	d2 = ($0 ~ /[^0-9.]/) ? "" : ORS
	d1 = (d2 && NR > 1) ? d2 : ""
	printf("%s%s%s", d1, $0, d2)
}
END {	if(!d2)	print ""
}' file

or (using Scrutinizer's code as a base):

awk 'END{if(ORS==x)printf RS} {ORS=x} !/[^0-9.]/{if(NR>1)printf RS; ORS=RS}1' file

might work better.

Aia · January 17, 2016, 9:27pm

Assumption:
Any line that starts with spaces, where the previous line does not start with a number, is an unintended anomaly created by the import of the PDF.

Please, try the following:
Save as fix.pl

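#!/usr/bin/env perl
# Options after the interpreter name in an env shebang are not portable,
# so warnings are enabled with a pragma instead of -w.
use strict;
use warnings;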

my $head = <>;
while(<>){
    chomp $head if(/^\s/ and $head =~ /^\D/);
    print $head;
    $head = $_;
}
print $head;

Run as perl fix.pl carlr.txt > fixed.txt

MadeInGermany · January 18, 2016, 5:16am

I would augment the previous sed solution with a loop, and replace the line end with a space.

sed -e '/^[^0-9]/{' -e ':L' -e '$!N; s/\n\([^0-9]\)/ \1/; t L' -e '}' file

As multi-line

sed '
/^[^0-9]/{
  :L
  $!N
  s/\n\([^0-9]\)/ \1/
  t L
}
' file

Note: $!N is required by a standard sed, otherwise it does not print the last line.
In a recent GNU sed just N is sufficient. But $!N works as well.

Scrutinizer · January 18, 2016, 8:12am

Two basic awk things to think of in this kind of situation:

If the default FS is used, test $1 instead of $0 . That way you are not dependent on leading spaces that may or may not be there.
If you must use getline , use it conditionally while testing if the return code is >0. For example: if((getline a)>0)
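As a rough sketch of how those two points might be combined (this is not code anyone posted above, and the function name join_wrapped is made up for illustration): the first field is tested rather than $0, a "number" line is assumed to be a single field containing only digits and dots (the interpretation Don_Cragun guessed at), and every getline is checked for a return value >0 so that a text line at the very end of the file is neither duplicated nor dropped. Continuation lines are joined by plain concatenation, which assumes any separating space is already present at the line break.

```shell
# join_wrapped is a hypothetical helper name, not something from the thread.
# Usage: join_wrapped FILE
join_wrapped() {
  awk '
    # Assumption: a number line is a single field made up only of digits
    # and dots. Testing $1 (default FS) ignores any leading spaces.
    NF == 1 && $1 !~ /[^0-9.]/ { print; next }

    # Anything else is text: collect continuation lines until the next
    # number line or the end of the input.
    {
      text = $0
      held = ""
      have_held = 0
      # Only use nxt when getline really read a line (return value > 0).
      while ((getline nxt) > 0) {
        n = split(nxt, f)
        if (n == 1 && f[1] !~ /[^0-9.]/) {
          held = nxt
          have_held = 1
          break
        }
        text = text nxt    # plain concatenation; a leading space is kept
      }
      print text
      if (have_held) print held
    }
  ' "$1"
}
```

It could be run as join_wrapped carlr.txt > fixed.txt. Because getline is only trusted when it returns >0, a trailing line such as FOO is printed exactly once.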