Hi,
I am trying to compare 2 lists. However, one of these lists has to be taken from a .pdf file. When I copy the text into a .txt document there are formatting errors which I need to correct. The document is long (~10,000 lines), so I need to script the re-formatting.
Currently my file looks like:
862.0
DIAPHRAGM, WITHOUT MENTION OF OPEN WOUND INTO CAVITY
862.1
DIAPHRAGM, WITH OPEN WOUND INTO CAVITY
862.3
OTHER SPECIFIED INTRATHORACIC ORGAN, WITH OPEN WOUND INTO CAVITY
862.8
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF
OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT
863.0
But I want it to look like the sample below, i.e. alternating between a line of digits and a line of text. I believe there should be a way of writing a sed or awk statement which looks at each line and, if there are 2 consecutive lines starting with a letter A-Z, appends the second of these to the end of the first.
Can anyone help me to do this?
862.0
DIAPHRAGM, WITHOUT MENTION OF OPEN WOUND INTO CAVITY
862.1
DIAPHRAGM, WITH OPEN WOUND INTO CAVITY
862.3
OTHER SPECIFIED INTRATHORACIC ORGAN, WITH OPEN WOUND INTO CAVITY
862.8
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT
863.0
Please use code tags for the commands, code and input you include in your posts, as per the forum rules. Could you please try the following and let me know if it helps?
awk '($0 ~ /^[[:alpha:]]/){A=$0;getline;if($0 ~ /^[[:alpha:]]/){if(A){print A OFS $0;A=""}} else {print A ORS $0;A=""};next}{print}' Input_file
Output will be as follows.
862.0
DIAPHRAGM, WITHOUT MENTION OF OPEN WOUND INTO CAVITY
862.1
DIAPHRAGM, WITH OPEN WOUND INTO CAVITY
862.3
OTHER SPECIFIED INTRATHORACIC ORGAN, WITH OPEN WOUND INTO CAVITY
862.8
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT
863.0
EDIT: If the Input_file may contain two or more consecutive lines that start with a letter, the following code may help.
Following is the Input_file:
862.0
DIAPHRAGM, WITHOUT MENTION OF OPEN WOUND INTO CAVITY
862.1
DIAPHRAGM, WITH OPEN WOUND INTO CAVITY
862.3
OTHER SPECIFIED INTRATHORACIC ORGAN, WITH OPEN WOUND INTO CAVITY
862.8
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF
OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT
R. Singh is a bad boy.
R. Singh likes Iron man.
863.0
R. Singh loves UNIX.com
DIAPHRAGM, WITHOUT MENTION OF OPEN WOUND INTO CAVITY
862.1
DIAPHRAGM, WITH OPEN WOUND INTO CAVITY
862.3
OTHER SPECIFIED INTRATHORACIC ORGAN, WITH OPEN WOUND INTO CAVITY
862.8
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT R. Singh is a bad boy. R. Singh likes Iron man.
863.0
R. Singh loves UNIX.com
$ awk ' /^[^0-9]/{ getline a; $0=$0 (a ~ /^[^0-9]/ ? a: "\n"a) } 1' file
862.0
DIAPHRAGM, WITHOUT MENTION OF OPEN WOUND INTO CAVITY
862.1
DIAPHRAGM, WITH OPEN WOUND INTO CAVITY
862.3
OTHER SPECIFIED INTRATHORACIC ORGAN, WITH OPEN WOUND INTO CAVITY
862.8
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT
863.0
Some of the solutions will fail if there is a single last line that does not start with a number:
862.0
DIAPHRAGM, WITHOUT MENTION OF OPEN WOUND INTO CAVITY
862.1
DIAPHRAGM, WITH OPEN WOUND INTO CAVITY
862.3
OTHER SPECIFIED INTRATHORACIC ORGAN, WITH OPEN WOUND INTO CAVITY
862.8
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF
OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT
863.0
FOO
This will produce a duplicate of the last-but-one line:
[..]
INJURY TO GASTROINTESTINAL TRACT
863.0
FOO
863.0
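A duplicate like the one above is what the getline-based one-liner shown earlier produces when the last line is text: at end of input, getline returns 0 and leaves the variable a holding the previously read line, which then gets appended again. A minimal sketch of a guarded variant follows. The function name join_pairs is hypothetical, and like the original one-liner it joins only one continuation line per text line and concatenates without adding a separator, assuming the wrapped line brings its own leading space.

```shell
# Hypothetical helper: join a text line with the text line that follows it.
join_pairs() {
  awk '
    /^[^0-9]/ {
      # getline returns 1 on success, 0 at end of input, -1 on error;
      # only use "a" when a line was actually read.
      if ((getline a) > 0)
        $0 = $0 (a ~ /^[^0-9]/ ? a : "\n" a)
    }
    1
  ' "$@"
}
```

It can be run as: join_pairs file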
This will leave out the last line:
[..]
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT
863.0
Using $!N instead of N should fix that...
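As an illustration of the $!N idea, here is a minimal sed sketch; it is not the script being discussed, and the function name join_wrapped is hypothetical. It assumes that code lines start with a digit and that a wrapped line should be joined with a single space. With $!N the next line is appended only when the current line is not the last one, so a lone trailing text line is still printed.

```shell
# Hypothetical helper: join wrapped text lines, keeping a lone last line.
join_wrapped() {
  sed -e ':a' \
      -e '$!N' \
      -e '/^[^0-9].*\n[^0-9]/{' \
      -e 's/\n/ /' \
      -e 'ba' \
      -e '}' \
      -e 'P' \
      -e 'D' "$@"
}
```

It can be run as: join_wrapped file. The P and D commands print and remove only the first line of the two-line window, so the second line is compared with the line after it on the next pass. If the wrapped lines already start with a space, the join will contain two spaces.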
--
This does not seem to work. I get:
[..]
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF
OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT
863.0
FOO FOO
This does not seem to work either, I get:
[..]
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF
OPEN WOUND INTO CAVITY
INJURY TO GASTROINTESTINAL TRACT
863.0
FOO
Thank you for letting me know. I have fixed the 2nd code; I am working on fixing the 1st code too and will update my post.
Let's say we have the following Input_file:
862.0
DIAPHRAGM, WITHOUT MENTION OF OPEN WOUND INTO CAVITY
862.1
DIAPHRAGM, WITH OPEN WOUND INTO CAVITY
862.3
OTHER SPECIFIED INTRATHORACIC ORGAN, WITH OPEN WOUND INTO CAVITY
862.8
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF
OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT
863.0
FOO
Then the following code may give us the requested output. I have made minor changes to the code: previously I didn't consider that a line may start with a space or similar, so now I distinguish between lines that contain letters and lines that contain only digits.
862.0
DIAPHRAGM, WITHOUT MENTION OF OPEN WOUND INTO CAVITY
862.1
DIAPHRAGM, WITH OPEN WOUND INTO CAVITY
862.3
OTHER SPECIFIED INTRATHORACIC ORGAN, WITH OPEN WOUND INTO CAVITY
862.8
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT
863.0
FOO
Now take my previous Input_file, as follows:
862.0
DIAPHRAGM, WITHOUT MENTION OF OPEN WOUND INTO CAVITY
862.1
DIAPHRAGM, WITH OPEN WOUND INTO CAVITY
862.3
OTHER SPECIFIED INTRATHORACIC ORGAN, WITH OPEN WOUND INTO CAVITY
862.8
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF
OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT
R. Singh is a bad boy.
R. Singh likes Iron man.
R. Singh lloves UNIX.com
After running the code, it produces the following output:
862.0
DIAPHRAGM, WITHOUT MENTION OF OPEN WOUND INTO CAVITY
862.1
DIAPHRAGM, WITH OPEN WOUND INTO CAVITY
862.3
OTHER SPECIFIED INTRATHORACIC ORGAN, WITH OPEN WOUND INTO CAVITY
862.8
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT R. Singh is a bad boy. R. Singh likes Iron man. R. Singh lloves UNIX.com
Hi Ravinder,
Note that your code is adding a field separator when joining alphanumeric data lines. (I'm guessing this is because carlr didn't use CODE tags when presenting the sample input and output files and you copied the sample data before Scrutinizer edited that post to include the tags that made the <space> at the start of the line that needed to be joined visible.)
Hi Scrutinizer,
I like your suggested awk script and it works perfectly for the sample data given. Unfortunately, the sample data carlr provided doesn't agree with the description of the actions to be taken:
Note that the line shown in red does not meet the requirement shown in red. The line to be combined starts with a <space>; not an uppercase alphabetic character. Your code compensated for that inconsistency by looking for a non-digit instead of looking for an uppercase alpha.
I'm guessing that what carlr really wants to do is join any lines that contain anything other than digits and a possible decimal point. This would allow input like:
862.8
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF
OPEN WOUND INTO CAVITY
862.9
2 OR MORE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION O
F OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT
to be turned into:
862.8
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF OPEN WOUND INTO CAVITY
862.9
2 OR MORE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT
instead of the output your script produces:
862.8
MULTIPLE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION OF OPEN WOUND INTO CAVITY
862.9
2 OR MORE AND UNSPECIFIED INTRATHORACIC ORGANS WITHOUT MENTION O
F OPEN WOUND INTO CAVITY
863
INJURY TO GASTROINTESTINAL TRACT
If input like this is a concern to the submitter, something like this (that I had created before I saw Scrutinizer's suggestion):
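The script referred to above is not shown here. As a stand-in, the following is a minimal sketch of the stated idea, not the poster's code: treat a line as a code line only if it consists of digits with an optional decimal part, and join every other run of lines into one. The function name join_text_lines is hypothetical. Lines are concatenated without a separator, on the assumption (taken from the expected output above) that a wrapped line either carries its own leading space or continues a split word.

```shell
# Hypothetical helper: join runs of lines that are not purely numeric codes.
join_text_lines() {
  awk '
    # A code line holds only digits with an optional decimal part, e.g. 863 or 862.8.
    /^[0-9]+(\.[0-9]+)?$/ {
      if (have_text) { print text; have_text = 0 }   # flush the pending text run
      print
      next
    }
    # Any other line is text: start a new run or append to the open one.
    {
      text = (have_text ? text $0 : $0)
      have_text = 1
    }
    # Flush a text run that ends the file.
    END { if (have_text) print text }
  ' "$@"
}
```

It can be run as: join_text_lines file. Because a line such as "2 OR MORE ..." does not match the code pattern, it is treated as text even though it starts with a digit.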
Assumption:
Any line that starts with spaces, where the previous line does not start with a number, is an unintended anomaly created by the import from the PDF.
Please try the following:
Save as fix.pl:
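#!/usr/bin/env perl
# Join a line that starts with whitespace onto the previous non-numeric line.
use strict;
use warnings;    # replaces -w, which env cannot pass on a shebang line on many systems

my $head = <>;                 # previous line, held back until we know whether to join
exit unless defined $head;     # nothing to do for empty input
while (<>) {
    # Drop the newline of the held line when the current line is a continuation.
    chomp $head if (/^\s/ and $head =~ /^\D/);
    print $head;
    $head = $_;
}
print $head;                   # flush the final line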