Join the lines of every paragraph in file.txt

lxdorney · October 9, 2016, 4:49am

Hi all,

Is there a way to automatically convert each paragraph in a file to a single line? The files come from OCR'd documents, and the OCR splits every paragraph across several lines. I need to join each paragraph into one line.

#cat file.txt

The Commission on Higher Education (CHED) was created through Republic Act 7722
otherwise known as the Higher Education Act of 1994. Consistent with the government's
position of making education the central strategy for poverty reduction, the CHED shall
pursue the following objectives: (a) relevant and quality higher education within an
international milieu; (b) accessible and affordable higher education programs; (c) resolute
academic freedom to promote intellectual growth, learning, research and development to
produce high quality leaders and professionals; and (d) moral ascendancy for better
governance within its ranks and the entire higher education system.

Higher Education Budget.  The total 2015 proposed budget for Social Services will
amount to P938.8 billion of which P450.2 billion (48.0%) is for Education, Culture and
Manpower Development.  The proposed 2015 budget of CHED at P3.6 billion accounts for
0.4% of Social Services budget.  On the other hand, the 2015 proposed budget for State
Universities and Colleges (SUCs) amounting to P43.3 billion is roughly 4.6% of total Social
Services budget (Table 1).

On the aggregate, higher education budget for 2015 shall increase by 1.8% from the 2014
level although this increase is much lower from the 2013-2014 increase of 14.3%.  While
CHED experienced a substantial increase of 165.2% in 2013-2014, its proposed budget for
2015 implies a 55.6% decline from 2014.  Meanwhile, SUCs' proposal for 2015 shall expand
higher education institutions' (HEIs) budget by 13.8%, a significant increase from 2013-2014
expansion of 2.1% (Table 1).

Result should be this:

The Commission on Higher Education (CHED) was created through Republic Act 7722 otherwise known as the Higher Education Act of 1994. Consistent with the government's position of making education the central strategy for poverty reduction, the CHED shall pursue the following objectives: (a) relevant and quality higher education within an international milieu; (b) accessible and affordable higher education programs; (c) resolute academic freedom to promote intellectual growth, learning, research and development to produce high quality leaders and professionals; and (d) moral ascendancy for better governance within its ranks and the entire higher education system.

Higher Education Budget.  The total 2015 proposed budget for Social Services will amount to P938.8 billion of which P450.2 billion (48.0%) is for Education, Culture and Manpower Development.  The proposed 2015 budget of CHED at P3.6 billion accounts for 0.4% of Social Services budget.  On the other hand, the 2015 proposed budget for State Universities and Colleges (SUCs) amounting to P43.3 billion is roughly 4.6% of total Social Services budget (Table 1).

On the aggregate, higher education budget for 2015 shall increase by 1.8% from the 2014 level although this increase is much lower from the 2013-2014 increase of 14.3%.  While CHED experienced a substantial increase of 165.2% in 2013-2014, its proposed budget for 2015 implies a 55.6% decline from 2014.  Meanwhile, SUCs' proposal for 2015 shall expand higher education institutions' (HEIs) budget by 13.8%, a significant increase from 2013-2014 expansion of 2.1% (Table 1).

Thanks

greet_sed · October 9, 2016, 5:53am

Hi,

Can you please try the following one?

sed ':a;N;$!ba;s/\n\n/XZX/g;s/\n/ /g;s/XZX/\n\n/g' file

I tested it on the given input and it gives your desired output.

If XZX might occur in the file contents, you can try this one instead:

sed -e '/./{H;$!d;}' -e 'x;s/\n//g;s/$/\n/' file

In awk,

awk 'BEGIN{ ORS=RS="\n\n";} {gsub(/\n/," ")}1'  file

blastit.fr · October 9, 2016, 6:51am

Hi,

This line works perfectly on your sample file :

awk -v RS='\n\n' '{$1=$1}1' file.txt  > ocr.tmp

lxdorney · October 9, 2016, 8:34am

Thanks for the replies. I have sample files in my Google Drive public share.

When I apply the scripts to the sample text files from the Google Drive public share, the whole file content ends up on one line only.

greet_sed · October 9, 2016, 10:09am

Yes. In the input from post #1, the paragraphs were separated by empty lines. In the files shared on Google Drive in post #4, however, those separator lines contain a space, so the commands did not do what you wanted.

Here is what I tried with 1181.txt (using part of the contents of this file):

sed -e '/[a-zA-Z0-9]/{H;$!d;}' -e 'x;s/\n//g;s/$/\n/' ~/Downloads/1181.txt > OUT

I got the desired output. Modify the command as per your need.
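
A quick way to see this difference in a file is to count the truly empty lines against the lines that are empty or whitespace-only. The sketch below is only an illustration: the sample text and the variable names (`sample`, `empty`, `blank`, `fixed`) are made up, and the `sed` cleanup step is a suggestion, not part of the commands above.

```shell
# Hypothetical sample: the separator between the paragraphs holds one space.
sample=$(mktemp)
printf 'para one\n \npara two\n' > "$sample"

# grep -c prints the count but exits non-zero when it is 0, hence "|| true".
empty=$(grep -c '^$' "$sample" || true)              # truly empty lines
blank=$(grep -c '^[[:space:]]*$' "$sample" || true)  # empty or whitespace-only
echo "empty=$empty blank=$blank"

# Reduce whitespace-only lines to empty ones, then count empty lines again.
fixed=$(sed 's/^[[:space:]]*$//' "$sample" | grep -c '^$' || true)
echo "empty after cleanup=$fixed"
rm -f "$sample"
```

If `blank` is larger than `empty`, commands that look for `\n\n` or for `/./` will not see the paragraph breaks until the separator lines are cleaned up.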

Scrutinizer · October 9, 2016, 10:19am

Note: only GNU awk and mawk can use a regular expression for the RS variable. Other awks only use the first (leftmost) character of RS.

An alternative that should work in any modern POSIX awk and nawk:

awk -v RS="" '{$1=$1}1' file.txt
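
To illustrate paragraph mode: with RS set to the empty string, awk treats runs of blank lines as the record separator, and a newline always counts as a field separator, so `$1=$1` rebuilds each paragraph with single spaces. The sketch below adds `ORS='\n\n'`, which is not in the command above, on the assumption that a blank line should remain between the joined paragraphs. The sample text and the names `sample` and `joined` are made up.

```shell
# Hypothetical two-paragraph sample; not the poster's real file.
sample=$(mktemp)
printf 'first line\nsecond line\n\nnext paragraph\ncontinues here\n' > "$sample"

# RS=""  : paragraph mode, records are separated by blank lines.
# $1=$1  : rebuild the record with single spaces between the fields.
# ORS    : assumption, write a blank line after every joined paragraph.
joined=$(awk -v RS="" -v ORS='\n\n' '{$1=$1}1' "$sample")
printf '%s\n' "$joined"
rm -f "$sample"
```

Note that rebuilding the record also collapses runs of spaces inside a line, so the double spaces after a full stop in the original sample would become single spaces.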

lxdorney · October 9, 2016, 10:33am

Thanks, I will try this tomorrow.

blastit.fr · October 9, 2016, 4:00pm

Another way is to consider any line with no visible characters as a paragraph separator.
With blank as the default field separator, the awk NF variable will be equal to 0 whether the line contains only blank spaces or nothing at all.
So this code should fit most cases:

awk 'NF{printf "%s" ,$0 ;next}{print ""}' file.txt

But in your sample file from post #4, we must take care of consecutive empty lines.
I guess these should be treated as one single paragraph separator.

My final code :

awk 'NF{printf "%s" ,$0 ;nblf=0;next}nblf == 0{print "";nblf++}' file.txt

---------- Post updated at 10:00 PM ---------- Previous update was at 09:16 PM ----------

Using your file 1181.txt, I noticed there is an LF on the first line.
So this code should fit better, as this leading empty line will be removed.

awk 'NF{printf "%s" ,$0 ;istext++;next}istext{print "";istext=0}' 1181.txt  > 1181.res
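
The one-liners above glue lines together with `printf "%s"` and no separator, so they rely on every OCR line ending in a space. A possible variant, assuming that some lines may lack that trailing space and that a blank line should remain between paragraphs, is sketched below. The function names `join_paragraphs` and `emit`, the variables `para` and `n`, and the sample text are all hypothetical.

```shell
# Join each paragraph into one line; any run of empty or whitespace-only
# lines counts as a single separator. Uses only POSIX awk features.
join_paragraphs() {
    awk '
    # Print the collected paragraph, with a blank line before all but the first.
    function emit() {
        if (para != "") {
            if (n++) print ""
            print para
            para = ""
        }
    }
    # Line with visible text: trim its ends and append it with one space.
    NF {
        sub(/^[[:space:]]+/, ""); sub(/[[:space:]]+$/, "")
        para = (para == "" ? $0 : para " " $0)
        next
    }
    # Empty or whitespace-only line: the current paragraph is complete.
    { emit() }
    END { emit() }
    ' "$@"
}

# Made-up sample: leading empty line, trailing space, two separator lines.
sample=$(mktemp)
printf '\na \nb\n \n\nc\nd\n' > "$sample"
out=$(join_paragraphs "$sample")
printf '%s\n' "$out"
rm -f "$sample"
```

Collecting the paragraph in a variable instead of printing as it goes makes it easy to skip the leading empty line and to end the last paragraph with a newline.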