From one word per line to plain text

Hello!
I've got a very big file (the output of tokenization) with one word per line.
How can I rebuild the "original" text, knowing that <s> and </s> are the sentence delimiters?

My file looks like this:

<s>
&&
tanzania
na
Afrika
kwa
ujumla
ambiwa
na
taifa
kubwa
tajiri
zinduka
na
piga
mwendo
hima
saka
maendeleo
.
</s>
<s>
agizwa
na
fundishwa
fuata
wayo
za
nchi
endelea
wezesha
fikia
hapo
zili
;
</s>
<s>
nayo
itika
wito
huo
.
</s>
<s>
itika
kwa
sauti
kubwa
na
nidhamu
pya
kiasi
kwamba
wakati
moja
rais
wa
serikali
ya
awamu
ya
tatu
,
mheshimiwa
Benjamin
William
pa
,
tunukiwa
na
nchi
hizo
heshima
ya
wa
mwenyekiti
mwenza
wa
tume
ya
utandawazi
pamoja
na
waziri
kuu
wa
nchi
tajiri
ya
Finland
,
bibi
tarja
halonen
.
</s>
<s>
ingi
ona
kwamba
undwa
kwa
tume
hiyo
ni
moja
ya
mbinu
za
ingiza
nchi
(
maskini
)
za
dunia
ya
tatu
katika
mfumo
wa
ubepari
wa
taifa
,
kwa
kauli
mbiu
ya
*
ubia
katika
maendeleo
*
.
</s>
... ... ... 

Thanks a lot for any help!
Mjomba

Something like this?

awk '/<s>/{next} /<\/s>/{$0="\n"}1' ORS=" " file

Thank you very much, Franklin52!
It does exactly what I need.
Just one small modification, if possible:
I'd like to get rid of the blank character that seems to be inserted at the beginning of each line.
Thanks again!
mjomba

Try this:

awk '/<s>/{next} /<\/s>/{print s;s="";next} {s=length==1?s $0:s?s FS $0:$0}' file

Can you explain this line of code for me?

Thank you, I appreciate it!

awk '/<s>/{next} /<\/s>/{print s;s="";next} {s=length==1?s $0:s?s FS $0:$0}' file

Explanation:

/<s>/{next} | Skip any line that contains <s>.

/<\/s>/{print s;s="";next} | If the line contains </s>, print the variable s, reset s to the empty string, and move on to the next line.

s=length==1?s $0 | If the line is exactly one character long (a dot, comma, semicolon, and so on), append it to s without a space in between.

:s?s FS $0:$0 | Otherwise, if s is not empty, append a space (FS, the default field separator) and the current line to s; if s is empty, set s to the current line.
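
For readability, the same logic can be spelled out with if/else instead of the nested ?: operators. This is only a sketch of an equivalent form, not a different method; the function name join_tokens is a made-up wrapper for illustration and is not part of awk.

```shell
# join_tokens: hypothetical wrapper name, chosen only for this illustration.
# Reads one token per line (from the given files or stdin) and prints one
# sentence per line.
join_tokens() {
  awk '
    /<s>/   { next }                     # sentence start: nothing to print
    /<\/s>/ { print s; s = ""; next }    # sentence end: emit and reset
    {
      if (length($0) == 1)               # single character: no space before it
        s = s $0
      else if (s != "")                  # later word: separate with FS (a space)
        s = s FS $0
      else                               # first word of the sentence
        s = $0
    }
  ' "$@"
}

# Small check using the third sentence from the sample file.
printf '%s\n' '<s>' nayo itika wito huo . '</s>' | join_tokens
# prints: nayo itika wito huo.
```

Note that the one-character rule applies to every single-character token, not only to sentence-final punctuation. With the sample data, an opening parenthesis or an asterisk is therefore attached to the preceding word as well (for example "nchi( maskini)").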
