Merge lines in text file based on pattern

Bertik · February 1, 2010, 3:36am

Hello,

I have searched the forum for a solution to my problem, but either I could not find anything or I did not understand the examples.

I should say, I am very inexperienced with text processing.

I have a text file with about 60k lines in it.
I need to merge lines based on the number at the end of the "master line".

Example 1:

Word|1
(1)|Wordel|One Word

Here I need to delete the pipe character and the number 1 after the word 'Word', then merge the first line with the second line, dropping the "(1)" marker, that is, the number 1 and the "(" and ")" characters. The result should look like this:

Word|Wordel|One Word

Example 2:

Eye|4
(1)|Human Eye|Animal Eye
(2)|My Eye|Your Eye|His Eye|Her Eye
(3)|Second Eye|Third Eye
(4)|So Much About Eye

Here I need to delete the pipe character and the number 4 after the word 'Eye', then merge the first line with the four lines that follow, dropping the "(1)", "(2)", "(3)" and "(4)" markers. The result should look like this:

Eye|Human Eye|Animal Eye|My Eye|Your Eye|His Eye|Her Eye|Second Eye|Third Eye|So Much About Eye

So, if the txt file looks like this before processing:

Word|1
(1)|Wordel|One Word
Eye|4
(1)|Human Eye|Animal Eye
(2)|My Eye|Your Eye|His Eye|Her Eye
(3)|Second Eye|Third Eye
(4)|So Much About Eye

I need it to look like this after processing:

Word|Wordel|One Word
Eye|Human Eye|Animal Eye|My Eye|Your Eye|His Eye|Her Eye|Second Eye|Third Eye|So Much About Eye

Could somebody help me with this please?

danmero · February 1, 2010, 5:28am

awk 'BEGIN{FS=OFS="|"}/^[A-Z]/{printf (NR==1?_:RS)$1;next}{$1=_;printf}' infile

Bertik · February 1, 2010, 6:34am

Thank you danmero for your help.

Unfortunately, it is giving me this:

awk: line 1: no arguments in call to printf

and not processing anything.

I forgot to mention that the txt file uses Unicode characters.
Also, I am running Ubuntu 9.10.

danmero · February 1, 2010, 7:32am

Try gawk on Ubuntu.

Bertik · February 1, 2010, 7:55am

Now it is giving me this:

zhudlaitgawk: (FILENAME=1.txt FNR=2) fatal: printf: no arguments

The word zhudlait is the first word on the first line of the 1.txt file.

The whole first line looks like this:

zhudlait|1
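
Both error messages complain about the same thing: the bare `printf` at the end of the one-liner, which neither of the awk implementations tried so far accepts without a format argument. A minimal sketch of the same approach with an explicit format on every `printf` follows. It assumes, as the one-liner does, that every headword line starts with an ASCII letter and every continuation line does not. The function name `merge_entries` is made up for illustration.

```shell
# Hypothetical wrapper around the one-liner's logic, with explicit printf formats.
merge_entries() {
  awk '
    BEGIN { FS = OFS = "|" }
    # Headword line (assumed to start with an ASCII letter): begin a new
    # output line, except before the very first one, and print only the
    # headword, which drops the trailing count.
    /^[A-Za-z]/ { printf "%s%s", (NR == 1 ? "" : RS), $1; next }
    # Continuation line: blank the "(n)" field so $0 is rebuilt with a
    # leading "|", then append it without a newline.
    { $1 = ""; printf "%s", $0 }
    # Terminate the last output line.
    END { if (NR > 0) print "" }
  ' "$@"
}

# Sample run on the first example from the question.
printf '%s\n' 'Word|1' '(1)|Wordel|One Word' | merge_entries
# prints: Word|Wordel|One Word
```

With no file argument the function reads standard input; with one, for example `merge_entries 1.txt`, it reads that file.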

danmero · February 1, 2010, 9:01am

Can you post a sample of the real data enclosed in [code] tags?

Bertik · February 1, 2010, 9:14am

Sure, I can

zhudlait|1
(1)|zkazit <co>|zpackat
vrch|4
(1)|kopec|pahorek|vr�ek|vyv��enina
(2)|pahorkatina|vrchovina
(3)|vrchol
(4)|povrch|hoej�ek
�korpit se|1
(1)|ha�teit se|h�dat se|�k�dlit se
spurtovat|2
(1)|zrychlit bh
(2)|fini�ovat
opl�st|1
(1)|ovinout <co �m>|omotat|obtoit
b�ze|2
(1)|z�klad|z�kladna|podklad|v�chodisko
(2)|z�sada|alk�lie
�ehnat se|1
(1)|louit se <s k�m>|roz�ehn�vat se
zahnat|3
(1)|odehnat <koho>|zapudit|zapla�it
(2)|ukojit (hlad)|uti�it|za�ehnat
(3)|odn�st <co kam> (boue)
rozohovat se|1
(1)|rozv�ovat se|rozpalovat se|uklidovat se
pipamatovat|1
(1)|pipomenout|vzpomenout|upamatovat|zapomenout
prop�t|1
(1)|utratit (pit�m)|proh�it|prochlastat
dopustit se|1
(1)|sp�chat (zloin)|prov�st
klepnout|4
(1)|uknout|kliknout (tla�tkem)|cvaknout
(2)|zas�hnout <koho>|pra�tit
(3)|zab�t|ranit (mrtvice)
(4)|popov�dat si|poklepat si|zdrbnout si
zotroen�|2
(1)|utlaen�|poroben�
(2)|nesvobodn�

Thank you for all your time, danmero.

danmero · February 1, 2010, 12:25pm

Based on your sample data:

# awk 'BEGIN{FS=OFS="|"}/^[a-z]/{printf (NR==1?_:RS)$1;next}{$1=_;printf}' infile
zhudla.it|zkazit <co>|zpackat
vrch|kopec|pahorek|vr.ek|vyv�.enina|pahorkatina|vrchovina|vrchol|povrch|ho.ej.ek|1|ha.te.it se|h�dat se|.k�dlit se
spurtovat|zrychlit b.h|fini.ovat
opl�st|ovinout <co .�m>|omotat|obto.it
b�ze|z�klad|z�kladna|podklad|v�chodisko|z�sada|alk�lie|1|lou.it se <s k�m>|roz.ehn�vat se
zahnat|odehnat <koho>|zapudit|zapla.it|ukojit (hlad)|uti.it|za.ehnat|odn�st <co kam> (bou.e)
rozoh.ovat se|rozv�..ovat se|rozpalovat se|uklid.ovat se
p.ipamatovat|p.ipomenout|vzpomenout|upamatovat|zapomenout
prop�t|utratit (pit�m)|proh�.it|prochlastat
dopustit se|sp�chat (zlo.in)|prov�st
klepnout|.uknout|kliknout (tla.�tkem)|cvaknout|zas�hnout <koho>|pra.tit|zab�t|ranit (mrtvice)|popov�dat si|poklepat si|zdrbnout si
zotro.en�|utla.en�|poroben�|nesvobodn�

Use gawk, nawk or /usr/xpg4/bin/awk on Solaris.
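
Note that two headwords are missing from the output above: wherever a master line starts with a non-ASCII letter, `/^[a-z]/` does not match it, so the line is handled as a continuation, its headword is blanked, and only the `|1` survives in the middle of the previous entry. A small diagnostic sketch for listing such lines before merging is shown below. It assumes that every line not starting with `(` is meant to be a master line, and the name `find_missed_headwords` is made up for illustration.

```shell
# Hypothetical helper: report master lines that /^[a-z]/ would not recognise.
find_missed_headwords() {
  # LC_ALL=C makes [a-z] cover ASCII lowercase letters only, regardless of
  # the locale the shell happens to run in.
  LC_ALL=C awk '
    # Not a continuation line and not matched by the letter range:
    # print the line number and the line itself.
    !/^\(/ && !/^[a-z]/ { print FNR ": " $0 }
  ' "$@"
}

# Sample run: only the third line is reported.
printf '%s\n' 'vrch|1' '(1)|vrchol' 'škorpit se|1' '(1)|hádat se' | find_missed_headwords
```

If the helper reports anything, testing for the leading `(` of continuation lines is likely a safer way to tell the two kinds of line apart than testing the first letter of the headword.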

Bertik · February 2, 2010, 8:08am

I could not make it work, danmero. Anyway, thank you for your time.

Here is the command that worked (it also adds a pipe character to the end of each line):

awk -F\| -v OFS=\| '/^\(/{sub(/[^|]*|/,"");var=var $0;next}var{print var "|" }{NF--;var=$0}END{print var "|"}' file

Credit for this command goes to pgas from #awk at irc.freenode.net.
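
The trailing pipe comes from the two `print var "|"` statements in that command. A minimal sketch of the same accumulate-and-print idea without the extra pipe follows. It assumes that every master line ends in `|` followed by a count and that every continuation line starts with a parenthesised number. The names `merge_lines`, `entry` and `have` are illustrative, not part of the original command.

```shell
# Hypothetical variant of the working command that omits the trailing pipe.
merge_lines() {
  awk '
    # Continuation line: strip the leading "(n)" and append what is left,
    # which still begins with "|", to the entry being built.
    /^\(/ { sub(/^\([0-9]+\)/, ""); entry = entry $0; next }
    # Master line: flush the previous entry if there is one, then start a
    # new entry from the headword with its trailing "|count" removed.
    { if (have) print entry; sub(/[|][0-9]+$/, ""); entry = $0; have = 1 }
    # Flush the last entry.
    END { if (have) print entry }
  ' "$@"
}

# Sample run on a shortened version of the second example.
printf '%s\n' 'Eye|2' '(1)|Human Eye|Animal Eye' '(2)|So Much About Eye' | merge_lines
# prints: Eye|Human Eye|Animal Eye|So Much About Eye
```

Because this version never splits lines into fields, it does not rely on `NF--` rebuilding the record.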