Pairing the nth elements on multiple lines iteratively

John_Lyon · August 1, 2015, 1:00pm

Hello,

I'm trying to create a word translation section of a book. Each entry in the word list will come from a set of linguistically analyzed texts.

Each sentence in the text has the following format. The first element in each line is the "name" of the line (i.e. "A","B","C","D"). The first line is the object language, the second line is a morpheme gloss, the third and fourth lines are stem/word-level translations:

A word1 word2 word3 word4
B wordW wordX wordY wordZ
C wordA wordB wordC wordD
D wordI wordII wordIII wordIV

What I'd like to do is pull the nth element in 2 or more lines (not counting the line "name"), and output them as a pair (or n-tuple) on the same line, later to be exported as columns to a spreadsheet. So for the above, I'd like:

word1 wordW wordA wordI
word2 wordX wordB wordII
word3 wordY wordC wordIII
word4 wordZ wordD wordIV

Note that the initial "name" elements occur several thousand times in the file, and I'd like to take care of all lines so named at the same time. Thanks, any ideas?

Don_Cragun · August 1, 2015, 9:22pm

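I'm not sure what you mean by "Note that the initial 'name' elements occur several thousand times in the file, and I'd like to take care of all lines so named at the same time."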

If you're saying that a single "name" can appear more than once in your input file and that when that happens the input lines need to be combined somehow, you need to give us sample input where that condition exists and show us how that is supposed to affect the output.

First, save the following in a file named tester:

#!/bin/ksh
awk '
{	for(i = 2; i <= NF; i++)
		o[i - 1, NR] = $i
	if(NF > m) m = NF
}
END {	for(i = 1; i < m; i++)
		for(j = 1; j <= NR; j++)
			printf("%s%s", o[i, j], j == NR ? ORS : OFS)
}' "$@"

If, and only if, you want to run this on a Solaris/SunOS system, change awk in the script to /usr/xpg4/bin/awk or nawk. Then make the script executable:

chmod +x tester

If you have an input file named file containing your sample input and another named file2 containing:

E e1
F f1 f2 f3 f4 f5
G g1 g2 g3
H h1 h2
I i1 i2 i3 i4 i5 i6 i7

then the command:

./tester file

produces the output you requested:

word1 wordW wordA wordI
word2 wordX wordB wordII
word3 wordY wordC wordIII
word4 wordZ wordD wordIV

and the command:

./tester OFS="," file2

produces the output:

e1,f1,g1,h1,i1
,f2,g2,h2,i2
,f3,g3,,i3
,f4,,,i4
,f5,,,i5
,,,,i6
,,,,i7

which shows how you can use a comma (or any other character string you want) as the output field separator and that it will correctly align output fields adjusted to account for input files where the number of input columns is not a constant. (If your input always has five input columns and you always want to produce four output rows, you can easily simplify this script some; but I'll leave that as a simple exercise for the reader.)
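
For readers who prefer Python, the same padded transposition can be sketched as follows. This is only an illustration of the idea described above, not a replacement for the awk script; the function name transpose_rows and the sep parameter are made up for this example, and it assumes fields are separated by whitespace.

```python
from itertools import zip_longest

def transpose_rows(lines, sep=" "):
    """Drop each line's leading name, then pair up the nth remaining fields."""
    rows = [line.split()[1:] for line in lines if line.strip()]
    # zip_longest pads short rows with "" so the output columns stay aligned
    return [sep.join(column) for column in zip_longest(*rows, fillvalue="")]

# The contents of file2 from the post above
sample = [
    "E e1",
    "F f1 f2 f3 f4 f5",
    "G g1 g2 g3",
    "H h1 h2",
    "I i1 i2 i3 i4 i5 i6 i7",
]
for row in transpose_rows(sample, sep=","):
    print(row)
```

With the file2 contents this prints the same seven comma-separated lines as the ./tester OFS="," file2 run shown above.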

danmero · August 2, 2015, 10:14am

Based on your data sample, try the following one-liner:

awk 'END{for(l=1;l++<NF;)print o[l]}{for(l=I;l++<NF;){o[l]=((o[l])?o[l]FS:S)$l}}' file

Since this is your first post, please read and understand the forum rules.

John_Lyon · August 2, 2015, 1:17pm

Thanks to you both for your replies. I was trying to keep it simple, but I should've added more information, I think. Here goes:

The data come from a LaTeX file, which uses a package called "Expex" which formats interlinear analyses of a non-English language.

The following two examples show how the data is laid out. The first line "\gla" is the object language, the second line "\glb" is the underlying form, the third line "\glc" is the morpheme gloss, the fourth line "\glc" is the word translation (the package doesn't allow "\gld" for whatever reason), and the last line "\glft" is the sentence translation. As you see, the number of words varies from example to example, just as natural language sentences may be shorter, or longer.

Each "word" is enclosed in curly brackets in the first two lines (though other sets of curly brackets may be nested within words), but only separated by spaces in the second two lines. The curly brackets are necessary to delimit words in the first two lines since some latex commands (e.g. "\ts" below) require blank spaces after them.

\gla {itl\'i\textglotstop } {k\textsuperscript{w}uk\textsuperscript{w}} {t\textschwa cx\textsuperscript{w}\'u\texthalflength\texthalflength y.}// 
\glb {itl\'i\textglotstop } {k\textsuperscript{w}uk\textsuperscript{w}} {tc+\ts x\textsuperscript{w}\'uy}//
\glc \textsc{dem} \textsc{rep} \textsc{loc}+go //
\glc from.there they.say came.over.this.way //
\glft `They said he was coming along.' //

\gla {u\textbeltl } {cut} {k\textsuperscript{w}uk\textsuperscript{w}} {al\'a\textglotstop } {lut} {i\textglotstop } {q\'aqx\textsuperscript{w}\textschwa lx} {ka\textglotstop} {cx\textsuperscript{w}uys} {i\textglotstop } {l} {siw\textbeltl k\textsuperscript{w}.} //
\glb {u\textbeltl } {cut} {k\textsuperscript{w}uk\textsuperscript{w}} {al\'a\textglotstop } {lut} {i\textglotstop } {q\'a(\tb)\ts qx\textsuperscript{w}lx} {ki\textglotstop} {c\textendash \ts x\textsuperscript{w}uy\textendash s} {i\textglotstop } {l} {siw\textbeltl k\textsuperscript{w}} //
\glc \textsc{conj} say \textsc{rep} \textsc{dem} \textsc{neg} \textsc{det} fish \textsc{comp.obl} \textsc{cust}\textendash go\textendash \textsc{3sg.poss} \textsc{det} \textsc{loc} water //
\glc and he.said they.say here no the fish where.that they.come the through water //
\glft `Coyote said there will be no fish going through the water here.' //

The \glft line may be ignored, but what I'd like exactly is the following, where "&" denotes a column separator in LaTeX and "\\" indicates a newline. Each line has 4 "words", i.e. the nth word in each of the first four lines in the examples above.

{itl\'i\textglotstop } & {itl\'i\textglotstop } & \textsc{dem} & from.there \\
{k\textsuperscript{w}uk\textsuperscript{w}} & {k\textsuperscript{w}uk\textsuperscript{w}} &  \textsc{rep} & they.say \\

Etcetera. Once the first example is done, the second example would be appended to the above list. Eventually each line will be sorted alphabetically by the first "column". It'd also be nice to be able to choose which input lines to include in the output, though I'd greatly appreciate any more assistance you could give in obtaining the basic result just outlined. Thanks again.

Don_Cragun · August 2, 2015, 6:01pm

Please don't give us "Etcetera."! Show us the exact output you are trying to produce from the 11 line sample input you showed us.

We need to see what is supposed to be done in the output when there are unequal numbers of "words" in input lines.

We need to see how the output lines corresponding to groups of input lines are supposed to be separated.

If you want output sorted, you also need to explain MUCH more clearly what the sort key is and explain how sorting on the 1st column of the output is going to maintain groups of associated output lines??? (The sort utility sorts lines; not line groups!)

You have been given sample awk scripts that work with the sample input you originally provided. Have you tried modifying those scripts to work with your (radically) different real input? What did you try? Where did you get stuck?

John_Lyon · August 2, 2015, 7:19pm

Thanks for the reply, and apologies for being vague; I'm new to all this. To be clear, the following input consists of two example sentences. There are a combined total of 15 curly-bracket-enclosed words in the \gla lines of these two examples (3 in the first, 12 in the second):

\gla {itl\'i\textglotstop } {k\textsuperscript{w}uk\textsuperscript{w}} {t\textschwa cx\textsuperscript{w}\'u\texthalflength\texthalflength y.}//  
\glb {itl\'i\textglotstop } {k\textsuperscript{w}uk\textsuperscript{w}} {tc+\ts x\textsuperscript{w}\'uy}// 
\glc \textsc{dem} \textsc{rep} \textsc{loc}+go // 
\glc from.there they.say came.over.this.way //
\glft `They said he was coming along.' //  

\gla {u\textbeltl } {cut} {k\textsuperscript{w}uk\textsuperscript{w}} {al\'a\textglotstop } {lut} {i\textglotstop } {q\'aqx\textsuperscript{w}\textschwa lx} {ka\textglotstop} {cx\textsuperscript{w}uys} {i\textglotstop } {l} {siw\textbeltl k\textsuperscript{w}.} // 
\glb {u\textbeltl } {cut} {k\textsuperscript{w}uk\textsuperscript{w}} {al\'a\textglotstop } {lut} {i\textglotstop } {q\'a(\tb)\ts qx\textsuperscript{w}lx} {ki\textglotstop} {c\textendash \ts x\textsuperscript{w}uy\textendash s} {i\textglotstop } {l} {siw\textbeltl k\textsuperscript{w}} // 
\glc \textsc{conj} say \textsc{rep} \textsc{dem} \textsc{neg} \textsc{det} fish \textsc{comp.obl} \textsc{cust}\textendash go\textendash \textsc{3sg.poss} \textsc{det} \textsc{loc} water // 
\glc and he.said they.say here no the fish where.that they.come the through water // 
\glft `Coyote said there will be no fish going through the water here.' //

Given this input, this is the initial output I'm looking for:

{itl\'i\textglotstop } & {itl\'i\textglotstop } & \textsc{dem} & from.there \\
{k\textsuperscript{w}uk\textsuperscript{w}} & {k\textsuperscript{w}uk\textsuperscript{w}} &  \textsc{rep} & they.say \\
{t\textschwa cx\textsuperscript{w}\'u\texthalflength\texthalflength y.} & {tc+\ts x\textsuperscript{w}\'uy} & \textsc{loc}+go &  came.over.this.way \\
{u\textbeltl } & {u\textbeltl } & \textsc{conj} &  and \\
{cut} & {cut} & say & he.said \\
{k\textsuperscript{w}uk\textsuperscript{w}} & {k\textsuperscript{w}uk\textsuperscript{w}} &  \textsc{rep} & they.say \\
{al\'a\textglotstop } &  {al\'a\textglotstop } & \textsc{dem} & here \\
{lut} & {lut} & \textsc{neg} & no \\
{i\textglotstop } & {i\textglotstop } & \textsc{det} & the \\
{q\'aqx\textsuperscript{w}\textschwa lx} & {q\'a(\tb)\ts qx\textsuperscript{w}lx} & fish & fish \\
{ka\textglotstop} & {ki\textglotstop} & \textsc{comp.obl} & where.that \\
{cx\textsuperscript{w}uys} & {c\textendash \ts x\textsuperscript{w}uy\textendash s} &  \textsc{cust}\textendash go\textendash \textsc{3sg.poss} & they.come \\
{i\textglotstop } & {i\textglotstop } & \textsc{det} & the \\
{l} & {l} & \textsc{loc} & through \\
{siw\textbeltl k\textsuperscript{w}} &  {siw\textbeltl k\textsuperscript{w}} & water & water \\

Then, these 15 lines would be sorted, the sort key being the first letter of the first word in each line, so the above 15 lines (corresponding to the total of 15 words in the \gla lines of the two unmodified examples), would be sorted like this:

{al\'a\textglotstop } &  {al\'a\textglotstop } & \textsc{dem} & here \\
{cut} & {cut} & say & he.said \\
{cx\textsuperscript{w}uys} & {c\textendash \ts x\textsuperscript{w}uy\textendash s} &  \textsc{cust}\textendash go\textendash \textsc{3sg.poss} & they.come \\
{itl\'i\textglotstop } & {itl\'i\textglotstop } & \textsc{dem} & from.there \\
{i\textglotstop } & {i\textglotstop } & \textsc{det} & the \\
{i\textglotstop } & {i\textglotstop } & \textsc{det} & the \\
{ka\textglotstop} & {ki\textglotstop} & \textsc{comp.obl} & where.that \\
{k\textsuperscript{w}uk\textsuperscript{w}} & {k\textsuperscript{w}uk\textsuperscript{w}} &  \textsc{rep} & they.say \\
{k\textsuperscript{w}uk\textsuperscript{w}} & {k\textsuperscript{w}uk\textsuperscript{w}} &  \textsc{rep} & they.say \\
{l} & {l} & \textsc{loc} & through \\
{lut} & {lut} & \textsc{neg} & no \\
{q\'aqx\textsuperscript{w}\textschwa lx} & {q\'a(\tb)\ts qx\textsuperscript{w}lx} & fish & fish \\
{siw\textbeltl k\textsuperscript{w}} &  {siw\textbeltl k\textsuperscript{w}} & water & water \\
{t\textschwa cx\textsuperscript{w}\'u\texthalflength\texthalflength y.} & {tc+\ts x\textsuperscript{w}\'uy} & \textsc{loc}+go &  came.over.this.way \\
{u\textbeltl } & {u\textbeltl } & \textsc{conj} &  and \\

Lines 5/6 and lines 8/9 above are duplicates, so the duplicate entries will be removed from the list, yielding 13 lines:

{al\'a\textglotstop } &  {al\'a\textglotstop } & \textsc{dem} & here \\
{cut} & {cut} & say & he.said \\
{cx\textsuperscript{w}uys} & {c\textendash \ts  x\textsuperscript{w}uy\textendash s} &  \textsc{cust}\textendash  go\textendash \textsc{3sg.poss} & they.come \\
{itl\'i\textglotstop } & {itl\'i\textglotstop } & \textsc{dem} & from.there \\
{i\textglotstop } & {i\textglotstop } & \textsc{det} & the \\
{ka\textglotstop} & {ki\textglotstop} & \textsc{comp.obl} & where.that \\
{k\textsuperscript{w}uk\textsuperscript{w}} &  {k\textsuperscript{w}uk\textsuperscript{w}} &  \textsc{rep} &  they.say \\
{l} & {l} & \textsc{loc} & through \\
{lut} & {lut} & \textsc{neg} & no \\
{q\'aqx\textsuperscript{w}\textschwa lx} & {q\'a(\tb)\ts qx\textsuperscript{w}lx} & fish & fish \\
{siw\textbeltl k\textsuperscript{w}} &  {siw\textbeltl k\textsuperscript{w}} & water & water \\
{t\textschwa cx\textsuperscript{w}\'u\texthalflength\texthalflength y.}  & {tc+\ts x\textsuperscript{w}\'uy} & \textsc{loc}+go &   came.over.this.way \\
{u\textbeltl } & {u\textbeltl } & \textsc{conj} &  and \\

The result will be an alphabetized vocabulary list, ready to be dropped into a "tabularx" table environment in LaTeX.

I had some luck with danmero's suggestion:

awk 'END{for(l=1;l++<NF;)print o[l]}{for(l=I;l++<NF;){o[l]=((o[l])?o[l]FS:S)$l}}' file

However, it only worked if (a) all of the extra blank spaces within "words" were removed (since it seems to use blank spaces as a word delimiter), and (b) only one example at a time is modified (since I think it assumes "line names" do not occur multiple times). Both of these issues are my fault, for not being clear during the initial posting about the nature of the data I'm working with. Also, I don't yet know enough about awk to identify what in the above command needs changing. Thanks for your assistance and patience! I hope this helps to clarify.
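
The steps described in this post (pair the nth words of each example, append the examples, sort, drop duplicates) can be sketched in Python roughly as below. This is a minimal sketch under loud assumptions: examples are separated by blank lines, words contain no internal spaces, and the "//" terminator is separated from the last word by a space. The names vocabulary_rows, format_table and keep are invented for this illustration. Sorting is plain string comparison on the whole row, which may not match the alphabetical order wanted for the final word list.

```python
import re

def vocabulary_rows(text, keep=("\\gla", "\\glb", "\\glc")):
    """Yield one tuple per word position for each blank-line-separated example."""
    for block in re.split(r"\n\s*\n", text.strip()):
        tiers = []
        for line in block.splitlines():
            fields = line.split()
            if fields and fields[0] in keep:
                # Drop the tier name and the trailing "//" terminator
                tiers.append([f for f in fields[1:] if f != "//"])
        # zip pairs the nth word of every kept tier in this example
        yield from zip(*tiers)

def format_table(rows):
    # set() removes duplicate rows; sorted() compares rows as plain strings
    return [" & ".join(row) + r" \\" for row in sorted(set(rows))]

# Simplified stand-in data (assumption: no spaces inside words)
sample = r"""
\gla {cut} {lut} //
\glb {cut} {lut} //
\glc say \textsc{neg} //
\glc he.said no //
\glft `ignored' //

\gla {al} {cut} //
\glb {al} {cut} //
\glc \textsc{dem} say //
\glc here he.said //
\glft `ignored' //
"""
for line in format_table(vocabulary_rows(sample)):
    print(line)
```

Changing the keep tuple is one way to choose which tiers end up in the output, although both \glc tiers share a name and so are always kept or dropped together in this sketch.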

Don_Cragun · August 3, 2015, 4:52pm

I'm confused, I thought you said that each <space> character in the 3rd and later lines in each input "sentence" separated "words". So, in the 12th line of your desired output:

{cx\textsuperscript{w}uys} & {c\textendash \ts x\textsuperscript{w}uy\textendash s} &  \textsc{cust}\textendash go\textendash \textsc{3sg.poss} & they.come \\

why are there two <space> characters in the middle of the single "word" \textsc{cust}\textendash go\textendash \textsc{3sg.poss} taken from the following input line?:

\glc \textsc{conj} say \textsc{rep} \textsc{dem} \textsc{neg} \textsc{det} fish \textsc{comp.obl} \textsc{cust}\textendash go\textendash \textsc{3sg.poss} \textsc{det} \textsc{loc} water //

I thought I could modify my earlier suggested awk script to handle your new requirements, but since my code thinks there are two additional fields in the 9th line of your sample input, it gets confused and produces the wrong output.

John_Lyon · August 4, 2015, 10:53am

Thanks again for the reply.

LaTeX requires a blank space after some commands (or '{}'), which creates problems if blank spaces are word delimiters.

I had been assuming that it might be possible to use '} {' as a word delimiter, rather than a space, but then that would run into complications with the 3rd and fourth lines, where there are no '} {' delimiters.

So, below I've replaced all of the blank spaces within words with '{}' which will hopefully help. Thanks.

\gla {itl\'i\textglotstop{}} {k\textsuperscript{w}uk\textsuperscript{w}} {t\textschwa{}cx\textsuperscript{w}\'u\texthalflength\texthalflength{}y.}//   
\glb {itl\'i\textglotstop{}} {k\textsuperscript{w}uk\textsuperscript{w}} {tc+\ts{}x\textsuperscript{w}\'uy}// 
\glc \textsc{dem} \textsc{rep} \textsc{loc}+go //  
\glc from.there they.say came.over.this.way // 
\glft `They said he was coming along.' //    

\gla {u\textbeltl{}} {cut} {k\textsuperscript{w}uk\textsuperscript{w}} {al\'a\textglotstop{}} {lut} {i\textglotstop{}} {q\'aqx\textsuperscript{w}\textschwa{}lx} {ka\textglotstop} {cx\textsuperscript{w}uys} {i\textglotstop{}} {l} {siw\textbeltl{}k\textsuperscript{w}.} //  
\glb {u\textbeltl{}} {cut} {k\textsuperscript{w}uk\textsuperscript{w}} {al\'a\textglotstop{}} {lut} {i\textglotstop{}} {q\'a(\tb)\ts{}qx\textsuperscript{w}lx} {ki\textglotstop} {c\textendash{}\ts{}x\textsuperscript{w}uy\textendash{}s} {i\textglotstop{}} {l} {siw\textbeltl{}k\textsuperscript{w}} //  
\glc \textsc{conj} say \textsc{rep} \textsc{dem} \textsc{neg} \textsc{det} fish \textsc{comp.obl} \textsc{cust}\textendash go\textendash \textsc{3sg.poss} \textsc{det} \textsc{loc} water //  
\glc and he.said they.say here no the fish where.that they.come the through water //  
\glft `Coyote said there will be no fish going through the water here.' //
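
The replacement described in this post (a '{}' instead of the blank that follows a LaTeX command) could be automated along these lines. This is a hedged sketch, not a tested tool: protect_command_spaces is a made-up name, and it assumes that a command made of letters (such as \ts or \textendash) is never the last thing in a word on a line without braces, since the space after it would then be a real word separator.

```python
import re

def protect_command_spaces(line):
    """Replace the space(s) after a LaTeX control word with '{}'."""
    # Keep the tier marker (\gla, \glb, \glc, ...) and its following space as-is
    marker, sep, rest = line.partition(" ")
    # A control word is a backslash followed by letters only
    fixed = re.sub(r"(\\[A-Za-z]+) +", r"\1{}", rest)
    return marker + sep + fixed

glc = r"\glc \textsc{cust}\textendash go\textendash \textsc{3sg.poss} \textsc{det} water //"
print(protect_command_spaces(glc))
print(len(protect_command_spaces(glc).split()))
```

After the substitution, splitting the \glc line on whitespace gives the marker, three words and the "//" terminator, so the compound gloss counts as a single word.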

Don_Cragun · August 4, 2015, 2:06pm

Huh??? The only difference in the 9th line in this sample and your previous sample is that there are two spaces following the // at the end of the line here when there was only one space following the // on that line in your previous sample!

If you could modify your input so that braces surround words on all lines (with the possible exception of the \glft lines), as you are currently doing with the \gla and \glb lines, that would make it easy to do what you want. Hybrid lines (as you said this example would provide) would be possible to process, but will require more complex code.
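
One possible way to handle the hybrid lines is to split only on whitespace that lies outside every {...} group, so that braced words and bare words are treated alike. The Python sketch below illustrates that idea; split_words and tier_words are hypothetical names, escaped braces such as \{ are not handled, and any blank inside an unbraced word is assumed to have been replaced by '{}' beforehand.

```python
def split_words(line):
    """Split on whitespace that lies outside any {...} group."""
    words, current, depth = [], [], 0
    for ch in line:
        if ch.isspace() and depth == 0:
            # A top-level blank ends the current word, if there is one
            if current:
                words.append("".join(current))
                current = []
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        current.append(ch)
    if current:
        words.append("".join(current))
    return words

def tier_words(line):
    """Return the words of one tier without its name and '//' terminator."""
    body = line.strip()
    if body.endswith("//"):
        body = body[:-2]
    return split_words(body)[1:]

gla = r"\gla {itl\'i\textglotstop } {k\textsuperscript{w}uk\textsuperscript{w}} {tc+\ts x\textsuperscript{w}\'uy}//"
glc = r"\glc \textsc{dem} \textsc{rep} \textsc{loc}+go //"
for pair in zip(tier_words(gla), tier_words(glc)):
    print(" & ".join(pair) + r" \\")
```

Both sample lines yield three words, so zip pairs them position by position even though only the \gla line uses braces as word delimiters.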

summer_cherry · August 5, 2015, 5:27am

python

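# Read the file and split each non-empty line into whitespace-separated fields.
with open("b.txt") as file:
    rows = [line.split() for line in file if line.strip()]

# zip(*rows) transposes the rows; the first column (the line names) is skipped.
# Note: zip stops at the shortest row, so every line should have the same number of fields.
for column in list(zip(*rows))[1:]:
    print(" ".join(column))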

awk

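awk '{
        # Store every field under a (column, row) key
        for(i=1;i<=NF;i++){
                arr[i,NR]=$i
        }
        # Remember the widest row and the row count for the END block
        if(NF>col) col=NF
        row=NR
}
END{
        # Skip column 1 (the line names); print each remaining column as a row
        for (i=2;i<=col;i++){
          for (j=1;j<=row;j++)
            printf("%s%s", arr[i,j], j==row ? ORS : OFS)
        }
}' a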

John_Lyon · August 8, 2015, 6:35pm

Thanks very much again for the replies, I'll be travelling for the next couple weeks but will pick up this thread when I'm back.

Don_Cragun · August 8, 2015, 7:06pm

If the thread times out and is closed when you get back, send me a private message and I'll reopen it for you. (Don't start another thread; there is a lot of context here that we don't want to repeat in a new thread.)

RudiC · August 9, 2015, 12:34pm

This builds on Don Cragun's proposal in post #2 but extends it to deal with multiple data sets in one or more files. Unfortunately, the problem with "spaces in words" in the \glc lines could not be solved and had to be dealt with manually (I replaced them with underscores). Except for this, the output is close to what you desired in post #6:

awk '
/^\\glft/       {next
                }
/^\\gla/        {gsub ("} {", "}\&{")
                 BEG=MX
                 MX=NF 
                 DLT=NR-1
                }
/^\\glb/        {gsub ("} {", "}\&{")
                }
/^\\glc/        {sub (" ", "")
                 sub ("[ /]*$", "")
                 gsub(" ", "\&")   
                }

                {sub (/\\gl[abc] */,_)
                 $1=$1
                 for (i = 1; i <= NF; i++)
                        {o[i + BEG, NR - DLT] = $i
                        }
                }

END             {for(i = 1; i <= BEG+MX; i++)
                        for(j = 1; j <= 4; j++)
                                printf ("%s%s", o[i, j], j == 4 ? ORS : OFS)
                }
' FS="&"  OFS=" & " file | sort -u

Output:

{al\'a\textglotstop } & {al\'a\textglotstop } & \textsc{dem} & here
{cut} & {cut} & say & he.said
{cx\textsuperscript{w}uys} & {c\textendash \ts x\textsuperscript{w}uy\textendash s} & \textsc{cust}\textendash_go\textendash_\textsc{3sg.poss} & they.come
{i\textglotstop } & {i\textglotstop } & \textsc{det} & the
{itl\'i\textglotstop } & {itl\'i\textglotstop } & \textsc{dem} & from.there
{ka\textglotstop} & {ki\textglotstop} & \textsc{comp.obl} & where.that
{k\textsuperscript{w}uk\textsuperscript{w}} & {k\textsuperscript{w}uk\textsuperscript{w}} & \textsc{rep} & they.say
{l} & {l} & \textsc{loc} & through
{lut} & {lut} & \textsc{neg} & no
{q\'aqx\textsuperscript{w}\textschwa lx} & {q\'a(\tb)\ts qx\textsuperscript{w}lx} & fish & fish
{siw\textbeltl k\textsuperscript{w}.} //  & {siw\textbeltl k\textsuperscript{w}} //  & water & water
{t\textschwa cx\textsuperscript{w}\'u\texthalflength\texthalflength y.}//   & {tc+\ts x\textsuperscript{w}\'uy}//  & \textsc{loc}+go & came.over.this.way
{u\textbeltl } & {u\textbeltl } & \textsc{conj} & and