Creating a master file of conjugated verbs by concatenating root and inflection from separate files

gimley · January 16, 2018, 3:27am

Apologies for the long descriptive title.
I am working with Sindhi and developing a database of all verbal conjugations in that language.
I have generated 2 files:
Verbs.dic contains all the verbs, one verb per line
Inflections.dic contains the verbal conjugations which need to be appended to each verb.
An example will make this clear. I am choosing English for clarity and have chosen a very simple set, given the complexity of English verbs.
The input files are as under
Verbs.dic

walk
talk
seat
pick
laugh

Inflections.dic

s
ing
ed

What I need is a Perl or Awk script which will take the list of inflections from Inflections.dic and append each of them to each verb in Verbs.dic. The resulting output would be as under:
Output

walks
walking
walked
talks
talking
talked
seats
seating
seated
picks
picking
picked
laughs
laughing
laughed

In English the list of inflections is pretty limited; in Sindhi the number of inflections ranges from 35 to 40, and generating them manually is impossible.
Please note: unfortunately, I work in a Windows environment.
All good wishes for the New Year and many thanks in advance

Don_Cragun · January 16, 2018, 5:17am

With well over 250 posts, we would hope that you could copy what has been done in lots of other threads in this forum doing more complex tasks than what is being requested here.

What have you tried to solve this problem on your own?

Is Cygwin installed on your system?

Do you have awk?

Do you have bash or ksh?

gimley · January 16, 2018, 8:47am

Dear Don Cragun,
Thanks for taking time off to reply. As usual, I hunted for this specific issue but could not find an answer. I hope I did not miss out the solution.
I have awk/sed and perl on my machine. I am waiting for the Fall Creators Update to be able to use Linux on my computer, which will make life easier for me.
Many thanks

---------- Post updated at 08:47 AM ---------- Previous update was at 08:43 AM ----------

Hello,
I found the answer. Thanks for alerting me:

awk 'FNR==NR {S[$1];next} {printf "%s", $1; for (s in S) printf ",%s%s", $1, s; printf "\n"}' suffs root

It was very stupid of me. Sorry for the bother. Thanks once again.

Don_Cragun · January 16, 2018, 7:18pm

You don't need to apologize. Just think before posting. You're capable of doing more than you seem to realize. I'm glad you were able to do what you needed to do.

Always show us what you have tried when posting a question. It helps us understand where you are stuck and helps us provide better guidance.

gimley · January 17, 2018, 2:06am

Thanks for your kind words. I am 70 years old and accustomed to C programming, and I guess in my hurry I forgot to check what is already available.

---------- Post updated 01-17-18 at 02:06 AM ---------- Previous update was 01-16-18 at 10:01 PM ----------

Dear Don,
Sorry to bother you.
I implemented the following awk script to handle the problem of concatenating two files where the first is the suffix file and the second is the root file

awk 'FNR==NR {S[$1];next} {printf "%s", $1; for (s in S) printf "\n%s%s", $1, s; printf "\n"}' suffix root > root.out

The sample root file contains the following

walk
talk
seat
pick
laugh

The suffix file contains 3 suffixes in this order (the order is important):

s
ing
ed

However when the file is generated the order changes and a peculiar sort order is imposed. The output of the file is as under:

walk
walked
walks
walking
talk
talked
talks
talking

I have only pasted the output for the first 2 verbs. As you can see, the order has changed and no longer matches the suffix file. I have gone through the script and cannot detect which part modifies the order. Is it because the files are in UTF-8? I need this format to handle complex scripts like Devanagari or Arabic.
I desperately need the sort order in the suffix file to be retained.
If it is not too much trouble could you please comment the part of the script which modifies the sort order of the output.
Many thanks for your kind help

Don_Cragun · January 17, 2018, 3:20am

In awk, the loop for (var in array) visits the array indices in an unspecified order, not the order in which they were added. If the output order is important, you need to use integer indices and save the values as the array elements (instead of just saving the values as the array indices), then loop over those indices numerically. For example:

awk '
FNR == NR {			# While reading the 1st file...
	suf[++c] = $0		# gather and count suffixes.
	next
}
{				# While reading the 2nd file...
	print			# print verb by itself and...
	for(i = 1; i <= c; i++)
		print $0 suf[i]	# print verb with suffixes in order.
}' Inflections.dic Verbs.dic

with the sample files you provided in post #1 in this thread, produces the output:

walk
walks
walking
walked
talk
talks
talking
talked
seat
seats
seating
seated
pick
picks
picking
picked
laugh
laughs
laughing
laughed
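
If the bare verb should not appear in the output (the sample output in post #1 lists only the inflected forms), the same idea works without the standalone print. The following is only a sketch under a few assumptions: the function name conjugate and the temporary sample files are made up for illustration, and the suffix file must not be empty, because the FNR == NR test cannot tell the two files apart in that case.

```shell
#!/bin/sh
# Hypothetical helper (name chosen for illustration):
#   conjugate SUFFIX_FILE VERB_FILE
# Prints every verb followed by every suffix, keeping suffix-file order.
conjugate() {
	awk '
	FNR == NR {		# 1st file: store suffixes under integer indices
		suf[++c] = $0
		next
	}
	{			# 2nd file: verb + each suffix, in stored order
		for (i = 1; i <= c; i++)
			print $0 suf[i]
	}' "$1" "$2"
}

# Small demonstration using throwaway files in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' s ing ed > "$tmp/Inflections.dic"
printf '%s\n' walk talk > "$tmp/Verbs.dic"
conjugate "$tmp/Inflections.dic" "$tmp/Verbs.dic"
rm -rf "$tmp"
```

With these two sample verbs the demonstration prints walks, walking, walked, talks, talking, talked, one per line. Because the loop counts from 1 to c instead of using for (var in array), the order depends only on the line order of the suffix file.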

Does this help?

gimley · January 17, 2018, 3:26am

Thanks a lot, especially for the code and the precious comments. I always assumed that awk respected the order in the file and did not disturb the same.
You made my day.