Prepend text, different matched 1st letters

I'm prepending each line with your, the, a, or an, depending on its first letter. You'll see my problem below:

sed '/^p\|^f\|^c\|^d\|^l/ s/^/your /' list.txt > your.txt && sed  '/^v\|^j\|^k\|^m\|^n\|^s/ s/^/the /' your.txt > the.txt && sed '/^b\|^g\|^h\|^q\|^r\|^t\|^w\|^z/ s/^/a /' the.txt > a.txt  && sed '/^a\|^e\|^i\|^o\|^u/ s/^/an /' a.txt > an.txt 

Example of bad overlapping output:

an a reinforced hose
an a the non-reinforced hose

Thanks for help.

awk '
/^[pfcdl]/ {$0="your " $0; print ; next}
/^[vjkmns]/ {$0="the " $0; print ; next}
/^[bghqrtwz]/ {$0="a " $0; print ; next}
/^[aeiou]/ {$0="an " $0; print ; next}
1' list.txt

You need to make sure that once a modification is done, it won't be subject to another one (e.g. by adapting the sequence of modifications). Try

sed 's/^[aeiou]/an &/; s/^[bghqrtwz]/a &/; s/^[pfcdl]/your &/; s/^[vjkmns]/the &/ ' file
a reinforced hose
the non-reinforced hose
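
To make the ordering point concrete, here is a minimal sketch that runs the same four substitutions in both orders. The two input lines are taken from the bad output quoted in the question; the variable names are only for illustration.

```shell
# Two sample lines taken from the bad output quoted in the question.
input=$(printf '%s\n' 'reinforced hose' 'non-reinforced hose')

# Question's order (your, the, a, an): each later rule can match the
# first letter of a prefix that an earlier rule has just added.
bad=$(printf '%s\n' "$input" |
  sed 's/^[pfcdl]/your &/; s/^[vjkmns]/the &/; s/^[bghqrtwz]/a &/; s/^[aeiou]/an &/')

# Reordered (an, a, your, the): no added prefix starts with a letter
# that a later rule still looks for.
good=$(printf '%s\n' "$input" |
  sed 's/^[aeiou]/an &/; s/^[bghqrtwz]/a &/; s/^[pfcdl]/your &/; s/^[vjkmns]/the &/')

printf 'question order:\n%s\n\nreordered:\n%s\n' "$bad" "$good"
```

The reordering only works because of which letters happen to be in which set, so it needs rechecking whenever a prefix or a letter group changes.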

Note also that if you're trying to use "a" or "an" before an English word, you need more than just the first letter to make the decision: a hose is correct and an hose is wrong, but an hour is correct and a hour is wrong.

Hello p1ne,

rdrtx1's solution could be modified slightly: instead of concatenating the prefix onto $0 and then printing, print the prefix and the line directly. The following may help you too:

awk '
/^[pfcdl]/ {print "your " $0 ; next}
/^[vjkmns]/ {print "the " $0 ; next}
/^[bghqrtwz]/ {print "a " $0 ;  next}
/^[aeiou]/ {print "an " $0   ; next}
1' list.txt

Thanks,
R. Singh


Thanks rdrtx1, RudiC and R. Singh for the awk and sed examples. Working great! The sed example puts the letters in a bracket expression and avoids double prefixes through the order of its substitutions, which the awk example achieves with next.

Good point Don about "h" words. Awk example works by adding:

/^["hour"]/ {print "an " $0 ; next}

and "a hose" etc. is maintained. But modifying sed example:

sed 's/^\<hour\>/an &/;

prints: an an hour

No. In awk, /^["hour"]/ selects any line starting with ", h, o, u, or r, because the square brackets form a bracket expression that matches a single character from the set. What you want is something considerably more complex, like:

/^(heir([^s]|$)|honest|honor([^s]|$)|hour([^s]|$))/{print "an " $0; next}
/^(heirs|honors|hours)/{print "the " $0; next}

for US English, or:

/^(heir([^s]|$)|honest|honour([^s]|$)|hour([^s]|$))/{print "an " $0; next}
/^(heirs|honours|hours)/{print "the " $0; next}

for UK English. Note that I think this will correctly handle cases like:

an heir
an heiress
the heirs
an heirloom
the heirs' bequests
an heir's bequest
an hour
an hourly
the hours
the honors
an honorific

but I am not at all sure that this list of exceptions is anywhere close to complete.
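
As a rough sketch of how these exception rules could sit in front of the first-letter rules from earlier in the thread, assuming the US spellings: the function name add_article and the sample words are made up for illustration, and the exception list is still as incomplete as noted above.

```shell
# Hypothetical wrapper; reads lines on stdin and prints them with a prefix.
add_article() {
  awk '
    # Silent-h exceptions (US spellings) are tested before the general rules.
    /^(heir([^s]|$)|honest|honor([^s]|$)|hour([^s]|$))/ { print "an " $0; next }
    /^(heirs|honors|hours)/                             { print "the " $0; next }
    # General first-letter rules from earlier in the thread.
    /^[pfcdl]/    { print "your " $0; next }
    /^[vjkmns]/   { print "the " $0; next }
    /^[bghqrtwz]/ { print "a " $0; next }
    /^[aeiou]/    { print "an " $0; next }
    # Anything else is printed unchanged.
    1
  '
}

result=$(printf '%s\n' 'hour' 'hours' 'hose' 'heirloom' 'elbow' | add_article)
printf '%s\n' "$result"
```

Because every rule ends in next, a line is handled by the first rule that matches it, so the exceptions have to come before the general h rule.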

-----------------------

Update: The above does not take herb into account. And, you can't always tell how to handle it just from the spelling. Some men's names (both Herb and Herbert) have a silent H and some have a verbalized H. Since your code only processes lower case letters, maybe you don't care about proper names.


No, on its own it does not. It only does so when embedded in the sed script given, in two steps: first hour is replaced by an hour, then the leading a of an matches the vowel rule and an hour becomes an an hour. And this is exactly what I said above: you'll need to keep it from being modified twice.

I'm sorry, by "modifying sed example" I meant that, embedded in your example, it prints "an an hour". I understand why (the 2nd statement matches the "a" of the "an" added by the 1st), so how do I fix it?

sed 's/^\<hour\>/an &/; s/^[aeiou]/an &/; s/^[bcdfghjklmnpqrstvwxyz]/a &/ ' test > test2

---------- Post updated at 08:38 AM ---------- Previous update was at 08:27 AM ----------
Don:
Thanks for taking the time to demonstrate awk example. I must have been confused before when I ran array ["hour"] and thought I saw the correct result. Just re-tested and indeed it doesn't work. So pipes | are used for complete words.

You might want to make sed quit after the first modification.
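
One way to read this suggestion is sed's t command, which jumps to the end of the script once a substitution has succeeded on the current line. The sketch below makes that assumption. The function name fix_articles is hypothetical, and the GNU-only \< and \> word anchors are dropped for portability, so the first rule also matches words such as hours or hourly.

```shell
# Hypothetical helper; reads lines on stdin.
fix_articles() {
  sed '
    s/^hour/an &/
    t
    s/^[aeiou]/an &/
    t
    s/^[bcdfghjklmnpqrstvwxyz]/a &/
  '
}

result=$(printf '%s\n' 'hour' 'apple' 'hose' | fix_articles)
printf '%s\n' "$result"
```

A t with no label branches to the end of the script, so once "an " has been added the remaining substitutions never see the new text. Putting each command on its own line avoids differences in how sed implementations parse what follows t.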

The pipe symbols in the extended regular expressions used in awk (and grep -E and some other utilities) separate alternatives. They don't have to be complete words. The line of code I suggested:

/^(heirs|honors|hours)/{print "the " $0; next}

could also be written as:

/^h(eir|onor|our)s/{print "the " $0; next}

(which looks for the letter "h" at the start of a line followed by one of the strings "eir", "onor", or "our" followed by the letter "s") and get exactly the same results with less typing. I just find the first form easier for many of the novices who read this forum to understand. (And the longer form keeps automatic spell checkers from trying to correct non-existent typos. :eek: )
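
A quick check of that equivalence, as a sketch with made-up sample words: both patterns are run over the same input and the results compared.

```shell
# Sample words chosen for illustration; two of them should not match.
words=$(printf '%s\n' 'heirs' 'honors' 'hours' 'hour' 'horse')

# Long form: three complete alternatives.
long=$(printf '%s\n' "$words" | awk '/^(heirs|honors|hours)/ { print "the " $0; next } 1')

# Short form: shared "h" and "s" factored out of the alternation.
short=$(printf '%s\n' "$words" | awk '/^h(eir|onor|our)s/ { print "the " $0; next } 1')

if [ "$long" = "$short" ]; then
  echo "same output"
fi
printf '%s\n' "$short"
```

Only the three plural forms get a prefix in either version; hour and horse fall through to the final 1 and are printed unchanged.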
