Prepend text, different matched 1st letters

p1ne · August 1, 2016, 1:57pm

Prepending lines with: your, the, a or an based on 1st letter match. You'll see my problem below:

sed '/^p\|^f\|^c\|^d\|^l/ s/^/your /' list.txt > your.txt && sed  '/^v\|^j\|^k\|^m\|^n\|^s/ s/^/the /' your.txt > the.txt && sed '/^b\|^g\|^h\|^q\|^r\|^t\|^w\|^z/ s/^/a /' the.txt > a.txt  && sed '/^a\|^e\|^i\|^o\|^u/ s/^/an /' a.txt > an.txt

Example of bad overlapping output:

an a reinforced hose
an a the non-reinforced hose

Thanks for help.

rdrtx1 · August 1, 2016, 2:15pm

awk '
/^[pfcdl]/ {$0="your " $0; print ; next}
/^[vjkmns]/ {$0="the " $0; print ; next}
/^[bghqrtwz]/ {$0="a " $0; print ; next}
/^[aeiou]/ {$0="an " $0; print ; next}
1' list.txt

RudiC · August 1, 2016, 3:07pm

You need to make sure that once a modification is done, it won't be subject to another one (e.g. by adapting the sequence of modifications). Try

sed 's/^[aeiou]/an &/; s/^[bghqrtwz]/a &/; s/^[pfcdl]/your &/; s/^[vjkmns]/the &/ ' file
a reinforced hose
the non-reinforced hose
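
The order works because no prepended word starts with a letter that a later substitution tests: "an" and "a" start with a vowel, which only the first rule matches; "your" starts with y, which is in none of the sets; and "the" is added last. A small check of that reasoning, where prefix_lines is a hypothetical wrapper name and the sample lines are made up for illustration:

```shell
# Hypothetical wrapper around the one-pass sed script shown above.
prefix_lines() {
    sed 's/^[aeiou]/an &/; s/^[bghqrtwz]/a &/; s/^[pfcdl]/your &/; s/^[vjkmns]/the &/' "$@"
}

# Each line is modified at most once, because no later rule
# matches the first letter of an already prepended word.
printf '%s\n' 'reinforced hose' 'non-reinforced hose' 'apple' 'pump' | prefix_lines
# a reinforced hose
# the non-reinforced hose
# an apple
# your pump
```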

Don_Cragun · August 1, 2016, 3:25pm

Note also that if you're trying to use "a" or "an" before an English word, you need more than just the first letter to make the decision: a hose is correct and an hose is wrong, but an hour is correct and a hour is wrong.

RavinderSingh13 · August 1, 2016, 3:33pm

Hello p1ne,

rdrtx1's solution can be simplified a little: instead of concatenating the prefix onto the line ($0) and then printing it, print the prefix and the line directly. The following may help you as well.

awk '
/^[pfcdl]/ {print "your " $0 ; next}
/^[vjkmns]/ {print "the " $0 ; next}
/^[bghqrtwz]/ {print "a " $0 ;  next}
/^[aeiou]/ {print "an " $0   ; next}
1' list.txt

Thanks,
R. Singh

p1ne · August 1, 2016, 4:34pm

Thanks rdrtx1, RudiC and R. Singh for awk and sed examples. Working great! The sed example puts letters in array and avoids overwriting by &/; which the awk example does by next.

Good point Don about "h" words. Awk example works by adding:

/^["hour"]/ {print "an " $0 ; next}

and "a hose" etc. is maintained. But modifying sed example:

sed 's/^\<hour\>/an &/;

prints: an an hour

Don_Cragun · August 1, 2016, 6:54pm

No. In awk, /^["hour"]/ is a bracket expression that matches a single character, so it selects any line starting with ", h, o, u, or r. What you want is something considerably more complex like:

/^(heir([^s]|$)|honest|honor([^s]|$)|hour([^s]|$))/{print "an " $0; next}
/^(heirs|honors|hours)/{print "the " $0; next}

for US English, or:

/^(heir([^s]|$)|honest|honour([^s]|$)|hour([^s]|$))/{print "an " $0; next}
/^(heirs|honours|hours)/{print "the " $0; next}

for UK English. Note that I think this will correctly handle cases like:

an heir
an heiress
the heirs
an heirloom
the heirs' bequests
an heir's bequest
an hour
an hourly
the hours
the honors
an honorific

but I am not at all sure that this list of exceptions is anywhere close to complete.
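
For reference, here is a sketch of how the exception rules above might be combined with the general rules from earlier in the thread. It assumes the US spellings, places the exceptions first so that next keeps the general "h" rule from also firing, and uses prepend_article as a hypothetical function name. It is not meant as a complete treatment of silent-h words.

```shell
# Hypothetical wrapper; reads the named files, or stdin if none are given.
prepend_article() {
    awk '
    # Silent-h exceptions (US spellings) must come before the general rules.
    /^(heir([^s]|$)|honest|honor([^s]|$)|hour([^s]|$))/ {print "an " $0; next}
    /^(heirs|honors|hours)/ {print "the " $0; next}
    # General first-letter rules from earlier in the thread.
    /^[pfcdl]/ {print "your " $0; next}
    /^[vjkmns]/ {print "the " $0; next}
    /^[bghqrtwz]/ {print "a " $0; next}
    /^[aeiou]/ {print "an " $0; next}
    # Anything else is printed unchanged.
    1' "$@"
}

printf '%s\n' hour hours hose heiress | prepend_article
# an hour
# the hours
# a hose
# an heiress
```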

-----------------------

Update: The above does not take herb into account. And, you can't always tell how to handle it just from the spelling. Some men's names (both Herb and Herbert) have a silent H and some have a verbalized H. Since your code only processes lower case letters, maybe you don't care about proper names.

RudiC · August 2, 2016, 3:07am

No, it does not, at least not on its own. Only when it is embedded in the sed script given above will it do so, in two steps: first hour is replaced by an hour, then an is replaced by an an. And this is exactly what I said above: you need to keep a line from being modified twice.

p1ne · August 2, 2016, 8:38am

I'm sorry, by "modifying sed example" I meant that, embedded in your example, it prints "an an hour". I understand why (the 2nd statement matches the "a" of the "an" added by the 1st), so how to fix it?

sed 's/^\<hour\>/an &/; s/^[aeiou]/an &/; s/^[bcdfghjklmnpqrstvwxyz]/a &/ ' test > test2

---------- Post updated at 08:38 AM ---------- Previous update was at 08:27 AM ----------
Don:
Thanks for taking the time to demonstrate awk example. I must have been confused before when I ran array ["hour"] and thought I saw the correct result. Just re-tested and indeed it doesn't work. So pipes | are used for complete words.

RudiC · August 2, 2016, 8:46am

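You might want to make sed stop processing a line after the first modification, e.g. with the t command, which branches to the end of the script once a substitution has been made.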
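
One possible way to do that, sketched under the assumption of GNU sed (the \> word boundary is an extension and may not exist in other implementations): a t command with no label branches to the end of the script if a substitution has succeeded on the current line, so the later rules never see the prepended word. The name add_article is hypothetical, and each command is passed with its own -e to avoid relying on a semicolon after t.

```shell
# Hypothetical wrapper around p1ne's three substitutions, with "t" after
# each one so that at most one substitution is applied per line.
add_article() {
    sed -e 's/^hour\>/an &/' -e t \
        -e 's/^[aeiou]/an &/' -e t \
        -e 's/^[bcdfghjklmnpqrstvwxyz]/a &/' "$@"
}

printf '%s\n' hour hose apple | add_article
# an hour
# a hose
# an apple
```

Note that, as in the original attempt, the word boundary means only the exact word "hour" gets the special treatment; "hours" or "hourly" would fall through to the consonant rule.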

Don_Cragun · August 2, 2016, 2:17pm

The pipe symbols in the extended regular expressions used in awk (and grep -E and some other utilities) separate alternatives. They don't have to be complete words. The line of code I suggested:

/^(heirs|honors|hours)/{print "the " $0; next}

could also be written as:

/^h(eir|onor|our)s/{print "the " $0; next}

(which looks for the letter "h" at the start of a line followed by one of the strings "eir", "onor", or "our" followed by the letter "s") and gets exactly the same results with less typing. I just find the first form easier for many of the novices who read this forum to understand. (And the longer form keeps automatic spell checkers from trying to correct non-existent typos.)
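
A quick way to convince yourself that the two forms select the same lines is to run both through grep -E on a small word list. The list below is made up for illustration.

```shell
# Sample words, one per line (illustrative only).
words='heirs
honors
hours
heir
hose
hourly'

# Long form: complete alternatives. Short form: shared "h" prefix and "s" suffix.
long=$(printf '%s\n' "$words" | grep -E '^(heirs|honors|hours)')
short=$(printf '%s\n' "$words" | grep -E '^h(eir|onor|our)s')

[ "$long" = "$short" ] && echo "same matches"
printf '%s\n' "$long"
# same matches
# heirs
# honors
# hours
```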