Script Shell Extract substring

Hi all,

I'd like to extract the string just before '.fr.'.
Here are some lines from my file:

g-82.text.text1.fr.worker1
g-xx.yyyyyy.zzzz.fr.worker2

I'd like to extract this text:

g-82.text.text1
g-xx.yyyyyy.zzzz

Which command should I use in my shell script?

Thank you very much.
Best Regards.

Hi, if it's bash, this should work fine:

while read input
do  echo "${input/.fr.*}"
done<file

Prints:

g-82.text.text1
g-xx.yyyyyy.zzzz

hth

EDIT:
And to illustrate the substitution:

PATTERN=".fr."
VARIABLE="g-xx.yyyyyy.zzzz.fr.worker2"
echo "${VARIABLE/$PATTERN}"

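A single / after the variable name, as in ${VARIABLE/PATTERN}, removes the first match of the pattern. The pattern is a shell glob, so * and ? are expanded unless quoted or escaped.
Adding a second / followed by a string replaces the first match with that string instead of deleting it.
Example:

echo "${VARIABLE/$SEARCH/$REPLACE}"

However, this applies only to the first occurrence found! To replace every occurrence, start the pattern with a double slash: ${VARIABLE//$SEARCH/$REPLACE}.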
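To make the three forms concrete, here is a small bash-only sketch. The variable names are just for illustration, and the sample value is one of the lines from the question; the comments show what I would expect each expansion to produce.

```shell
# bash only: the ${var/pattern/replacement} forms are not POSIX sh.
line="g-xx.yyyyyy.zzzz.fr.worker2"

# One slash, no replacement: delete the first match of the glob pattern.
trimmed="${line/.fr.*}"        # g-xx.yyyyyy.zzzz

# One slash plus a replacement: replace the first match only.
first_only="${line/./_}"       # g-xx_yyyyyy.zzzz.fr.worker2

# Double slash: replace every match.
all_matches="${line//./_}"     # g-xx_yyyyyy_zzzz_fr_worker2

printf '%s\n' "$trimmed" "$first_only" "$all_matches"
```

Note that in a glob pattern the period is literal, which is why ./_ only touches real periods.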

And, with any POSIX-conforming shell (e.g., ash , bash , dash , ksh , etc.) you could use:

while read -r input
do	echo "${input%.fr.*}"
done<file
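One detail worth knowing about the % expansion: a single % removes the shortest matching suffix, while %% removes the longest. A quick sketch, assuming a hypothetical line in which .fr. appears twice:

```shell
# Hypothetical input with two ".fr." components.
line="g-xx.fr.zzzz.fr.worker3"

# % strips the shortest suffix matching the glob: only the last ".fr.*" goes.
shortest="${line%.fr.*}"       # g-xx.fr.zzzz

# %% strips the longest suffix: everything from the first ".fr." goes.
longest="${line%%.fr.*}"       # g-xx

printf '%s\n' "$shortest" "$longest"
```

Which of the two is right depends on whether the text before the final .fr. may itself contain .fr., which the question does not say.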

or with sed :

sed 's/[.]fr[.].*//' file

or with awk :

awk '{sub(/[.]fr[.].*/, "")}1' file

or ...

You could use :

sed 's/\(.fr.*\)//g' file

HTH

No.

With an input line like:

g-xx.Africa.zzzz.fr.worker3

the above sed command would produce the output:

g-xx.

instead of the requested:

g-xx.Africa.zzzz

In a filename matching pattern, . matches a period. But in a basic regular expression (which is what is used in sed ), an unadorned . matches any single character.
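The difference can be checked side by side. This is only a sketch using the Africa line from above; the variable names are mine.

```shell
line="g-xx.Africa.zzzz.fr.worker3"

# Unadorned dots: ".fr" matches "Afr" in "Africa", so too much is removed.
loose=$(printf '%s\n' "$line" | sed 's/.fr.*//')

# Bracketed dots match only literal periods.
strict=$(printf '%s\n' "$line" | sed 's/[.]fr[.].*//')

printf 'loose:  %s\nstrict: %s\n' "$loose" "$strict"
```

The first command should print g-xx. and the second g-xx.Africa.zzzz.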

A few things you might want to think about:
\(\) : you are telling sed to capture the matched part. Since you are not using it, it is a waste of effort.
g : matching as many times as possible in a line is not the intention in this case.

How about

$ cat > tmp
g-82.text.text1.fr.worker1
g-xx.yyyyyy.zzzz.fr.worker2


$ awk -F".fr" '{print $1}' tmp
g-82.text.text1
g-xx.yyyyyy.zzzz

This suffers from exactly the same problem noted in post #5 in this thread!

This might work?

$ cat > tmp
g-82.text.text1.fr.worker1
g-xx.yyyyyy.zzzz.fr.worker2
g-xx.Africa.zzzz.fr.worker3


$ awk -F"[.]fr" '{print $1}' tmp
g-82.text.text1
g-xx.yyyyyy.zzzz
g-xx.Africa.zzzz

No. Try adding an input line:

g-xx.freezer.zzzz.fr.worker4

You have to match both periods; not just the first one. But, it would work if you used:

-F'[.]fr[.]'
   or 
-F'\\.fr\\.'

instead of:

-F"[.]fr"
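A quick way to compare the two field separators, sketched with the freezer line from above (variable names are only for illustration):

```shell
line="g-xx.freezer.zzzz.fr.worker4"

# Separator matches only the leading period, so ".fr" in ".freezer" splits the line early.
one_dot=$(printf '%s\n' "$line" | awk -F'[.]fr' '{print $1}')

# Separator requires a literal period on both sides of "fr".
two_dots=$(printf '%s\n' "$line" | awk -F'[.]fr[.]' '{print $1}')

printf 'one period:  %s\ntwo periods: %s\n' "$one_dot" "$two_dots"
```

The first should give g-xx and the second g-xx.freezer.zzzz.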

Perhaps a sed like :

user@host:~/unix$ ls g*
g-82.text.text1.fr.worker1   g-xx.yyyyyy.zzzz.fr.worker2
g-xx.Africa.zzzz.fr.worker3
user@host:~/unix$ ls g* | sed "s/\(.*\)\(.fr.*\)/\1/g"
g-82.text.text1
g-xx.Africa.zzzz
g-xx.yyyyyy.zzzz

So \1 is what you want and \2 (which is not used) is the extension being removed from the ls output.

In post #3 in this thread, I suggested simple, clear shell, sed and awk scripts that did exactly what was requested. Since then there have been several proposals that, although they work with the sample provided, would not work in the general case. Peasant's proposal above is more complicated than the sed suggestion I made earlier:

sed 's/[.]fr[.].*//' file

and, it works fine as long as the final string after the .fr. in the filenames being processed is literally worker followed by a string of digits. But, it fails if the two characters fr can ever appear anywhere in the filenames after the .fr. such as:

g-82.text.text1.fr.Alfred_Jones

With the sparse specification of the allowed filename format, we don't know if this is a problem in this case or not.
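For anyone who wants to see the failure, here is a minimal sketch using the Alfred_Jones name above. The greedy .* in the first group backs off only as far as the last place where some character is followed by fr, which here is inside Alfred.

```shell
line="g-82.text.text1.fr.Alfred_Jones"

# Greedy capture with unescaped dots: the split lands on "lfr" inside "Alfred".
greedy=$(printf '%s\n' "$line" | sed 's/\(.*\)\(.fr.*\)/\1/')

# Literal periods, no capture groups: removes from the first ".fr." onward.
literal=$(printf '%s\n' "$line" | sed 's/[.]fr[.].*//')

printf 'greedy:  %s\nliteral: %s\n' "$greedy" "$literal"
```

I would expect g-82.text.text1.fr.A from the first command and g-82.text.text1 from the second.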

Thanks Don Cragun and Aia for clarification.

I have a question. What if the text file looks like this:

cat file.txt
g-xx.fr.zzzz.fr.worker3
g-82.text.text1.fr.worker1
g-xx.yyyyyy.zzzz.fr.worker2

then -

sed 's/[.]fr[.].*//' file.txt

will result -

g-xx
g-82.text.text1
g-xx.yyyyyy.zzzz

I hope the one below is suitable if the file structure remains the same.

awk -F"." '{print $1"."$2"."$3}' file.txt

or

awk -F'.' 'BEGIN{OFS="." ;} {print $1,$2,$3}' file.txt

this will result -

g-xx.fr.zzzz
g-82.text.text1
g-xx.yyyyyy.zzzz

We assume that you realize that the requirements you have presented here are completely different from the requirements you presented in post #1 in this thread (where the request was to remove .fr. and everything following it). But, yes, with these new requirements, the above awk script does what you want.

Don,
If you do not mind, could you explain why you chose to use a character class to represent the single period instead of escaping it?

sed 's/[.]fr[.].*//' file.txt

It's easier for me to use a single construct in a BRE or ERE that never changes depending on context.

All of the following commands remove the 1st two adjacent periods from each input line processed:

sed 's/\.\.//'
awk '{sub("\\.\\.","")}1'
awk -v RE="\\\\.\\\\." '{sub(RE,"")}1'
sed 's/[.][.]//'
awk '{sub("[.][.]","")}1'
awk -v RE='[.][.]' '{sub(RE,"")}1'

Note the varying number of backslashes required when using escapes (one, two, or four for each escaped period), depending on context. But when using the character class expression, the same string works in all three cases.
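As a sanity check, the bracket form can be run through all three contexts with one test string. This is just a sketch; the sample string and variable names are made up for the demonstration.

```shell
# Hypothetical sample with two pairs of adjacent periods.
sample="a..b..c"

# The same [.][.] text is used unchanged in a sed script,
# in an awk string literal, and in an awk -v assignment.
by_sed=$(printf '%s\n' "$sample" | sed 's/[.][.]//')
by_awk_literal=$(printf '%s\n' "$sample" | awk '{sub("[.][.]","")}1')
by_awk_var=$(printf '%s\n' "$sample" | awk -v RE='[.][.]' '{sub(RE,"")}1')

printf '%s\n' "$by_sed" "$by_awk_literal" "$by_awk_var"
```

Each command should print ab..c, since only the first pair of periods is removed.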

Don,
Thank you for explaining your thought process.