Script Shell Extract substring

chercheur111 · June 27, 2015, 10:03am

Hi all,

Please, I'd like to extract the string just before '.fr'.
Here are some lines from my file:

g-82.text.text1.fr.worker1
g-xx.yyyyyy.zzzz.fr.worker2

I'd like to extract this text:

g-82.text.text1
g-xx.yyyyyy.zzzz

Which command should I use in my shell script?

Thank you very much.
Best Regards.

sea · June 27, 2015, 11:44am

Hi, if it's bash, this should work fine:

while read input
do  echo "${input/.fr.*}"
done<file

Prints:

g-82.text.text1
g-xx.yyyyyy.zzzz

hth

EDIT:
And to illustrate the substitution:

PATTERN=".fr."
VARIABLE="g-xx.yyyyyy.zzzz.fr.worker2"
echo "${VARIABLE/$PATTERN}"

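Using a single / with no replacement, as in ${VARIABLE/PATTERN}, removes the first match of the pattern. The pattern is a shell glob, so * and ? are expanded unless quoted or escaped.
Adding a second / followed by a replacement, as in ${VARIABLE/SEARCH/REPLACE}, replaces the matched string with the replacement string.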
Example:

echo "${VARIABLE/$SEARCH/$REPLACE}"

However, this applies only to the first occurrence found! Use ${VARIABLE//SEARCH/REPLACE} to replace every occurrence.
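
To make the three forms concrete, here is a small bash sketch. The variable names are only for illustration, and the sample line deliberately contains .fr. twice so the first-match behaviour is visible:

```shell
#!/usr/bin/env bash
# Illustration only: bash pattern substitution on one sample line.
line="g-xx.fr.zzzz.fr.worker2"

removed="${line/.fr.}"     # no replacement: delete the first ".fr."
first="${line/.fr./_}"     # replace only the first ".fr."
all="${line//.fr./_}"      # doubled slash: replace every ".fr."

printf '%s\n' "$removed" "$first" "$all"
```

Note that these substitutions are a bash/ksh feature and are not available in every POSIX shell.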

Don_Cragun · June 27, 2015, 3:44pm

And, with any POSIX-conforming shell (e.g., ash, bash, dash, ksh, etc.) you could use:

while read -r input
do	echo "${input%.fr.*}"
done<file

or with sed:

sed 's/[.]fr[.].*//' file

or with awk:

awk '{sub(/[.]fr[.].*/, "")}1' file

or ...
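
For anyone who wants the shell version as a reusable piece, here is a minimal sketch that relies only on POSIX parameter expansion. The function name strip_fr is hypothetical, and printf is used instead of echo so that unusual input is printed as-is:

```shell
# Sketch: strip ".fr." and everything after it using only POSIX
# parameter expansion. strip_fr is a hypothetical helper name.
strip_fr() {
    # % removes the shortest suffix matching the shell pattern ".fr.*"
    printf '%s\n' "${1%.fr.*}"
}

out1=$(strip_fr "g-82.text.text1.fr.worker1")
out2=$(strip_fr "g-xx.Africa.zzzz.fr.worker3")
out3=$(strip_fr "no-marker-here")   # no ".fr." present: returned unchanged
printf '%s\n' "$out1" "$out2" "$out3"
```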

Mannu2525 · June 27, 2015, 4:32pm

You could use:

sed 's/\(.fr.*\)//g' file

HTH

Don_Cragun · June 27, 2015, 5:05pm

No.

With an input line like:

g-xx.Africa.zzzz.fr.worker3

the above sed command would produce the output:

g-xx.

instead of the requested:

g-xx.Africa.zzzz

In a filename matching pattern, . matches a period. But in a basic regular expression (which is what sed uses), an unadorned . matches any single character.
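
The difference can be checked side by side. The following is a small sketch using the Africa line from this post; the variable names are only for illustration:

```shell
line="g-xx.Africa.zzzz.fr.worker3"

# Shell pattern: "." is literal, so only the real ".fr." suffix is removed.
glob_result="${line%.fr.*}"

# BRE with a bare ".": "Afr" also matches ".fr", so too much is deleted.
bare_dot=$(printf '%s\n' "$line" | sed 's/.fr.*//')

# BRE with bracketed periods: only a literal ".fr." matches.
bracketed=$(printf '%s\n' "$line" | sed 's/[.]fr[.].*//')

printf '%s\n' "$glob_result" "$bare_dot" "$bracketed"
```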

Aia · June 27, 2015, 5:26pm

A few things you might want to think about:
\( \) : you are telling sed to capture the matched part. Since you are not using it, it is a waste of effort.
g : matching as many times as possible in a line is not the intention in this case.

senhia83 · June 27, 2015, 6:15pm

How about

$ cat > tmp
g-82.text.text1.fr.worker1
g-xx.yyyyyy.zzzz.fr.worker2


$ awk -F".fr" '{print $1}' tmp
g-82.text.text1
g-xx.yyyyyy.zzzz

Don_Cragun · June 27, 2015, 6:30pm

This suffers from exactly the same problem noted in post #5 in this thread!

senhia83 · June 27, 2015, 6:44pm

This might work?

$ cat > tmp
g-82.text.text1.fr.worker1
g-xx.yyyyyy.zzzz.fr.worker2
g-xx.Africa.zzzz.fr.worker3


$ awk -F"[.]fr" '{print $1}' tmp
g-82.text.text1
g-xx.yyyyyy.zzzz
g-xx.Africa.zzzz

Don_Cragun · June 27, 2015, 7:02pm

No. Try adding an input line:

g-xx.freezer.zzzz.fr.worker4

You have to match both periods; not just the first one. But, it would work if you used:

-F'[.]fr[.]'
   or 
-F'\\.fr\\.'

instead of:

-F"[.]fr"
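
A quick way to see the difference between the two separators is to run both on the tricky lines from this thread. This is only a sketch; the variable names are made up for the example:

```shell
# Sample lines, including the two tricky ones from the thread.
input='g-82.text.text1.fr.worker1
g-xx.Africa.zzzz.fr.worker3
g-xx.freezer.zzzz.fr.worker4'

# Separator with one period only: ".fr" inside ".freezer" also splits.
loose=$(printf '%s\n' "$input" | awk -F'[.]fr' '{print $1}')

# Separator requires both periods, so only ".fr." splits the line.
strict=$(printf '%s\n' "$input" | awk -F'[.]fr[.]' '{print $1}')

printf '%s\n---\n%s\n' "$loose" "$strict"
```

With the loose separator the freezer line is cut down to g-xx; with the strict one it comes out as g-xx.freezer.zzzz.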

Peasant · June 28, 2015, 1:26am

Perhaps a sed like:

user@host:~/unix$ ls g*
g-82.text.text1.fr.worker1   g-xx.yyyyyy.zzzz.fr.worker2
g-xx.Africa.zzzz.fr.worker3
user@host:~/unix$ ls g* | sed "s/\(.*\)\(.fr.*\)/\1/g"
g-82.text.text1
g-xx.Africa.zzzz
g-xx.yyyyyy.zzzz

So \1 is what you want and \2 (which is not used) is the extension being removed from the ls output.

Don_Cragun · June 28, 2015, 2:53am

In post #3 in this thread, I suggested simple, clear shell, sed and awk scripts that did exactly what was requested. Since then there have been several proposals that, although they work with the sample provided, would not work in the general case. Peasant's proposal above is more complicated than the sed suggestion I made earlier:

sed 's/[.]fr[.].*//' file

and, it works fine as long as the final string after the .fr. in the filenames being processed is literally worker followed by a string of digits. But, it fails if the two characters fr can ever appear anywhere in the filenames after the .fr. such as:

g-82.text.text1.fr.Alfred_Jones

With the sparse specification of the allowed filename format, we don't know if this is a problem in this case or not.
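
To illustrate the failure case, here is a short sketch that runs both sed commands on the Alfred_Jones name (variable names are only for the example):

```shell
line="g-82.text.text1.fr.Alfred_Jones"

# Greedy \(.*\) grabs as much as it can, so the bare ".fr" ends up
# matching the "lfr" inside "Alfred".
greedy=$(printf '%s\n' "$line" | sed 's/\(.*\)\(.fr.*\)/\1/')

# Literal periods and no capture groups: cut at the first ".fr.".
simple=$(printf '%s\n' "$line" | sed 's/[.]fr[.].*//')

printf '%s\n' "$greedy" "$simple"
```

The greedy version leaves g-82.text.text1.fr.A, while the simpler command yields g-82.text.text1.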

Mannu2525 · June 30, 2015, 1:44am

Thanks Don Cragun and Aia for the clarification.

I have a question. What if the text file looks like this:

cat file.txt
g-xx.fr.zzzz.fr.worker3
g-82.text.text1.fr.worker1
g-xx.yyyyyy.zzzz.fr.worker2

then -

sed 's/[.]fr[.].*//' file.txt

will result -

g-xx
g-82.text.text1
g-xx.yyyyyy.zzzz

I hope the one below is suitable if the file structure remains the same.

awk -F"." '{print $1"."$2"."$3}' file.txt

or

awk -F'.' 'BEGIN{OFS="." ;} {print $1,$2,$3}' file.txt

this will result -

g-xx.fr.zzzz
g-82.text.text1
g-xx.yyyyyy.zzzz

Don_Cragun · June 30, 2015, 3:22am

We assume that you realize that the requirements you have presented here are completely different from the requirements you presented in post #1 in this thread (where the request was to remove .fr. and everything following it). But, yes, with these new requirements, the above awk script does what you want.
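
If the intent were instead to cut at the last .fr. on the line rather than the first (an assumption, since the requirement was never stated that way), the following sketch shows how the choice could be made explicit in the shell and in sed:

```shell
line="g-xx.fr.zzzz.fr.worker3"

from_last="${line%.fr.*}"     # shortest suffix: cut at the last ".fr."
from_first="${line%%.fr.*}"   # longest suffix: cut at the first ".fr."

# sed equivalent of cutting at the last ".fr.": the greedy \(.*\)
# pushes the literal ".fr." match as far right as possible.
sed_last=$(printf '%s\n' "$line" | sed 's/\(.*\)[.]fr[.].*/\1/')

printf '%s\n' "$from_last" "$from_first" "$sed_last"
```

Unlike the field-counting awk commands, these do not depend on there being exactly three dot-separated fields before the marker.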

Aia · June 30, 2015, 12:34pm

Don,
Would you mind explaining why you chose to use a character class to represent the single period instead of escaping it?

sed 's/[.]fr[.].*//' file.txt

Don_Cragun · June 30, 2015, 2:14pm

It's easier for me to use a single construct in a BRE or ERE that never changes depending on context.

All of the following commands remove the first two adjacent periods from each input line processed:

sed 's/\.\.//'
awk '{sub("\\.\\.","")}1'
awk -v RE="\\\\.\\\\." '{sub(RE,"")}1'
sed 's/[.][.]//'
awk '{sub("[.][.]","")}1'
awk -v RE='[.][.]' '{sub(RE,"")}1'

Note the varying number of backslashes required when using escapes (one, two, or four for each escaped period), depending on context. But when using the character class expression, the same string works in all three cases.
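
As a quick check of the bracket-expression forms, here is a small sketch on a made-up sample string, with an unescaped .. shown for contrast:

```shell
sample='a..b'

# Each command should delete the first pair of adjacent periods.
r1=$(printf '%s\n' "$sample" | sed 's/[.][.]//')
r2=$(printf '%s\n' "$sample" | awk '{sub("[.][.]","")}1')
r3=$(printf '%s\n' "$sample" | awk -v RE='[.][.]' '{sub(RE,"")}1')

# Contrast: an unescaped ".." matches any two characters.
r4=$(printf '%s\n' "$sample" | sed 's/..//')

printf '%s\n' "$r1" "$r2" "$r3" "$r4"
```

The first three print ab; the unescaped version removes the leading "a." instead and prints .b.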

Aia · June 30, 2015, 2:31pm

Don,
Thank you for explaining your thought process.