sed - extracting the first word only if match

jcdole · October 1, 2013, 5:58pm

Hello.

From a text file, I want to get only the first word (up to the first blank) following code=

grep -i -e "WORD1" "/path/to/text/file.txt" | sed -n 's/WORD1[ ]\+//p' | sed -n 's/code=/\1/p'

It returns an error:

sed: -e expression #1, char 12: invalid reference \1 on `s' command's RHS

For debugging, I tried this:

asus:~ # echo "WORD1 code1=value1-idkey1 code2=value1" |  sed -n 's/WORD1[ ]\+//p' | sed -n -e 's/code1=//p'
value1-idkey1 code2=value1
asus:~ #

Then, to remove the end of the line, I tried this, but got an error:

asus:~ # echo "WORD1 code1=value1-idkey1 code2=value1"  |  sed -n 's/WORD1[ ]\+//p' | sed -n -e 's/code1=/\1/p'
sed: -e expression #1, char 13: invalid reference \1 on `s' command's RHS
asus:~ #

I want this :

asus:~ # echo "WORD1 code1=value1-idkey1 code2=value1"  |  sed -n 's/WORD1[ ]\+//p' | sed -n -e 'xxxxxxxxxxxxxxxxx'
value1-idkey1
asus:~ #

Any help is welcome.

disedorgue · October 1, 2013, 6:22pm

Hi,
You must capture a group in the pattern if you want to use \1 in the replacement:

echo "WORD1 code1=value1-idkey1 code2=value1"  |  sed -n 's/.*code1=\([^ ]\+\).*/\1/p'

\1 ==> the text captured between \( and \)
Regards.
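
To make the point about \1 concrete, here is a minimal sketch built on the sample line from the question. The variable names (line, no_group_status, with_group) are only illustrative, and [^ ]* is used instead of [^ ]\+ because \+ is a GNU extension; the idea is otherwise the same as the command above.

```shell
line='WORD1 code1=value1-idkey1 code2=value1'

# No \( \) group in the pattern, so \1 refers to nothing and sed rejects the script.
if printf '%s\n' "$line" | sed -n 's/code1=/\1/p' 2>/dev/null; then
    no_group_status=ok
else
    no_group_status=error
fi
echo "without a group: $no_group_status"

# With a group: the .* on both sides swallow the rest of the line,
# so the whole line is replaced by the captured text.
with_group=$(printf '%s\n' "$line" | sed -n 's/.*code1=\([^ ]*\).*/\1/p')
echo "with a group: $with_group"
```

The first command fails for the same reason as the original attempt: a back-reference needs a matching group on the left-hand side of the s command.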

MadeInGermany · October 1, 2013, 10:09pm

Include WORD1 in the search:

sed -n 's/WORD1[ ]\+code1=\([^ ]\+\).*/\1/p'

Jotne · October 2, 2013, 4:51am

awk -F"[ =]" '/WORD1/ {print $3}' file
value1-idkey1

jcdole · October 2, 2013, 6:53am

Thank you everybody.

But it would be great if you could explain how it works.

MadeInGermany · October 2, 2013, 7:07am

sed:
the \1 in the substitution returns what is matched within the \( \) in the search.
The search must end with .* , so the rest of the line is discarded.
awk:
by using a clever set of field delimiters (" " and "="), the input line is split into "WORD1", "code1", "value1-idkey1", "code2", "value1".
They can be referenced as $1, $2, $3, $4, $5, respectively.
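
Both points can be checked on the sample line with a small sketch. The variable names (line, no_tail, fields) are made up for illustration, and the loop over NF is just one way to list the fields, not something taken from the answers above.

```shell
line='WORD1 code1=value1-idkey1 code2=value1'

# sed: without the trailing .*, only the text up to the end of the group is
# replaced, so the remainder of the line survives.
no_tail=$(printf '%s\n' "$line" | sed -n 's/.*code1=\([^ ]*\)/\1/p')
echo "$no_tail"

# awk: split on a single space or "=", then list every field with its number.
fields=$(printf '%s\n' "$line" |
    awk -F'[ =]' '/WORD1/ { for (i = 1; i <= NF; i++) print "$" i " = " $i }')
echo "$fields"
```

One thing to watch with this separator: because [ =] is a regular expression, each single space counts as a delimiter, so a line with two consecutive blanks would produce an empty field and shift the numbering.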

jethrow · October 3, 2013, 12:24am

Here's another example:

echo "WORD1 code1=value1-idkey1 code2=value1" | perl -pe 's;.*?=(\S+).*;$1;'

Jotne · October 3, 2013, 1:37am

awk -F"[ =]" '		#setting the field separator to space or equal
	/WORD1/ {	#search for lines containing "WORD1"
	print $3	#print field number 3
	}' file

jcdole · October 17, 2013, 6:40am

Thank you everybody for your help.
The simple tests I made with sed in a shell script work well with your recommendations.

Now I have written a small piece of code using Qt Creator.

My question is about regular expressions, not Qt.

In terms of regular expressions, what is the meaning of:

"code=(.+)\\s"

I think my problem arises because I must specify (I don't know how!) that the word ends at the first blank encountered, not the last one, when there is more than one.

As you know, my goal is to extract a code string. For that I am using this code:

QString town_code;
QRegularExpression re1("code=(.+)\\s");
QRegularExpressionMatch match = re1.match (a_text);
hasMatch = match.hasMatch();
if (hasMatch) {
town_code = match.captured(1);
............
............
............
}

This code works in some cases and fails in others.

The result is correct for:

ARANTZAZU (OÑATI) code=onati-id20059 ville=OÑATI

The town_code variable contains the correct value:

onati-id20059

The result is incorrect for:

OTSAURTE (ZEGAMA) code=Zegama-id20025 ville=ZEGAMA tameteo=http://www.tameteo.com/meteo_Zegama-Europe-Espagne-Guipuscoa--1-3385.html

The town_code variable contains the incorrect value:

Zegama-id20025 ville=ZEGAMA

Any help is welcome.

CarloM · October 17, 2013, 7:17am

I haven't used Qt, but I believe it's:
code= - the literal code= , followed by
(.+) - one or more of any character, captured as group #1, followed by
\\s - whitespace (normally just \s , but I assume it's escaped for C++ strings)

The parentheses are there so you can extract that matched part afterwards (presumably with match.captured(1) ).

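EDIT: Your regex is matching too much. Quantifiers such as + are greedy by default: they consume as much of the string as they can while still letting the whole pattern match. In this case, that's everything from code= up to (but excluding) the last whitespace on the line, so only the final word is left out.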

Rather than using (.+) to match any character, try using ([^\\s]+) to match non-whitespace characters (as in the earlier sed suggestions).
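
The difference can be reproduced from the shell on the failing line, as a rough stand-in for the Qt code. This sketch assumes that sed -E (POSIX extended syntax) behaves like QRegularExpression for these two simple patterns, which holds for this input but is not a general equivalence; [[:space:]] plays the role of \s, and the variable names are illustrative.

```shell
line='OTSAURTE (ZEGAMA) code=Zegama-id20025 ville=ZEGAMA tameteo=http://www.tameteo.com/meteo_Zegama-Europe-Espagne-Guipuscoa--1-3385.html'

# Greedy (.+) followed by whitespace: the group grows until the LAST blank.
greedy=$(printf '%s\n' "$line" | sed -nE 's/.*code=(.+)[[:space:]].*/\1/p')
echo "greedy:  $greedy"

# Negated class: the group cannot cross a blank, so it stops at the FIRST one.
bounded=$(printf '%s\n' "$line" | sed -nE 's/.*code=([^[:space:]]+).*/\1/p')
echo "bounded: $bounded"
```

The first result is the same wrong value reported in the question (Zegama-id20025 ville=ZEGAMA), and the second is the intended Zegama-id20025.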

jcdole · October 17, 2013, 8:59am

Great.

It seems to work on the cases I tried.
Thank you very much.

But could you explain the syntax?

CarloM · October 17, 2013, 9:49am

[^\\s] defines an inverse character set - i.e. match any character except those specified (in this case, match anything except whitespace).

jcdole · November 6, 2013, 1:31pm

Thank you for taking the time to help.