How to ignore characters and print only numbers using awk?

sdf · May 28, 2013, 4:11pm

Input:

ak=70&cat15481=lot=6991901">Kaschau (1820-1840)
ak=7078&cat15482=lot=70121">Principauté (1940-1993)
ak=709&cat=lot15484=70183944">Arubas (4543-5043)

Output:

70 15481 6991901
7078 15482 70121
709 15484 70183944

Yoda · May 28, 2013, 4:38pm

Here is an awk approach:

awk '{gsub(/[[:punct:]]+|[[:alpha:]]+/," ");sub(/^[ ]*/,x);NF-=2}1' file

Jotne · May 28, 2013, 4:44pm

awk -F"[a-z=&\"]*" '{print $2,$3,$4}' infile
70 15481 6991901
7078 15482 70121
709 15484 70183944

@Yoda
I get this with your solution:

70 15481 6991901
7078 15482 70121 �
709 15484 70183944

sdf · May 28, 2013, 4:50pm

Thanks, both work perfectly!

Yoda · May 28, 2013, 4:54pm

@Jotne, that looks like a control character.

I didn't get it when I copied the input to a file. By the way, it can be removed using the character class [:cntrl:].

alister · May 28, 2013, 5:06pm

Without awk:

sed 's/[^[:digit:]]/ /g; s/  */ /g; s/^  *//'

If leading whitespace is acceptable:

tr -sc '[:digit:]' '[ *]'

Regards,
Alister

jlliagre · May 28, 2013, 5:21pm

@Yoda é is not a control character but an ordinary accented letter. Your script handles it properly as long as the character set is correct.

@Jotne You need to set the character set to UTF-8:

$ LC_ALL=C awk '{gsub(/[[:punct:]]+|[[:alpha:]]+/," ");sub(/^[ ]*/,x);NF-=2}1' file
70 15481 6991901
7078 15482 70121 �
709 15484 70183944
$ LC_ALL=en_US.UTF8 awk '{gsub(/[[:punct:]]+|[[:alpha:]]+/," ");sub(/^[ ]*/,x);NF-=2}1' file
70 15481 6991901
7078 15482 70121
709 15484 70183944

Here is a simpler way that isn't affected by the character set issue:

awk '{gsub("[^[:digit:]]+"," ");print $1,$2,$3}' file

alister · May 28, 2013, 5:35pm

It's not a control character. It's e-acute and it's present in the OP's text.

Jotne's probably using a locale whose alpha class does not include that letter, e.g. the POSIX locale.

Further, while e-acute is a single byte in ISO 8859-1, it is multibyte in UTF-8.

Your approach could be made simpler and more robust by using the complement of the digit class.
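
The byte counts and the digit-complement idea can be checked from the shell. This is only a minimal sketch: the octal escapes \303\251 and \351 are the UTF-8 and ISO 8859-1 encodings of e-acute, and digits_only is a hypothetical helper name, not something from the posts above.

```shell
# e-acute is two bytes in UTF-8 (octal 303 251) and one byte in ISO 8859-1 (octal 351).
printf '\303\251' | wc -c
printf '\351' | wc -c

# digits_only (hypothetical helper): replace every run of non-digits with a
# single space, then print the first three numbers. No [:alpha:] class is
# involved, so the POSIX (C) locale is forced here on purpose.
digits_only() {
    LC_ALL=C awk '{ gsub(/[^0-9]+/, " "); print $1, $2, $3 }'
}

printf 'ak=7078&cat15482=lot=70121">Principaut\303\251 (1940-1993)\n' | digits_only
```

The two wc calls should report 2 and 1, and the last command should print 7078 15482 70121 with no stray byte, even though the locale knows nothing about accented letters.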

Regards,
Alister

---------- Post updated at 05:35 PM ---------- Previous update was at 05:27 PM ----------

I see that I was beaten to it by 6 minutes. Obviously, I concur with jlliagre.

Regards,
Alister

Jotne · May 29, 2013, 2:10am

Or use my solution in post #3

jlliagre · May 29, 2013, 4:48am

Indeed, an astute way of using delimiters.
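
It may not be obvious why the numbers land in $2, $3 and $4 rather than $1 to $3: each record starts with "ak=", which itself matches the separator, so $1 is empty. Here is a minimal sketch that prints every field of one sample line. The helper name show_fields is hypothetical, and the separator uses + instead of the * from post #3, on the assumption that not every awk ignores empty matches in the field separator.

```shell
# show_fields (hypothetical helper): split stdin on runs of lowercase
# letters, '=', '&' and '"', then print each field with its index.
# LC_ALL=C keeps [a-z] limited to the ASCII lowercase letters.
show_fields() {
    LC_ALL=C awk -F'[a-z=&"]+' '{ for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i }'
}

printf '%s\n' 'ak=70&cat15481=lot=6991901">Kaschau (1820-1840)' | show_fields
```

On the sample line this should show an empty $1, then 70, 15481 and 6991901 in $2 to $4; the remaining fields hold the leftovers that the print statement never touches.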

Here is the shortest awk-based answer I can think of, using Yoda's tricks:

awk '{gsub("[^0-9]+"," ");NF-=2}1' infile

MadeInGermany · May 30, 2013, 12:38pm

I like the tr solution.
But it needs \n in the set so that the lines are not folded into one.
To retain the original number positions:

tr -c '[:digit:]\n' '[ *]' < file

To squeeze runs of spaces into one:

tr -sc '[:digit:]\n' '[ *]' < file
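
The folding is easy to see on a two-line sample. This is a small sketch with made-up input; sample is a hypothetical helper that just prints two shortened records.

```shell
# sample (hypothetical helper): two shortened input records.
sample() { printf 'ak=70&cat15481\nak=7078&cat15482\n'; }

# Without \n in the set, the newlines are turned into spaces too,
# so both records end up on one line.
sample | tr -sc '[:digit:]' '[ *]'
echo    # the output above has no trailing newline

# With \n in the set, each record stays on its own line.
sample | tr -sc '[:digit:]\n' '[ *]'
```

The first command should print " 70 15481 7078 15482 " on a single line, while the second keeps two lines, each with one leading space.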

BTW, with sed these are:

sed 's/[^[:digit:]]/ /g' file

sed 's/[^[:digit:]]\{1,\}/ /g' file

In \{n,m\}, a missing m means "at least n".
In ERE this is written {n,m}, but some awk implementations do not support it (and {1,} is the same as + anyway).
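
As a quick check of that equivalence, the BRE interval in sed and the ERE + in awk should give identical output. A minimal sketch on a made-up sample string (the variable name sample is just for illustration):

```shell
sample='ab12cde345f6'

# BRE interval: \{1,\} means "one or more" of the preceding bracket expression.
printf '%s\n' "$sample" | sed 's/[^[:digit:]]\{1,\}/ /g'

# ERE + says the same thing and avoids {n,m}, which some awks lack.
# [^0-9] is used instead of [^[:digit:]] in case the awk has no POSIX classes.
printf '%s\n' "$sample" | awk '{ gsub(/[^0-9]+/, " "); print }'
```

Both commands should print " 12 345 6", with one space replacing each run of non-digits.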

alister · May 30, 2013, 12:59pm

Whoops. Nice catch. Thank you for pointing it out.

Regards,
Alister