Input:
ak=70&cat15481=lot=6991901">Kaschau (1820-1840)
ak=7078&cat15482=lot=70121">Principauté (1940-1993)
ak=709&cat=lot15484=70183944">Arubas (4543-5043)
Output:
70 15481 6991901
7078 15482 70121
709 15484 70183944
Here is an awk approach:
awk '{gsub(/[[:punct:]]+|[[:alpha:]]+/," ");sub(/^[ ]*/,x);NF-=2}1' file
awk -F"[a-z=&\"]*" '{print $2,$3,$4}' infile
70 15481 6991901
7078 15482 70121
709 15484 70183944
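For anyone wondering why the delimiter version prints $2 to $4 rather than $1 to $3: each line starts with a separator ("ak="), so the first field is empty. Here is a minimal sketch of the same idea, assuming a simpler separator (any run of non-digits, which is not the separator posted above) and a hypothetical helper name, extract_ids:

```shell
# Sample line taken from the input in the first post.
line='ak=70&cat15481=lot=6991901">Kaschau (1820-1840)'

# Every run of non-digit characters is one field separator.
# The line begins with a separator ("ak="), so $1 is empty and the
# three wanted numbers land in $2, $3 and $4.
extract_ids() {
    awk -F'[^0-9]+' '{ print $2, $3, $4 }'
}

printf '%s\n' "$line" | extract_ids
# prints: 70 15481 6991901
```

The years in parentheses end up in $5 and $6 and are simply not printed.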
@Yoda, this is what I get with your solution:
70 15481 6991901
7078 15482 70121 é
709 15484 70183944
Thanks, both work perfectly!
@Jotne, that looks like a control character.
I didn't get it when I copied it to a file. By the way, it can be removed using the character class [:cntrl:].
Without awk:
sed 's/[^[:digit:]]/ /g; s/  */ /g; s/^ *//'
If leading whitespace is acceptable:
tr -sc '[:digit:]' '[ *]'
Regards,
Alister
@Yoda, é is not a control character but an ordinary accented letter. Your script handles it properly as long as the character set is correct.
@Jotne You need to set the character set to UTF8:
$ LC_ALL=C awk '{gsub(/[[:punct:]]+|[[:alpha:]]+/," ");sub(/^[ ]*/,x);NF-=2}1' file
70 15481 6991901
7078 15482 70121 é
709 15484 70183944
$ LC_ALL=en_US.UTF8 awk '{gsub(/[[:punct:]]+|[[:alpha:]]+/," ");sub(/^[ ]*/,x);NF-=2}1' file
70 15481 6991901
7078 15482 70121
709 15484 70183944
Here is a simpler way that isn't affected by the character set issue:
awk '{gsub("[^[:digit:]]+"," ");print $1,$2,$3}' file
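To illustrate why the digit-complement version does not care about the locale, here is a rough sketch that forces the C locale on purpose. It assumes the e-acute is stored as the two UTF-8 bytes 0xC3 0xA9, and digits_only is a hypothetical helper name:

```shell
# Build the test line with e-acute written as octal escapes (\303\251),
# so the example does not depend on the encoding of this post.
sample=$(printf 'ak=7078&cat15482=lot=70121">Principaut\303\251 (1940-1993)')

# In the C locale those two bytes are not in [:alpha:], which is why
# the [[:alpha:]] version leaves them behind. They are still "not a
# digit", though, so [^0-9]+ replaces them like any other character.
digits_only() {
    LC_ALL=C awk '{ gsub(/[^0-9]+/, " "); print $1, $2, $3 }'
}

printf '%s\n' "$sample" | digits_only
# prints: 7078 15482 70121
```

Printing $1, $2, $3 explicitly also avoids relying on what happens when NF is decremented.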
It's not a control character. It's e-acute and it's present in the OP's text.
Jotne's probably using a locale whose alpha class does not include that letter, e.g. the POSIX locale.
Further, while e-acute is a single byte in iso8859-1, in utf-8 it's multibyte.
Your approach could be made simpler and more robust by using the complement of the digit class.
Regards,
Alister
---------- Post updated at 05:35 PM ---------- Previous update was at 05:27 PM ----------
I see that I was beaten to it by 6 minutes. Obviously, I concur with jlliagre.
Regards,
Alister
Or use my solution in post #3
Indeed, an astute way of using delimiters.
Here is the shortest awk based answer I can think of, using Yoda tricks:
awk '{gsub("[^0-9]+"," ");NF-=2}1' infile
I like the tr solution.
But it needs \n to not fold the lines.
Retain the original number positions:
tr -c '[:digit:]\n' '[ *]' < file
Squeeze the spaces to one:
tr -sc '[:digit:]\n' '[ *]' < file
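A small sketch of the difference, using the two tr commands from this thread on a shortened two-line input. The function names are hypothetical, and the '[ *]' repeat notation is assumed to be supported by your tr (it is in POSIX and in GNU tr):

```shell
# Two input lines, shortened from the sample in the first post.
input='ak=70&cat15481
ak=7078&cat15482'

# Without \n in the set, the newline counts as a non-digit, so it is
# turned into a space and both lines are folded into one.
squeeze_folded() {
    tr -sc '[:digit:]' '[ *]'
}

# With \n in the set, newlines pass through and the lines stay apart.
squeeze_per_line() {
    tr -sc '[:digit:]\n' '[ *]'
}

printf '%s\n' "$input" | squeeze_folded
echo    # the final newline was replaced too, so add one back for display
printf '%s\n' "$input" | squeeze_per_line
```

Both variants leave a leading space on each line, because "ak=" is squeezed to a single space.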
BTW with sed these are
sed 's/[^[:digit:]]/ /g' file
sed 's/[^[:digit:]]\{1,\}/ /g' file
In \{n,m\} a missing m means "at least n".
In ERE this is {n,m}, but some awk implementations do not support it (and {1,} is the same as +).
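To make the BRE/ERE comparison concrete, here is a quick sketch with sed. It assumes your sed accepts -E for extended regular expressions (GNU and BSD sed do):

```shell
# Sample line, shortened from the input in the first post.
line='ak=70&cat15481=lot=6991901'

# BRE: interval with escaped braces, "at least one non-digit".
printf '%s\n' "$line" | sed 's/[^[:digit:]]\{1,\}/ /g'

# ERE: the same interval without the backslashes ...
printf '%s\n' "$line" | sed -E 's/[^[:digit:]]{1,}/ /g'

# ... and the equivalent, shorter + quantifier.
printf '%s\n' "$line" | sed -E 's/[^[:digit:]]+/ /g'
```

All three print the same line, " 70 15481 6991901", with a leading space where "ak=" used to be.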
Woops. Nice catch. Thank you for pointing it out.
Regards,
Alister