Grep a pattern in a key position a file

DNM_UKN · May 21, 2018, 7:31am

Hi guys,

So I have a file containing data of a marathon. Here's an example what it looks like including the given key:

# key: sex, time, athlete, athlete's nationality, date, city, country
M, 2:30:57.6, Harry Payne, GBR, 1929-07-05, Stamford Bridge, England
M, 2:5:42, Khalid Khannouchi, MAR, 1999-10-24, Chicago, USA
M, 2:5:37.8, Khalid Khannouchi, USA, 2002-04-14, London, USA

My task is to extract all lines that feature a runner's last name starting with a C.
What I did was; I sorted the file according to key 4 and redirected the output to a new file, which gave me the list sorted in alphabetical order according to their last name. I used this command to do this;

sort -k 4 marathon > t21a

.

Does anyone know if there is a command that allows me to grep all entries which have the last name starting with C from the key position these terms are in?
Any help would be much appreciated.

I have tried to grep from a key position but the option I used(from sort) does not exist for grep;

grep -k 4 '^[C,c]' t21b > t21c

Thank you

Scrutinizer · May 21, 2018, 8:11am

Hi,

awk can do that. Try:

awk '$4~/^[Cc]/' t21b

The field separator here is comma-space so I would be inclined to use that:

awk -F', *' '$3~/ [cC]/' t21b

Note that these approaches only work if all persons have exactly two names.

The following approach would use the last name in field 3:

awk -F', *' '{n=split($3,F," ")} F[n]~/^[cC]/ t21b

DNM_UKN · May 21, 2018, 9:56am

I'll try it out right now. Thank you

---------- Post updated at 12:16 PM ---------- Previous update was at 12:12 PM ----------

Heyyy it worked, thank you so much

---------- Post updated at 01:56 PM ---------- Previous update was at 12:16 PM ----------

Hey,
So I'm having issues sorting a data set.
The data set contains entries as such;

# key: sex, time, athlete, athlete's nationality, date, city, country

M, 2:30:57.6, Harry Payne, GBR, 1929-07-05, Stamford Bridge, England
M, 2:5:42, Khalid Khannouchi, MAR, 1999-10-24, Chicago, USA
M, 2:5:37.8, Khalid Khannouchi, USA, 2002-04-14, London, UK
M, 2:4:48, Patrick Makau Musyoki, KEN, 2010-04-11, Rotterdam, Netherlands

I now want to sort this file according to the name of the athlete's, however, I don't want to sort it alphabetically but according to the number of names the athletes have. So for instance, I would like to sort the entire document starting with the entries of all the athletes that have 2-word names, then all the athletes with 3-word names and so on. Is this possible using sort?

I have tried various commands;

sort -k 4-5|4-6 t22 > t22a

sort -k 4,5 | sort -k 4,6 t22 > t22a

sort -kb 4 t22 > t22a

sort -k 4 -b t22 > t22a

But I haven't managed to sort the data in the way I want it.
Does anyone know what I'm doing wrong?
Thanks in advance for your help.

Scrutinizer · May 21, 2018, 4:01pm

Try introducing an extra column 1, that contains the number of words in column 3, then sorting on that column and then removing the extra column, for example:

awk -F', *' 'NR>1 && NF{print split($3,F," "), $0}' OFS='\t' t22 | sort -k1,1n | cut -f2-

NR>1 && NF skips the header and the empty line and split($3,F," ") produces the number of words in column 3