String matching using awk

shekhar2010us · August 30, 2012, 12:26am

Hello,

I am working with google ngram data set which is of size 100s of gb. Before using it with Java, I wanted to filter it out using shell script.

Here is a sample line in the file:

2.55  1.57        1992        10        20        30

The first two fields (2.55 and 1.57) are separated by a space and the rest are separated by tabs. I need all the lines where:
a) Tab separated second field (1992 in this case) is greater than 1990
b) Both elements in the first tab fields (2.55 and 1.57 in this case) should satisfy two conditions:
i) Both should be only alphabets (no numbers, no punctuations)
ii) None of them should be present in an arraylist of strings (say 'list').

Can anyone help.

Thanks,
Shekhar

---------- Post updated at 11:56 PM ---------- Previous update was at 11:51 PM ----------

I have 300 files each containing tens millions of such lines (total data size: more than 500 giga bytes), so I need an efficient method to do this. Basically, that's the only reason I wanted shell to do this, otherwise I could have easily done this in Java.

---------- Post updated 08-30-12 at 12:06 AM ---------- Previous update was 08-29-12 at 11:56 PM ----------

I have gotten so far.

For 2nd tab field > 1990:

cat InputFile | awk -F"\t" '{if ($2 > 1990) print $0}' > OutputFile

For 1st tab field only alphabets

cat InputFile | awk -F"\t" '{if ($1 == "[a-zA-Z ]+") print $0}' > OutputFile

But this is not working. How does pattern matching works in awk when using inside 'if' to match with a field?

---------- Post updated at 12:26 AM ---------- Previous update was at 12:06 AM ----------

I have gotten this far:

awk -F"\t" '{if ($1 ~ /^[a-zA-Z ]+$/ && $2 > 1990) print}' InputFile > OutputFile

The last thing remaining is checking if both the space separated field from the first tab field is not present in a list.

aa bb       1991      10       15        20

I have a list of strings and want to check if list does not contain any of the two fields 'aa' and 'bb'.. Have to add this check in the code above...

Thanks.

guruprasadpr · August 30, 2012, 12:54am

Hi

Appreciate your try on this.

Assuming the list file contains the list of strings :

$ cat list
aa
bb
cc

$ cat file
aa bb       1991      10       15        20
1e ff       1992      10       15        20
cc ff       1990      10       15        20
ee ff       1994      10       15        20

Output:

$ awk 'NR==FNR{a[$0]=1;next}($1 ~ /^[a-z]+$/ && $2 ~ /^[a-z]+$/ && (!a[$1]) && (!a[$2]) && $3>1990)' list file
ee ff       1994      10       15        20

Is this what you wanted?

Guru.

raj_saini20 · August 30, 2012, 12:59am

what do you mean by not in a list. For which list you are talking.

shekhar2010us · August 30, 2012, 1:41am

@guruprasadpr

Awesome answer. I knew 'awk' can solve this problem. To filter 1 file, java took 27 minutes and shell took 12 seconds

I added a filter of length of both the first fields and my final answer is:

awk 'NR==FNR{a[$0]=1;next}($1 ~ /^[a-z]+$/ && $2 ~ /^[a-z]+$/ && (!a[$1]) && (!a[$2]) && $3>1990 && length($1) > 2 && length($2) > 2 )' list file

Thanks a lot.

---------- Post updated at 01:41 AM ---------- Previous update was at 01:24 AM ----------

I just ran the awk for a bigger file and it took 154 seconds to get the results.
Thanks to @guruprasadpr..
I am sure Java would have taken hours to do this.