Hello,
I am working with the Google Ngram data set, which is hundreds of GB in size. Before processing it in Java, I want to filter it with a shell script.
Here is a sample line in the file:
2.55 1.57 1992 10 20 30
The first two fields (2.55 and 1.57) are separated by a space, and the rest are separated by tabs. I need all the lines where:
a) The second tab-separated field (1992 in this case) is greater than 1990
b) Both elements of the first tab-separated field (2.55 and 1.57 in this case) satisfy two conditions:
i) Each consists only of letters (no digits, no punctuation)
ii) Neither is present in an ArrayList of strings (say 'list').
Can anyone help?
Thanks,
Shekhar
---------- Post updated at 11:56 PM ---------- Previous update was at 11:51 PM ----------
I have 300 files, each containing tens of millions of such lines (total data size: more than 500 gigabytes), so I need an efficient method to do this. That is the only reason I want to do this in the shell; otherwise I could easily have done it in Java.
---------- Post updated 08-30-12 at 12:06 AM ---------- Previous update was 08-29-12 at 11:56 PM ----------
This is what I have so far.
For 2nd tab field > 1990:
cat InputFile | awk -F"\t" '{if ($2 > 1990) print $0}' > OutputFile
For the 1st tab field containing only letters:
cat InputFile | awk -F"\t" '{if ($1 == "[a-zA-Z ]+") print $0}' > OutputFile
But this is not working. How does pattern matching work in awk when it is used inside 'if' to match against a field?
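A small sketch of the difference, using a made-up one-line input: in awk, == compares the field with the literal string on the right, while ~ treats the right-hand side as a regular expression. The variable names below are only for illustration.

```shell
# One sample record: "aa bb" <TAB> "1991" (made-up test data).
line=$(printf 'aa bb\t1991\n')

# == is a plain string comparison: $1 would have to be the literal
# text "[a-zA-Z ]+" to match, so this selects nothing.
literal=$(printf '%s\n' "$line" | awk -F'\t' '$1 == "[a-zA-Z ]+"')

# ~ is the regex match operator; ^ and $ anchor the pattern so the
# whole field must consist of letters and spaces.
regex=$(printf '%s\n' "$line" | awk -F'\t' '$1 ~ /^[a-zA-Z ]+$/')

echo "string comparison selected: [$literal]"
echo "regex match selected:       [$regex]"
```

The first command selects no lines, the second selects the sample record.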
---------- Post updated at 12:26 AM ---------- Previous update was at 12:06 AM ----------
I have gotten this far:
awk -F"\t" '{if ($1 ~ /^[a-zA-Z ]+$/ && $2 > 1990) print}' InputFile > OutputFile
The last thing remaining is to check that neither of the space-separated words in the first tab field is present in a list. For example:
aa bb 1991 10 15 20
I have a list of strings and want to check that it contains neither of the two fields 'aa' and 'bb'. I need to add this check to the code above.
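One possible way to add that check, as a rough sketch only: it assumes the list is stored in a plain text file with one word per line (list.txt here is a hypothetical name, as are input.tsv, output.tsv and the sample records). The list is read first into an awk array, so each data line needs only two hash lookups. The regex is also a little stricter than the one above: it requires exactly two alphabetic words separated by a single space.

```shell
#!/bin/sh
# Hypothetical file names and sample data, only to make the sketch runnable.
workdir=$(mktemp -d)

# Assumed list format: one word per line.
printf 'aa\nthe\n' > "$workdir/list.txt"

# Sample records: "word1 word2<TAB>year<TAB>counts..."
printf 'aa bb\t1991\t10\t15\t20\n' >  "$workdir/input.tsv"
printf 'cc dd\t1992\t10\t20\t30\n' >> "$workdir/input.tsv"
printf 'ee ff\t1985\t1\t2\t3\n'    >> "$workdir/input.tsv"
printf 'g1 hh\t1995\t1\t2\t3\n'    >> "$workdir/input.tsv"

awk -F'\t' '
    # First file (the list): store every word as an array key.
    # Note: NR == FNR assumes the list file is not empty.
    NR == FNR { skip[$1]; next }

    # Data file(s): year check, then exactly two alphabetic words.
    $2 > 1990 && $1 ~ /^[a-zA-Z]+ [a-zA-Z]+$/ {
        split($1, w, " ")
        # Keep the line only if neither word is in the list.
        if (!(w[1] in skip) && !(w[2] in skip)) print
    }
' "$workdir/list.txt" "$workdir/input.tsv" > "$workdir/output.tsv"

cat "$workdir/output.tsv"
```

With the sample data, only the 'cc dd' line is kept: 'aa bb' is dropped because 'aa' is in the list, 'ee ff' because of the year, and 'g1 hh' because of the digit. Several data files can be passed after the list file in one awk call, which avoids re-reading the list for each of the 300 files.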
Thanks.