Newbie help - parsing through a file

Hello guys,

I am a newbie to all of this and would like some help with a file I have. It's a ~100 MB CSV file with approximately 30 columns.

What I'd like to do is search through the file and REMOVE any line that contains a certain case-insensitive string in any of its columns.

So my file looks like this:

1, Mike Smith, 12, Philly
2, John Smith, Right, New York
3, Tommy, $@, Atlanta
4, Nate New, $@, Atlanta

I'd like to search through this file, and remove any line with the word "new" in it, so my final file would look like this:

1, Mike Smith, 12, Philly
3, Tommy, $@, Atlanta

Perl

perl -ne 'print unless /new/i' lokhtar.example

GNU sed

sed '/new/Id' lokhtar.example

sed

sed '/[Nn][Ee][Ww]/d' lokhtar.example

GNU awk

awk 'BEGIN{IGNORECASE=1} ! /new/' lokhtar.example

grep

grep -iv new lokhtar.example

Ruby

ruby -pe 'next if /new/i' lokhtar.example
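
A portability note on the awk line: IGNORECASE is specific to GNU awk. Other implementations such as mawk or BSD awk treat it as an ordinary variable with no effect, so the filter would silently become case-sensitive there. A minimal sketch of a variant that should work in any POSIX awk lowercases each line before matching. The function name drop_new and the temporary file are made up for illustration:

```shell
# Build the sample data from the question in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1, Mike Smith, 12, Philly
2, John Smith, Right, New York
3, Tommy, $@, Atlanta
4, Nate New, $@, Atlanta
EOF

# drop_new FILE (hypothetical helper): print every line that does not
# contain "new" in any mix of upper and lower case.
# tolower() is part of POSIX awk, so no gawk-only IGNORECASE is needed.
drop_new() {
    awk 'tolower($0) !~ /new/' "$1"
}

drop_new "$sample"
```

To keep the result, redirect the output to a different file, e.g. drop_new lokhtar.example > filtered.csv (filtered.csv is a placeholder name).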

Aia has already given you a rather extensive collection of ways to tackle this in various scripting languages and text-processing programs. Still, you have fallen into a trap that most newbies don't take into account. So, in the hope of making you aware of a problem you may already have now, or may run into with similar tasks later, here it goes:

Your problem is the lack of a definition of what constitutes a "word". Take, for example, Aia's grep solution:

grep -iv new lokhtar.example

What this does is search for lines containing the sequence n-e-w (-i makes the search case-insensitive) and filter these lines out (-v). Consider the following lines:

new
bla
Newell

The command will filter out lines 1 and 3, but chances are you only want it to filter out line 1. This is because grep doesn't deal with "words" on an instinctive level like you do; it deals with characters and sequences of characters. If you want it to understand what "word" means, you need to tell it.

Here are a few (naive) tries and why they will not always do what they are supposed to do:

1) We could start by requiring whitespace (blanks or tabs) before and after the word we search for. Instead of "new" we would search for "<blank-or-tab>new<blank-or-tab>". This works in the middle of a line, but fails if the word is the first or last in a line.

2) Look at the following sentences, all containing the word "new" as neither the first nor the last word. The pattern from 1) would still fail to recognize it in any of them:

This is new, this is different!
Something new: a word followed by a colon.
Should composites like "new-old" be considered?
Is "new" in quotes still considered the word we look for?

Bottom line: you will have to decide for yourself what exactly you consider to be "the word 'new'" before you can construct a matching pattern to search for. Whatever you decide can be phrased as a regular expression, but you need to make that decision first.
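
To make the difference concrete, here is a rough sketch comparing three possible definitions on some of the lines above. The last input line ("It is a new day") is added by me for illustration, and the function names are made up. Note that -w is an extension found in GNU and BSD grep rather than one of the options POSIX requires, and that it treats letters, digits and the underscore as word characters, so "new-old" still counts as containing the word:

```shell
# Sample lines taken from the discussion, plus one illustrative extra line.
words=$(mktemp)
cat > "$words" <<'EOF'
new
bla
Newell
This is new, this is different!
Should composites like "new-old" be considered?
It is a new day
EOF

# Definition A: any occurrence of the letters n-e-w, case-insensitive.
match_substring() {
    grep -i 'new' "$1"
}

# Definition B (naive try 1): "new" with a blank or tab on both sides.
match_blank_delimited() {
    grep -iE '[[:blank:]]new[[:blank:]]' "$1"
}

# Definition C: "new" not adjacent to a letter, digit, or underscore.
match_word() {
    grep -iw 'new' "$1"
}

echo 'substring:';       match_substring "$words"
echo 'blank-delimited:'; match_blank_delimited "$words"
echo 'whole word:';      match_word "$words"
```

On this input the substring match also picks up "Newell", the blank-delimited pattern only finds the "new day" line, and -w finds every line except "bla" and "Newell". None of the three is "right"; they are simply different decisions.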

I hope this helps.

bakunin


Thank you guys - you guys are amazing!! Now I just need the inverse of that too, i.e. remove any lines that do NOT have "new" in them. Any one of those scripting/shell languages will be fine.

As for Bakunin, that's very helpful - thank you. I want to keep any line with the letters "new", whether as a whole word or as part of a word; e.g. if it's "newton", I'd want that line kept.

Thank you again guys for your amazing help!

This is amazingly easy: just remove the -v option from the grep command:

grep -iv new  /some/file       # output all lines NOT having "new" in them
grep -i  new  /some/file       # output all lines having "new" in them
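
Both commands only print to the screen. If you want the two halves as files, something along these lines should do. It is only a sketch: split_on_new and the temporary files are names I made up, and the data is the sample from the first post. Never redirect the output to the input file itself, because the shell empties that file before grep gets to read it.

```shell
# Sample data from the first post.
data=$(mktemp)
cat > "$data" <<'EOF'
1, Mike Smith, 12, Philly
2, John Smith, Right, New York
3, Tommy, $@, Atlanta
4, Nate New, $@, Atlanta
EOF

# Separate output files; they must differ from the input file.
with_new=$(mktemp)
without_new=$(mktemp)

# split_on_new INPUT MATCHED UNMATCHED (hypothetical helper):
# lines containing "new" go to MATCHED, all other lines to UNMATCHED.
split_on_new() {
    grep -i  'new' "$1" > "$2"
    grep -iv 'new' "$1" > "$3"
}

split_on_new "$data" "$with_new" "$without_new"
echo 'lines with "new":';    cat "$with_new"
echo 'lines without "new":'; cat "$without_new"
```

This reads the input twice, which is no problem for a file of ~100 MB.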

In general: look at the man page of commands you are not sure about:

man grep         # displays the man page for grep

Unlike Windows, where "help" tries to tell you things you don't want to know, using methods you detest, to reach goals you were not after in the first place (the usual modus operandi), the man pages of UNIX systems are for reference: they will not teach you things you don't know, but if you need a detail you can be sure to find it there. This goes especially for the options of commands and what they do.

I hope this helps.

bakunin