Remove lines with duplicate first field

ajp7701 · March 17, 2012, 6:15pm

Trying to cut down the size of some log files. Now that I write this out, it looks more difficult than I thought it would be.

Need a bash script or command that goes sequentially through all lines of a file, and does this:

If field1 (space-separated) is the number 2012, print the entire line. ALWAYS do this.

If field1 is not the number 2012, follow this rule:

If field1 of the current line is the same as field1 of the previous line, DON'T print the line; otherwise DO print the line.

Another way of saying the rule is:
print the entire line only if field1 of the current line is DIFFERENT from field1 of the previous line (except for 2012: always print lines with 2012 in field1).

balajesuri · March 17, 2012, 10:02pm

Please provide a sample input and desired output.

agama · March 17, 2012, 11:24pm

Per the description:

awk ' /^2012 / { print; x = "";  next; } $1 != x { x = $1; print; } ' input-file >output-file
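
For anyone new to awk, the one-liner above can be spread out with comments. This is only a sketch of the same logic: the wrapper name dedup_field1, the variable name prev, and the six sample lines are made up here for illustration and are not from the original poster's logs.

```shell
# dedup_field1 is a hypothetical wrapper name, not something from the thread.
dedup_field1() {
    awk '
        # Lines starting with "2012 " are always printed. The remembered
        # key is cleared so the next non-2012 line gets printed as well.
        /^2012 / { print; prev = ""; next }

        # Any other line is printed only when its first field differs
        # from the remembered one.
        $1 != prev { prev = $1; print }
    ' "$@"
}

# Invented sample data, piped in on stdin.
printf '%s\n' '2012 a' 'XYZ b' 'XYZ c' '2012 d' 'ABC e' 'ABC f' | dedup_field1
# prints:
# 2012 a
# XYZ b
# 2012 d
# ABC e
```

With a file name as argument (dedup_field1 input-file >output-file) it reads the file instead of stdin.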

ajp7701 · March 18, 2012, 10:42am

---

input

2012 aaa bbb cccc ddd
2012 eee fff ggg hhh
XYZ aaa bbb ccc ddd
XYZ eee fff ggg hhh <---remove this line
2012 hhh iii jjj
2012 hhh iii 123
ABC mmm nnn ooo
ABC ppp qqq rrr <---remove this line
ABC www xxx yyy <---remove this line
2012 mmm nnn ooo
ABC sss ttt uuu

output

2012 aaa bbb cccc ddd
2012 eee fff ggg hhh
XYZ aaa bbb ccc ddd
2012 hhh iii jjj
2012 hhh iii 123
ABC mmm nnn ooo
2012 mmm nnn ooo
ABC sss ttt uuu

---
It keeps lines that start with 2012 but gets rid of lines where field1 is the same as field1 of the previous line.

---
Also, thank you agama for the code; I will check it out on my data. Really appreciate the replies!! Y'all are so awesome!

Franklin52 · March 18, 2012, 10:53am

Try this:

awk '!/^2012/ && $1==s{s=$1;next}{s=$1}1' file

Scrutinizer · March 18, 2012, 10:58am

awk '/^2012/ || $1!=p; {p=$1}' infile
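
One thing to watch: /^2012/ matches any line that merely begins with the characters 2012, so a first field such as 20123 would also always be printed, and a 2012 line with leading blanks would not be recognized. If that matters for your logs, a possible variant tests the first field itself. This is a sketch under that assumption; keep_first_of_run and prev are hypothetical names and the sample lines are invented.

```shell
# keep_first_of_run is a hypothetical name used only for this sketch.
keep_first_of_run() {
    # Print when field 1 is exactly the string 2012, or when field 1
    # differs from the previous line; then remember field 1.
    awk '$1 == "2012" || $1 != prev; { prev = $1 }' "$@"
}

# Invented sample: 20123 is an ordinary key here, and two 2012 lines
# are indented with leading blanks.
printf '%s\n' '2012 a' '20123 b' '20123 c' 'XYZ d' '  2012 e' '  2012 f' | keep_first_of_run
# prints:
# 2012 a
# 20123 b
# XYZ d
#   2012 e
#   2012 f
```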

ajp7701 · March 18, 2012, 11:32am

Ok, these are all great! AWK is great for this. It amazes me how smart and elegant everyone is with awk.

Now, what if I want to print a number for each time a duplicate field1 was removed? For example, the above output would be something like:

2012 aaa bbb cccc ddd
2012 eee fff ggg hhh
XYZ aaa bbb ccc ddd (1)
2012 hhh iii jjj
2012 hhh iii 123
ABC mmm nnn ooo (2)
2012 mmm nnn ooo
ABC sss ttt uuu

Probably going to be difficult and require more of a script than a command. But I am still very pleased with the awk. Thanks everyone SO MUCH!

Scrutinizer · March 18, 2012, 11:49am

awk 'END{if(i) print " (" i ")"; print RS} p==$1 && !/^2012/{i++; next} i{print " (" i ")"; i=0} NR>1{print RS}{p=$1}1' ORS= infile
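
If the one-liner is hard to adapt, the same counting idea can be written as a longer script. The sketch below holds each kept line back until the next kept line (or the end of the file) arrives, so the count can be appended before it is printed. The names dedup_with_count, emit, held, line, prev and n are all made up for this example, and it has only been tried on small invented samples like the one shown.

```shell
# dedup_with_count is a hypothetical name used only for this sketch.
dedup_with_count() {
    awk '
        # Print the held line, appending " (N)" when N duplicates followed it.
        function emit() {
            if (!held) return
            if (n) print line " (" n ")"
            else   print line
            held = 0; n = 0
        }

        # A repeated first field that is not 2012: count it, drop the line.
        held && $1 != "2012" && $1 == prev { n++; next }

        # Otherwise print the previously held line and hold this one.
        { emit(); line = $0; prev = $1; held = 1 }

        # Print the last held line, including a count for a trailing run.
        END { emit() }
    ' "$@"
}

# Invented sample data ending in a run of duplicates.
printf '%s\n' '2012 a' 'XYZ b' 'XYZ c' '2012 d' 'ABC e' 'ABC f' 'ABC g' | dedup_with_count
# prints:
# 2012 a
# XYZ b (1)
# 2012 d
# ABC e (2)
```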