Comparing alternate lines of code

cabled · November 12, 2018, 9:56am

Hi gents,

I have only a passing familiarity with Linux/shell at this point, so please forgive the simple question.

I have text files that have lines something like the following:

a
b
c
d
d
d
e
f
e
f
e
f
a
b
c
d
e
f
etc

I'm trying to remove 2 types of duplicates while preserving line order/format.
1) consecutive duplicate lines
2) alternate lines if they are duplicate

For removing type 1 lines,

cat "$file" | uniq > ./output/"$file"

gives me an output file that looks like

a
b
c
d
e
f
e
f
e
f
a
b
c
d
e
f
etc

which is fine.

I'm kinda stumped about type 2 duplicates though...

Ideally I'd like to get:

a
b
c
d
e
f
a
b
c
d
e
f

I'm not entirely sure how to compare alternate lines... Any assistance is appreciated.

RudiC · November 12, 2018, 10:06am

How about

awk '!($0 == LAST1 || $0 == LAST2); {LAST2 = LAST1; LAST1 = $0}' file
a
b
c
d
e
f
a
b
c
d
e
f

cabled · November 12, 2018, 10:32am

Hi RudiC,

Thanks for the assistance. That works wonderfully. May I ask for some further guidance breaking down the command so I may understand it?

Are we printing the current line only if it differs from both of the two immediately preceding lines (LAST1 and LAST2), and then shifting those variables along for the next iteration?

By extension, this also deals with the type 1 duplicates, yes?

RudiC · November 12, 2018, 10:43am

Your analysis / function description is correct. A line is printed only if it matches neither of the two lines before it.

Type 1 duplicates are caught by the LAST1 comparison, type 2 by LAST2. The second block then runs for every line, printed or not: LAST1 is shifted into LAST2 and the current line becomes the new LAST1.
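For reference, the one-liner can be spread over several lines with comments. This is only a sketch of the same logic; the wrapper name dedup2 is made up for illustration and is not part of the original command.

```shell
# dedup2: hypothetical wrapper; reads the files given as arguments, or stdin.
dedup2() {
  awk '
    # Pattern with no action: print the line unless it equals
    # the previous line (LAST1) or the one before that (LAST2).
    !($0 == LAST1 || $0 == LAST2)

    # Action with no pattern: runs for every line, printed or not,
    # and shifts the two-line history by one.
    { LAST2 = LAST1; LAST1 = $0 }
  ' "$@"
}

# Sample input from the first post, one letter per line.
printf '%s\n' a b c d d d e f e f e f a b c d e f | dedup2
```

The awk program consists of two independent rules, which is why the history is updated even for lines that are suppressed.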

Peasant · November 12, 2018, 11:36am

Here is my attempt to walk through it using sample input.
The input lines are numbered here to make them easier to follow; the numbers are not in the actual file the program processes.

#
# The condition is met on line 1, so the line is printed.
# LAST2 stays empty; LAST1 is set to the current line, i.e. $0.
#

1 a

#
# The condition is met on line 2 as well.
# LAST2 is set to LAST1 (the previous line), LAST1 to the current line, i.e. $0.
# The same happens up to line 6: the condition is met each time, and LAST1 / LAST2 are updated accordingly.
#

2 b
3 c
4 d
5 e
6 f

#
# At this point, on reaching line 7, LAST1 is "f" and LAST2 is "e".
# The condition is not met for lines 7 to 10, so none of them is printed.
# LAST1 / LAST2 are still reassigned on every line, but since the input only alternates e / f, they just swap values.
#

7 e
8 f
9 e
10 f

#
# On line 11 the condition is met again: "a" matches neither LAST1 ("f") nor LAST2 ("e").
# After printing, LAST2 becomes "f" and LAST1 becomes "a", i.e. $0, the current line.
# The program continues to operate as above.
#

11 a
12 b
13 c
14 d
15 e
16 f

Hopefully that is correct.
Regards
Peasant.
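A walk-through like this can be checked by letting awk print the state itself. The following is a rough sketch, not part of the original solution: trace2 and the output format are invented here, and it shows the values of LAST1 and LAST2 as they are when each line is tested, before they are updated.

```shell
# trace2: hypothetical helper that shows the decision made for each line.
trace2() {
  awk '
    {
      # Decide before the history is updated, as the one-liner does.
      verdict = ($0 == LAST1 || $0 == LAST2) ? "skip" : "print"
      printf "%d %s LAST1=[%s] LAST2=[%s] %s\n", NR, $0, LAST1, LAST2, verdict
      # The history is shifted for every line, skipped or not.
      LAST2 = LAST1; LAST1 = $0
    }
  ' "$@"
}

# Same 16 lines as in the walk-through above.
printf '%s\n' a b c d e f e f e f a b c d e f | trace2
```

For lines 7 to 10 the trace shows LAST1 and LAST2 trading the values "e" and "f" on each step, while every one of those lines is marked skip.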

rdrtx1 · November 12, 2018, 6:10pm

awk '$0 != line[NR-2] && $0 != line[NR-1]; {line[NR]=$0}' infile
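This variant stores every line in the array line, indexed by line number, so memory use grows with the size of the input. A possible tweak, sketched here under the assumption that only the last two lines ever need to be looked up, is to delete entries once they fall out of range. The function name dedup_arr is made up for the example.

```shell
# dedup_arr: hypothetical wrapper around the array-based variant.
dedup_arr() {
  awk '
    $0 != line[NR-2] && $0 != line[NR-1]   # print unless seen in the last two lines
    {
      line[NR] = $0        # remember the current line
      delete line[NR-2]    # the next line only needs NR-1 and NR
    }
  ' "$@"
}

printf '%s\n' a b c d d d e f e f e f a b c d e f | dedup_arr
```

The output is the same as before; the only difference is that the array never holds more than a few entries.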

MadeInGermany · November 14, 2018, 11:27am

The solution is a lookup buffer of two, implemented by the two variables LAST1 and LAST2.
The following has a configurable buffer depth:

awk '
{
  # preset: print
  prt=1
  # do not print if found in buf
  for (i=1; i<=d; i++) if (buf[i%d]==$0) {
    prt=0
    break
  }
  if (prt==1) print $0
  buf[NR%d]=$0
}
' d=2 file

With d=1 it will detect the repetition d d but not e f e f.
With d=3 it would also detect a repetition such as g h i g h i ...
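To try out different depths quickly, the same program can be wrapped in a small shell function. This is just a sketch: the name dedup_depth is invented, and the depth is passed with -v rather than as a trailing d=2 operand, so that it also works when reading from a pipe.

```shell
# dedup_depth: hypothetical wrapper; usage: dedup_depth DEPTH [file...]
dedup_depth() {
  depth=$1
  shift
  awk -v d="$depth" '
  {
    prt = 1
    # look for the current line among the last d lines
    for (i = 1; i <= d; i++) if (buf[i % d] == $0) { prt = 0; break }
    if (prt == 1) print $0
    # ring buffer: overwrite the oldest slot
    buf[NR % d] = $0
  }' "$@"
}

# d=2 leaves g h i g h i untouched, d=3 reduces it to g h i.
printf '%s\n' g h i g h i | dedup_depth 2
printf '%s\n' g h i g h i | dedup_depth 3
```

On the sample input from the first post, d=1 behaves like uniq and keeps the e f e f e f run, while d=2 and d=3 both give a b c d e f a b c d e f.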