Using sed's hold-space to filter file contents

ivanbrennan · August 1, 2017, 11:25pm

I wrote an awk script to filter "uninteresting" commands from my ~/.bash_history. (I know about HISTIGNORE, but I don't want to exclude these commands from my current session's history; I just want to avoid persisting them across sessions.)

The history file can contain multi-line entries with embedded newlines, and entries are separated by timestamps. Given an input file like:

#1501304269
git stash
#1501304270
ls
#1501304318
ls | while IFS= read line; do
echo 'line is: ' $line
done

the script filters out single-line ls, man, and cat commands, producing:

#1501304269
git stash
#1501304318
ls | while IFS= read line; do
echo 'line is: ' $line
done

Notice that multi-line entries are unfiltered -- I figure if they're interesting enough to warrant multiple lines, they're worth remembering.

I've been reading about sed's multiline capabilities and I'm curious how its hold space and pattern space might be manipulated to achieve the same filtering as my awk script. Rather than use GNU sed's -z flag to treat the whole file as a single massive pattern space, I'm looking for a solution that uses commands such as h, H, x, G, N, etc. to accumulate lines in the hold space and swap/delete lines as necessary.

Here's the Awk script:

/^#[[:digit:]]{10}$/ {
  timestamp = $0
  histentry = ""
  next
}
$1 ~ /^(ls?|man|cat)$/ {
  if (! timestamp) {
    print
  } else {
    histentry = $0
  }
  next
}
timestamp {
  print timestamp
  timestamp = ""
}
histentry {
  print histentry
  histentry = ""
}
{ print }

Scrutinizer · August 2, 2017, 3:34am

Since you are only excluding single-line commands, you could just peek ahead one line using the N command and only leave out those entries:

sed '/^#[[:digit:]]\{10\}$/{N; /\nls$/d; /\nman$/d; /\ncat$/d;}' file

or, with GNU sed or BSD sed:

sed -E '/^#[[:digit:]]{10}$/{N; /\n(ls|man|cat)$/d;}' file

ivanbrennan · August 2, 2017, 9:29am

Hm... peeking ahead one line won't let me distinguish a single-line command (which should be excluded if it contains ls|cat|man) from the beginning of a multi-line command (which should be kept even if it contains ls|cat|man).

For example, if the exclusion pattern was "xxx", the following input,

#0000000001
aaa
#0000000002
xxx
bbb
#0000000003
ccc

would result in this output:

#0000000001
aaa
bbb
#0000000003
ccc

The second record should have passed through unmodified since it has multiple lines, but instead its head was removed and the rest got tacked onto the previous record.
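
The behaviour described above can be reproduced with a short sketch. The function name peek_filter is made up for illustration, and "xxx" stands in for the real exclusion pattern, as in the example input:

```shell
# Look-ahead approach with the placeholder exclusion pattern "xxx".
# peek_filter is a hypothetical helper name, not part of the thread.
peek_filter() {
  sed '/^#[[:digit:]]\{10\}$/{N; /\nxxx$/d;}' "$@"
}

# Feed it the seven-line example input from this post.
printf '%s\n' '#0000000001' aaa '#0000000002' xxx bbb '#0000000003' ccc |
  peek_filter
```

Because N only joins the timestamp with the one line after it, the d command fires before sed has seen that "bbb" belongs to the same entry, so "bbb" is printed on its own under the first record.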

I was thinking of something like this: when you reach a timestamp, exchange pattern space with hold space (x). Now hold space is ready to start accumulating the oncoming entry and pattern space holds whichever entry was previously accumulated. Since I have the full entry, I should be able to perform whatever substitution is necessary on pattern space to filter out commands I'm not interested in. It gets a bit complicated when trying to correctly handle the first and last lines of the file.

My latest failed attempt:

1,/^#[[:digit:]]{10}$/ {
  /^#[[:digit:]]{10}$/! {
    p
    d
  }
}

/^#[[:digit:]]{10}$/ {
  x
  /^$/ d
  /\n(ls?|cat|man)([^[:alnum:]][[:print:]]*)?$/ d
  p
}

/^#[[:digit:]]{10}$/ !{
  H
  d
}

$ {
  x
  /\n(ls?|cat|man)([^[:alnum:]][[:print:]]*)?$/ d
  p
}

ctac · August 2, 2017, 5:57pm

Hi,
Your awk script doesn't work here (mawk); I had to change the first line to:

/^#[0-9][0-9]*$/ {

My system:

uname -a
Linux debian-linux 4.11.0-1-amd64 #1 SMP Debian 4.11.6-1 (2017-06-19) x86_64 GNU/Linux
awk -Wv
mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan
compiled limits:
max NF             32767
sprintf buffer      2040

You can try this with sed; I think it works:

sed -n '/^#[0-9]\{10\}$/{:A;/\(ls *$\)\|\(\ncat \)\|\(\nman \)/b;$p;N;/\n#[0-9]\{10\}$/!bA;h;s/\(^.*\)\(\n.*$\)/\1/;p;x;s/.*\n//;bA}' lefile

It matches cat and man followed by a space (i.e. cat lefile or man tr).
It's harder with ls.

ivanbrennan · August 2, 2017, 8:20pm

Apparently mawk doesn't support regex interval expressions such as {10}, and maybe not POSIX character classes either.
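
For awk implementations without those features, here is a sketch of the same script with only the timestamp rule changed. It assumes a timestamp line is always "#" followed by exactly 10 digits, so a length check plus /^#[0-9]+$/ can stand in for [[:digit:]]{10}. The wrapper name filter_awk is made up for illustration:

```shell
# Same logic as the original awk script. Only the first rule differs:
# it avoids {10} and [[:digit:]] (assumption: "#" plus exactly 10 digits).
# filter_awk is a hypothetical wrapper name.
filter_awk() {
  awk '
    # 11 characters in total: the "#" and ten digits.
    length($0) == 11 && /^#[0-9]+$/ {
      timestamp = $0
      histentry = ""
      next
    }
    # Candidate for removal: remember it until the next line is seen.
    $1 ~ /^(ls?|man|cat)$/ {
      if (! timestamp) { print } else { histentry = $0 }
      next
    }
    # Any other line proves the entry is worth keeping: flush what was held.
    timestamp { print timestamp; timestamp = "" }
    histentry { print histentry; histentry = "" }
    { print }
  ' "$@"
}

# Example run (histfile is a placeholder path):
# filter_awk histfile
```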

I couldn't get the desired results from your sed snippet. Not sure why though.

I finally came up with something that works. It's nasty, and I don't doubt there's a better way, but it was satisfying to at least get something working.

$ {
  1 h
  1!H
  x
  /^#[[:digit:]]{10}\n(ls?|cat|man)([^[:alnum:]][[:print:]]*)?$/ d
  p
}

/^#[[:digit:]]{10}$/ !{
  1 h
  1!H
  d
}

/^#[[:digit:]]{10}$/ {
  x
  /^$/ d
  /^#[[:digit:]]{10}$/ d
  /^#[[:digit:]]{10}\n(ls?|cat|man)([^[:alnum:]][[:print:]]*)?$/ d
}

I benchmarked it against my original awk script, as well as against the following gsed script:

gsed -z -E 's/(#[0-9]{10}\n(cat|ls?|man)([^[:alnum:]][^\n]*)?\n)+(#[0-9]{10}\n|$)/\4/g' histfile

Run on a ~50,000 line file, I get the following results:

sed: 80 milliseconds
awk: 70 milliseconds
gsed: 60 milliseconds

Scrutinizer · August 3, 2017, 2:10am

Indeed, the mawk version that gets installed by distributions supports neither. I think the latest version does, but you would need to get the source and compile it yourself.

--
Your approach seems to also leave out one-line commands that do not contain ls, man, or cat.

MadeInGermany · August 3, 2017, 2:14am

Because d jumps directly to the next cycle, and the input line is not modified in the conditional branch, the code that follows the negated block does not need a condition of its own:

/^#[[:digit:]]{10}$/ !{
  1 h
  1!H
  d
}

x
/^$/ d
/^#[[:digit:]]{10}$/ d
/^#[[:digit:]]{10}\n(ls?|cat|man)([^[:alnum:]][[:print:]]*)?$/ d
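
Pulling the ideas in this thread together, here is a minimal self-contained sketch of the hold-space approach. It is not a drop-in replacement for the scripts above, and it rests on several assumptions:

- the file starts with a timestamp line;
- sed accepts -E (GNU or BSD sed);
- the simplified pattern (ls|man|cat), optionally followed by a space and arguments, is close enough to the real exclusion pattern.

The function name filter_history is made up for illustration. It runs sed with -n and prints explicitly, and it handles the last line by falling through to the same checks instead of using a separate $ block.

```shell
# Hold-space filter, sketched under the assumptions listed above.
# filter_history is a hypothetical helper name.
filter_history() {
  sed -n -E '
    # Not a timestamp: append the line to the entry held in hold space.
    /^#[0-9]{10}$/!{
      H
      # Unless this is the last line, start the next cycle right away.
      $!d
    }
    # Reached on a timestamp line or on the last line of input:
    # swap, so pattern space holds the completed entry.
    x
    # No newline means nothing yet or a bare timestamp: drop it.
    /\n/!d
    # Exactly one newline means a single-line command: drop ls/man/cat.
    /\n.*\n/!{
      /\n(ls|man|cat)( .*)?$/d
    }
    # Everything else is kept.
    p
  ' "$@"
}

# Run it on the sample history from the first post.
filter_history <<'EOF'
#1501304269
git stash
#1501304270
ls
#1501304318
ls | while IFS= read line; do
echo 'line is: ' $line
done
EOF
```

A trailing timestamp with no command after it is dropped, which matches what the awk script does.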