Using sed's hold-space to filter file contents

I wrote an awk script to filter "uninteresting" commands from my ~/.bash_history. (I know about HISTIGNORE, but I don't want to exclude these commands from my current session's history; I just want to avoid persisting them across sessions.)

The history file can contain multi-line entries with embedded newlines, and entries are separated by timestamps. Given an input file like:

git stash
ls | while IFS= read line; do
echo 'line is: ' $line

the script filters out single-line ls, man, and cat commands, producing:

git stash
ls | while IFS= read line; do
echo 'line is: ' $line

Notice that multi-line entries are unfiltered -- I figure if they're interesting enough to warrant multiple lines, they're worth remembering.

I've been reading about sed's multiline capabilities and I'm curious how its hold space and pattern space might be manipulated to achieve the same filtering as my awk script. Rather than use GNU sed's -z flag to treat the whole file as a single massive pattern space, I'm looking for a solution that uses commands such as h, H, x, G, N, etc. to accumulate lines in the hold space and swap or delete lines as necessary.

Here's the Awk script:

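# A timestamp starts a new entry; hold it back until the entry proves worth keeping.
/^#[[:digit:]]{10}$/ {
  timestamp = $0
  histentry = ""
  next
}
$1 ~ /^(ls?|man|cat)$/ {
  if (! timestamp) {
    # Continuation line of a multi-line entry: keep it.
    print
    next
  } else {
    # Hold the line back; it is printed only if more lines follow.
    histentry = $0
    next
  }
}
timestamp {
  print timestamp
  timestamp = ""
}
histentry {
  print histentry
  histentry = ""
}
{ print }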

Since you are only excluding single-line commands, you could just peek ahead one line using the N command and leave out only those entries:

sed '/^#[[:digit:]]\{10\}$/{N; /\nls$/d; /\nman$/d; /\ncat$/d;}' file

Or, with GNU sed or BSD sed:

sed -E '/^#[[:digit:]]{10}$/{N; /\n(ls|man|cat)$/d;}' file

Hm... peeking ahead one line won't let me distinguish a single-line command (which should be excluded if it contains ls|cat|man) from the beginning of a multi-line command (which should be kept even if it contains ls|cat|man).

For example, if the exclusion pattern was "xxx", the following input,


would result in this output:


The second record should have passed through unmodified since it has multiple lines, but instead its head was removed and the rest got tacked onto the previous record.
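
To make the failure concrete, here is a small sketch with a made-up three-entry history. The timestamps, the commands, and the peek_filter name are my own placeholders, not taken from the posts above; the sed command is the -E one-liner with "xxx" as the exclusion pattern.

```shell
# peek_filter: hypothetical wrapper around the peek-ahead one-liner,
# with "xxx" standing in for the excluded command.
peek_filter() {
  sed -E '/^#[[:digit:]]{10}$/{N; /\nxxx$/d;}' "$@"
}

# Made-up sample: the second entry has two lines (xxx, then echo two).
printf '%s\n' \
  '#1500000001' 'echo one' \
  '#1500000002' 'xxx' 'echo two' \
  '#1500000003' 'echo three' |
  peek_filter
```

The timestamp #1500000002 and the xxx line are deleted together, so "echo two" is printed directly under "echo one" as if it belonged to the first entry.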

I was thinking of something like this: when you reach a timestamp, exchange pattern space with hold space (x). Now hold space is ready to start accumulating the upcoming entry, and pattern space holds whichever entry was previously accumulated. Since I have the full entry, I should be able to perform whatever test or substitution is necessary on pattern space to filter out commands I'm not interested in. It gets a bit complicated trying to handle the first and last lines of the file correctly.
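
One way the exchange idea could be written out is sketched below. It is not the script from this thread, just a minimal version under a few assumptions: the file starts with a timestamp line, every entry begins with one, and sed supports -E (GNU or BSD). The filter_history name is made up, and the regex mirrors the awk script's first-field test (a first word of l, ls, cat, or man, optionally followed by arguments) rather than any expression quoted elsewhere in the thread.

```shell
# filter_history: hypothetical helper; reads the named files or stdin.
filter_history() {
  sed -n -E '
    # A timestamp line closes the previous entry and opens a new one.
    /^#[0-9]{10}$/ {
      # Pattern space gets the finished entry, hold space the new timestamp.
      x
      # Nothing accumulated yet (first timestamp of the file).
      /^$/ d
      # Finished entry is a single uninteresting line: drop it.
      /^#[0-9]{10}\n(ls?|cat|man)([[:blank:]][[:blank:][:print:]]*)?$/ d
      p
      d
    }
    # Any other line belongs to the current entry: append it to hold space.
    H
    # At end of input, flush the entry that is still being accumulated.
    $ {
      x
      /^#[0-9]{10}\n(ls?|cat|man)([[:blank:]][[:blank:][:print:]]*)?$/ d
      p
    }
  ' "$@"
}
```

A multi-line entry never matches the delete pattern, because neither [[:blank:]] nor [[:print:]] matches the embedded newline and $ only matches at the very end of pattern space. A trailing timestamp with no command after it is dropped. Usage would be something like filter_history ~/.bash_history > filtered.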

My latest failed attempt:

1,/^#[[:digit:]]{10}$/ {
  /^#[[:digit:]]{10}$/! {

/^#[[:digit:]]{10}$/ {
  /^$/ d
  /\n(ls?|cat|man)([^[:alnum:]][[:print:]]*)?$/ d

/^#[[:digit:]]{10}$/ !{

$ {
  /\n(ls?|cat|man)([^[:alnum:]][[:print:]]*)?$/ d

Your awk script doesn't work here.
I had to change the first line:

/^#[0-9][0-9]*$/ {
uname -a
Linux debian-linux 4.11.0-1-amd64 #1 SMP Debian 4.11.6-1 (2017-06-19) x86_64 GNU/Linux
awk -Wv
mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan
compiled limits:
max NF             32767
sprintf buffer      2040

You can try this with sed; I think it works.

sed -n '/^#[0-9]\{10\}$/{:A;/\(ls *$\)\|\(\ncat \)\|\(\nman \)/b;$p;N;/\n#[0-9]\{10\}$/!bA;h;s/\(^.*\)\(\n.*$\)/\1/;p;x;s/.*\n//;bA}' lefile

It matches cat and man followed by a space (i.e. cat lefile or man tr).
It's harder with ls.

Apparently mawk doesn't support regex repetitions, and maybe not POSIX character classes either.

I couldn't get the desired results from your sed snippet. Not sure why though.


I finally came up with something that works. It's nasty, and I don't doubt there's a better way, but it was satisfying to at least get something working.

$ {
  1 h
  /^#[[:digit:]]{10}\n(ls?|cat|man)([^[:alnum:]][[:print:]]*)?$/ d

/^#[[:digit:]]{10}$/ !{
  1 h

/^#[[:digit:]]{10}$/ {
  /^$/ d
  /^#[[:digit:]]{10}$/ d
  /^#[[:digit:]]{10}\n(ls?|cat|man)([^[:alnum:]][[:print:]]*)?$/ d

I benchmarked it against my original awk script, as well as against the following gsed script:

gsed -z -E 's/(#[0-9]{10}\n(cat|ls?|man)([^[:alnum:]][^\n]*)?\n)+(#[0-9]{10}\n|$)/\4/g' histfile

Run on a ~50,000 line file, I get the following results:

  • sed: 80 milliseconds
  • awk: 70 milliseconds
  • gsed: 60 milliseconds

Indeed, the mawk version that gets installed by distributions supports neither. I think the latest version does, but you would need to get the source and compile it yourself.
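
For awk implementations without interval expressions, one portable fallback that still insists on exactly ten digits is to spell the digits out with plain bracket expressions. A small sketch, where is_timestamp is a hypothetical helper name used only for illustration:

```shell
# is_timestamp: hypothetical helper; succeeds if any input line is "#" plus exactly ten digits.
# Uses only plain bracket expressions: no {10} interval, no [[:digit:]] class.
is_timestamp() {
  awk '/^#[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$/ { found = 1 }
       END { exit !found }'
}

printf '%s\n' '#1501293767' | is_timestamp && echo "looks like a timestamp"
```

It is longer than [0-9]{10}, but unlike [0-9][0-9]* it does not also accept shorter or longer digit runs.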

Your approach also seems to leave out one-line commands that do not contain ls, man, or cat.

Because d directly jumps to the next cycle, and the input line is not modified in the condition branch, the following code does not need a negated condition.

/^#[[:digit:]]{10}$/ !{
  1 h

/^$/ d
/^#[[:digit:]]{10}$/ d
/^#[[:digit:]]{10}\n(ls?|cat|man)([^[:alnum:]][[:print:]]*)?$/ d
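
The point about d can be checked in isolation. In this sketch (the function names and the sample lines are made up), the first script guards its second command with a negated address and the second script does not. Because d ends the cycle for every line it matches, the guard is redundant and both behave the same.

```shell
# with_negation: the second command is explicitly limited to lines that are not comments.
with_negation() {
  sed -e '/^#/d' -e '/^#/!s/^/> /'
}

# without_negation: lines matching /^#/ never reach the second command,
# because d has already started the next cycle.
without_negation() {
  sed -e '/^#/d' -e 's/^/> /'
}

printf '%s\n' '#one' 'a' '#two' 'b' | with_negation
printf '%s\n' '#one' 'a' '#two' 'b' | without_negation
```

Both calls print "> a" and "> b".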