Trying to find if a file content/output contains phrases which may contain special characters

Hello,

SHORT DESCRIPTION:
I have a bash script command:
unwanted=$(cat /dev/shm/lastmessage|grep -oE '01[[:alnum:]]{64}|B01[[:alnum:]]{64}[[:punct:]]?|/abc/|badphrase|$1' 2>/dev/null)
and since the number of phrases grep has to check is increasing, I wanted to move the phrases (maybe not those that contain alnum, since grep needs to treat them differently from the rest) into a separate file, one per line. The phrases in that file may contain ANY special characters like ", ’ \ / ; $ ?
It should be efficient even for a huge list of bad phrases. Any idea what the command might look like?

REST IS A LONG DESCRIPTION, POSSIBLY NOT NEEDED TO SPEND TIME READING IT:
I am hosting a community for an anonymous messenger called Session. Its profanity blocklist function is badly made: it does not block phrases containing special characters, and it causes a slow start of the server if the profanity list is too long. The developers are inactive, so since I do not know how to prevent an SQLITE3 INSERT command containing a bad phrase, I am looking for a way to delete a message containing a bad phrase post-INSERT. Not optimal, but better than spam.

Currently my bash script contains the following code to insert the last message content into a file:

# discover last message in a DB table and insert it into a file
for i in 1 2 3 4 5; do sqlite3 /var/lib/session-open-group-server/sogs.db 'SELECT * FROM message_details ORDER BY id DESC LIMIT 1;' && echo "Success selecting msg" && break || echo "Failure selecting msg, attempt $i/5" && sleep 0.1; done > /dev/shm/lastmessage

Then the script continues. I have tried two variants and neither is working:

A)

mapfile -t phrases < /var/lib/session-open-group-server/profanity.txt
pattern=$(printf "%s|" "${phrases[@]}")
unwanted=$(grep -oF "$pattern|$1" /dev/shm/lastmessage 2>/dev/null)

B)

phrases=$(cat /var/lib/session-open-group-server/profanity.txt)
unwanted=$(grep -oF "$phrases|$1" /dev/shm/lastmessage 2>/dev/null)

Inside the profanity.txt file, I have:

specphrasee"'e/d\df
01[[:alnum:]]{64}
B01[[:alnum:]]{64}[[:punct:]]?
/abc/
badphrase

The 2nd and 3rd phrases are regular expressions, so I probably need to separate them from the others, which should be treated as fixed strings.

QUESTION: do you have any idea how to solve this, and what the bash script code should look like?

What works for me is:
unwanted=$(cat /dev/shm/lastmessage|grep -oE '01[[:alnum:]]{64}|B01[[:alnum:]]{64}[[:punct:]]?|/abc/|badphrase|$1' 2>/dev/null)

yet I wanted to make a longer list of phrases, so it seems more practical to have these inside a separate file, one per line. The whole task of checking, let's say, a 5MB file of thousands of phrases against the output (the last posted message) should be CPU/memory/disk efficient, since it will be done every couple of seconds.

You can format it on several lines:

unwanted=$(
  grep -Eo '01[[:alnum:]]{64}
B01[[:alnum:]]{64}[[:punct:]]?
/abc/
badphrase
$1' < /dev/shm/lastmessage
)

Now it's easy to add more phrases.
Be careful to not have an empty line!
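
To illustrate the warning about empty lines, here is a small sketch. The temporary files and the sample phrases are made up for the demonstration, and it assumes a grep that accepts -f (GNU or BSD). An empty pattern matches every line, so a check like grep -q would flag every message:

```shell
#!/bin/bash
# Sketch: why an empty line in a pattern list is dangerous.
# "patterns" and "cleaned" are hypothetical temporary files.
patterns=$(mktemp)
cleaned=$(mktemp)
printf '%s\n' 'badphrase' '' '/abc/' > "$patterns"

# The empty pattern matches every line, so a harmless message "matches".
if printf '%s\n' 'harmless text' | grep -qE -f "$patterns"; then
    with_blank=match
else
    with_blank=nomatch
fi

# Dropping the empty lines first restores the expected behaviour.
grep -v '^$' "$patterns" > "$cleaned"
if printf '%s\n' 'harmless text' | grep -qE -f "$cleaned"; then
    without_blank=match
else
    without_blank=nomatch
fi

echo "with empty line: $with_blank, without: $without_blank"
rm -f "$patterns" "$cleaned"
```

Filtering the list with grep -v '^$' before using it is one simple way to guard against this.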

BTW in the shell a 'string' quotes everything inside, only a ' needs to be escaped '\'' (tick backslash tick tick).

If you want to use a text file (it's newline-separated):

grep -Eof /var/lib/session-open-group-server/profanity.txt < /dev/shm/lastmessage

Last but not least, the following is okay if you stick to the newline separator:

phrases=$(cat /var/lib/session-open-group-server/profanity.txt)
unwanted=$(
  grep -Fo "$phrases
$1" /dev/shm/lastmessage
)

In a "string" the shell substitutes $-expressions, i.e. $phrases and $1. A literal $ must be escaped as \$
And grep -F, unlike grep -E, does not interpret RE phrases.
Unfortunately the modes -F (fgrep) and -E (egrep) are mutually exclusive.
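
Since one grep call cannot mix the two modes, one possible workaround is to keep two lists and run grep twice. This is only a sketch: the function name find_unwanted and the two list files (one with fixed strings, one with regular expressions) are assumptions, not something from the original script, and -a (treat binary data as text) is assumed to be available as in GNU grep.

```shell
#!/bin/bash
# Sketch: scan one message file against two pattern lists, because grep
# cannot combine -F and -E in a single invocation.
# find_unwanted and its three arguments are hypothetical names.
find_unwanted() {
    local message=$1 fixed=$2 regex=$3
    {
        # Literal phrases: every character, including \ " ' $, is taken as-is.
        grep -aoF -f "$fixed" -- "$message"
        # Extended regular expressions, e.g. 01[[:alnum:]]{64}
        grep -aoE -f "$regex" -- "$message"
    } 2>/dev/null | sort -u    # merge both result sets, drop duplicates
}
```

It could be called as unwanted=$(find_unwanted /dev/shm/lastmessage fixed.txt regex.txt), and $unwanted stays empty when nothing matched.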


Perhaps you want the following?
Two distinct files for plain phrases and RE patterns.

#!/bin/bash
# Pass plain phrases file and RE patterns file to awk variables
awk -v plainphrases="./plainphrases" -v repatterns="./repatterns" '
  BEGIN {
# Plain array ("> 0" ends the loop at EOF and also on a read error)
    while ((getline line < plainphrases) > 0) { pp[++ppi]=line }
# RE array
    while ((getline line < repatterns) > 0) { rp[++rpi]=line }
  }
# Main loop
  {
# Plain matches, eliminate duplicates
    for (p in pp) if (index($0, pp[p])) { uniq[pp[p]] }
# RE matches, eliminate duplicates
    for (r in rp) if (match($0, rp[r])) { uniq[substr($0, RSTART, RLENGTH)] }
  }
  END {
# Print the collected results
    for (u in uniq) { print u }
  }
' "./inputfile"

Instead of printing immediately (as grep does), it stores the output in another array named uniq, which eliminates duplicates. The array is printed at the END.


Thank you @MadeInGermany for the suggestions. I have not used the suggestion from the 1st post

grep -Eof /var/lib/session-open-group-server/profanity.txt < /dev/shm/lastmessage

since somehow it was unable to match specphrasee"'e/d\df, but I have used the one from the second post, even though it looks more complicated. I only had to replace "awk" with "LC_ALL=C awk", otherwise it complained "Invalid multibyte data detected. There may be a mismatch between your data and your locale" (the file contains some kind of binary/non-printable characters).

I could also wrap the whole awk command in a variable, meaning that it started with:

output=$(LC_ALL=C awk
and ended with:
' "./inputfile")

and then I checked whether $output is non-empty and executed a command if so, expecting that any STDERR output won't be included in $output.

It appears to be working. Thank you


You are welcome.
Regarding the specphrasee"'e/d\df:
grep -E interprets the \d as an escaped d; the \ is dropped.
You would need \\ for it to be treated as a literal \
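
A small sketch of that difference, using the phrase from the question. The variable names are made up for the demonstration, and the sed call is just one possible way to double the backslashes:

```shell
#!/bin/bash
# Sketch: the same phrase used as a fixed string (-F) and as an ERE (-E).
phrase='specphrasee"'\''e/d\df'      # contains " ' / and a backslash
message="spam $phrase spam"

# -F: the backslash is an ordinary character, so the phrase is found.
fixed_hit=$(printf '%s\n' "$message" | grep -oF -e "$phrase")

# -E: \d is not read as a literal backslash followed by d, so nothing is
# found (newer GNU grep versions also warn about the stray backslash).
regex_hit=$(printf '%s\n' "$message" | grep -oE -e "$phrase" 2>/dev/null) || true

# -E with every backslash doubled: \\ stands for one literal backslash.
escaped=$(printf '%s\n' "$phrase" | sed 's/\\/\\\\/g')
escaped_hit=$(printf '%s\n' "$message" | grep -oE -e "$escaped")

printf 'fixed:   %s\nregex:   %s\nescaped: %s\n' "$fixed_hit" "$regex_hit" "$escaped_hit"
```

For phrases that are meant literally, -F avoids the escaping question altogether.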

"Invalid multibyte" is an interesting finding. Maybe a bug? You found the work-around!

Either this, or try reading profanity.txt from the second row onwards, e.g.

grep -Eof <(tail -n +2 /var/lib/session-open-group-server/profanity.txt) < /dev/shm/lastmessage

No, stderr is never included (by default) in the value of a variable assigned through command substitution, unless it's explicitly redirected, e.g. with var=$(ENV_VAR=spec command "args" 2>&1)
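
A quick way to see this for yourself. The function noisy is a hypothetical stand-in for the awk or grep command:

```shell
#!/bin/bash
# Sketch: command substitution captures stdout only.
# noisy is a hypothetical stand-in for the real awk/grep command.
noisy() {
    echo "match"            # goes to stdout
    echo "warning" >&2      # goes to stderr
}

plain=$(noisy)              # "warning" still reaches the terminal/log
both=$(noisy 2>&1)          # stderr explicitly merged into stdout

# The non-empty check therefore only reacts to real output.
if [ -n "$plain" ]; then
    echo "unwanted content found: $plain"
fi
```

Here $plain holds only the stdout line, while $both holds both lines.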

Hi @centosadmin,

you could scan the db and limit the search to only those messages that have arrived since the last scan. I guess there aren't thousands of new messages arriving every minute? Then you could start with this first idea:

#!/bin/bash
#
# no lockfile/systemd.service yet

ARG=$1
DBFILE="dbfile"
IDFILE="idfile"
BADFILE="badfile"
LOGFILE="logfile"
SLEEP=5

[[ -s $IDFILE ]] || echo 0 > $IDFILE

while :; do
    id_last=$(< "$IDFILE")
    # sqlite3 separates columns with | by default, so split on that
    while IFS='|' read -r id msg; do
        grep -q -E -f "$BADFILE" <<< "$msg" && echo "$id;$msg"
        echo "$id" > "$IDFILE"
    done < <(sqlite3 "$DBFILE" "select id, msg from messages where id > $id_last order by id") >> "$LOGFILE"
    [[ ! $ARG =~ 1 ]] && sleep "$SLEEP" || break
done

For testing, copy the db to your working dir, adjust the paths & the sql command, and start the script via bash script.sh -1 (oneshot). This could take an hour or so, depending on the size of the db and of the pattern file. But you could put a msg id in $IDFILE so that the scan starts at that id. You could also add some debug info, like printing timestamps or the current msg id to stderr.
A 2nd run with -1 should run (much) faster. Then start it without an arg so that it runs forever (abort with CTRL-c). Test it from another console by using sqlite to insert a msg containing a badphrase into the db copy. Of course grep takes its time (even though it's very fast); matching against thousands of patterns or regexes is heavy work for the CPU.

The script doesn't delete any messages (yet), it only writes to $LOGFILE. That could be automated, but automatic deletion is a bit critical due to possible false positives. Remember: When testing, always use the db copy.

And, of course, constantly opening & closing the db is not efficient. Also, I couldn't test the script, because I don't have a sogs-db or a big pattern file at hand.
