Trying to find if a file content/output contains phrases which may contain special characters

centosadmin · January 30, 2024, 10:23am

Hello,

SHORT DESCRIPTION:
I have a bash script command:
unwanted=$(cat /dev/shm/lastmessage|grep -oE '01[[:alnum:]]{64}|B01[[:alnum:]]{64}[[:punct:]]?|/abc/|badphrase|$1' 2>/dev/null)
and since the number of phrases that grep checks is increasing, I wanted to move the phrases (maybe not those that contain alnum, since grep needs to treat them differently from the rest) into a separate file, one per line. The phrases in that file may contain ANY special characters like ", ’ \ / ; $ ?
It should stay efficient even for a huge list of bad phrases. Any idea what the command might look like?

REST IS A LONG DESCRIPTION, POSSIBLY NOT NEEDED TO SPEND TIME READING IT:
I am hosting a community for an anonymous messenger called Session. Its profanity blocklist function is badly made: it does not block phrases containing special characters, and it causes a slow start of the server if the profanity list is too long. The developers are inactive, and since I do not know how to prevent an SQLITE3 INSERT command containing a bad phrase, I am looking for a way to delete a message containing a bad phrase post-INSERT. Not optimal, but better than spam.

Currently my bash script contains following code to insert last message content to a file:

# discover last message in a DB table and insert it into a file
for i in 1 2 3 4 5; do sqlite3 /var/lib/session-open-group-server/sogs.db 'SELECT * FROM message_details ORDER BY id DESC LIMIT 1;' && echo "Success selecting msg" && break || echo "Failure selecting msg, attempt $i/5" && sleep 0.1; done > /dev/shm/lastmessage

Then the script continues. I have tried two variants and neither works:

A)

mapfile -t phrases < /var/lib/session-open-group-server/profanity.txt
pattern=$(printf "%s|" "${phrases[@]}")
unwanted=$(grep -oF "$pattern|$1" /dev/shm/lastmessage 2>/dev/null)

B)

phrases=$(cat /var/lib/session-open-group-server/profanity.txt)
unwanted=$(grep -oF "$phrases|$1" /dev/shm/lastmessage 2>/dev/null)

Inside a profanity.txt file, i have:

specphrasee"'e/d\df
01[[:alnum:]]{64}
B01[[:alnum:]]{64}[[:punct:]]?
/abc/
badphrase

The 2nd and 3rd phrases are what may be called regular expressions, so I probably need to separate them from the others, which should be treated as fixed strings.

QUESTION: do you have any idea how to solve this, and what the bash script code should look like?

What works for me is:
unwanted=$(cat /dev/shm/lastmessage|grep -oE '01[[:alnum:]]{64}|B01[[:alnum:]]{64}[[:punct:]]?|/abc/|badphrase|$1' 2>/dev/null)

yet I wanted to make a longer list of phrases, so it seems more practical to keep them in a separate file, one per line. The whole task of checking, let's say, a 5MB file of thousands of phrases against the output (the last posted message) should be CPU/memory/disk efficient, since it will be done every couple of seconds.

MadeInGermany · January 30, 2024, 10:50am

You can format it on several lines:

unwanted=$(
  grep -Eo '01[[:alnum:]]{64}
B01[[:alnum:]]{64}[[:punct:]]?
/abc/
badphrase
$1' < /dev/shm/lastmessage
)

Now it's easy to add more phrases.
Be careful to not have an empty line!

BTW in the shell a 'string' quotes everything inside, only a ' needs to be escaped '\'' (tick backslash tick tick).
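The quoting rule above can be tried out with a small sketch. The sample phrase is an assumption made up for illustration; it is not taken from the real profanity list.

```shell
#!/bin/bash
# Inside '...' every character is literal, including ", \ and $.
# The sequence '\'' closes the quote, adds one escaped tick,
# and reopens the quote again.
phrase='spec"'\''e/d\df$1'

# printf with %s prints the value without interpreting backslashes.
printf '%s\n' "$phrase"
```

Here the $1 is not expanded, because it sits inside single quotes.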

If you want to use a text file (it's newline-separated):

grep -Eof /var/lib/session-open-group-server/profanity.txt < /dev/shm/lastmessage

Last but not least, the following is okay if you stick to the newline separator:

phrases=$(cat /var/lib/session-open-group-server/profanity.txt)
unwanted=$(
  grep -Fo "$phrases
$1" /dev/shm/lastmessage
)

Where in a "string" the shell substitutes $-expressions, i.e. $phrases and $1. A literal $ must be escaped as \$
And grep -F, unlike grep -E, does not interpret RE phrases; it treats every line as a fixed string.
Unfortunately the -F (fgrep) and -E (egrep) modes are mutually exclusive, so a single grep call cannot mix both.
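Since the two modes cannot be combined in one call, one possible approach is to keep two pattern files and run grep twice. This is only a sketch: the file names literal.txt, regex.txt and message, as well as the sample data, are hypothetical and created in a temporary directory.

```shell
#!/bin/bash
# Hypothetical sample files, created only for this illustration.
workdir=$(mktemp -d)
printf '%s\n' 'specphrasee"'\''e/d\df' '/abc/' 'badphrase' > "$workdir/literal.txt"
printf '%s\n' '01[[:alnum:]]{64}' > "$workdir/regex.txt"
printf '%s\n' 'hello specphrasee"'\''e/d\df world' > "$workdir/message"

# Pass 1 treats every pattern line as a fixed string (-F),
# pass 2 treats every pattern line as an extended regex (-E).
unwanted=$(
  grep -Fof "$workdir/literal.txt" "$workdir/message"
  grep -Eof "$workdir/regex.txt" "$workdir/message"
)
printf '%s\n' "$unwanted"

rm -rf "$workdir"
```

Both passes write into the same command substitution, so $unwanted is non-empty if either file produced a match.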

MadeInGermany · January 30, 2024, 2:23pm

Perhaps you want the following?
Two distinct files for plain phrases and RE patterns.

#!/bin/bash
# Pass plain phrases file and RE patterns file to awk variables
awk -v plainphrases="./plainphrases" -v repatterns="./repatterns" '
  BEGIN {
# Plain array; "> 0" ends the loop if the file is unreadable (getline returns -1)
    while ((getline line < plainphrases) > 0) { pp[++ppi]=line }
# RE array
    while ((getline line < repatterns) > 0) { rp[++rpi]=line }
  }
# Main loop
  {
# Plain matches, eliminate duplicates
    for (p in pp) if (index($0, pp[p])) { uniq[pp[p]] }
# RE matches, eliminate duplicates
    for (r in rp) if (match($0, rp[r])) { uniq[substr($0, RSTART, RLENGTH)] }
  }
  END {
# Print the collected results
    for (u in uniq) { print u }
  }
' "./inputfile"

Instead of immediate printing (like grep does) it stores the output in another array named uniq, eliminating duplicates. It is printed at the END.

centosadmin · January 31, 2024, 2:07pm

Thank you @MadeInGermany for the suggestions. I have not used the suggestion from the 1st post

grep -Eof /var/lib/session-open-group-server/profanity.txt < /dev/shm/lastmessage

since somehow it was unable to match specphrasee"'e/d\df, but I have used the one from the second post, even though it looks more complicated. I only had to replace "awk" with "LC_ALL=C awk"; otherwise it complained "Invalid multibyte data detected. There may be a mismatch between your data and your locale" (the file contains some kind of binary/non-printable characters).

I could also "wrap" the whole awk command in a variable, meaning that it started:

output=$(LC_ALL=C awk
and ended:
' "./inputfile")

and then I checked whether $output is non-empty and executed a command if so, expecting that any STDERR output won't be included in $output

It appears to be working. Thank you

MadeInGermany · January 31, 2024, 2:29pm

You are welcome.
Regarding the specphrasee"'e/d\df
grep -E interprets the \d as an escaped d; the \ is dropped.
You would need \\ for it to be treated as a literal \
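A small sketch of the two ways around this, using a made-up input line: either search with -F, or double every backslash before handing the phrase to -E. The variable names are chosen for illustration only.

```shell
#!/bin/bash
# Sample line and phrase, assumed for illustration.
line='xx specphrasee"'\''e/d\df yy'
pattern='specphrasee"'\''e/d\df'

# -F: the pattern is a fixed string, so the backslash matches itself.
fixed=$(printf '%s\n' "$line" | grep -Fo -- "$pattern")

# -E: double each backslash first, so the regex sees \\ (a literal backslash).
escaped=${pattern//\\/\\\\}
regex=$(printf '%s\n' "$line" | grep -Eo -- "$escaped")

printf '%s\n' "$fixed" "$regex"
```

Doubling the backslashes is enough for this phrase only; a phrase containing other regex metacharacters such as . * [ ( would need those escaped as well, which is why -F is the simpler choice for literal phrases.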

"Invalid multibyte" is an interesting finding. Maybe a bug? You found the work-around!

Matt-Kita · January 31, 2024, 2:38pm

Either this, or try reading profanity.txt from the second row onwards, e.g.

grep -Eof <(tail -n +2 /var/lib/session-open-group-server/profanity.txt) < /dev/shm/lastmessage

No, stderr is never included (by default) in the value of a variable assigned through command substitution, unless it is explicitly redirected, e.g. with var=$(ENV_VAR=spec command "args" 2>&1)
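This can be checked with a short sketch. The function emit is a hypothetical helper that writes one line to each stream.

```shell
#!/bin/bash
# Hypothetical helper: one line to stdout, one line to stderr.
emit() { echo "to stdout"; echo "to stderr" >&2; }

# Default: only stdout is captured; stderr still goes to the terminal.
only_out=$(emit)

# With an explicit redirect, stderr is captured too.
both=$(emit 2>&1)

printf 'only_out=[%s]\n' "$only_out"
printf 'both=[%s]\n' "$both"
```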

bendingrodriguez · January 31, 2024, 10:24pm

Hi @centosadmin,

you could scan the db and limit the search to only those messages that have arrived since the last scan. I guess there aren't thousands of new messages arriving every minute? Then you could start with this first idea:

#!/bin/bash
#
# no lockfile/systemd.service yet

ARG=$1
DBFILE="dbfile"
IDFILE="idfile"
BADFILE="badfile"
LOGFILE="logfile"
SLEEP=5

[[ -s $IDFILE ]] || echo 0 > $IDFILE

while :; do
    id_last=$(< $IDFILE)
    while read -r id msg; do
        grep -q -E -f $BADFILE <<< "$msg" && echo "$id;$msg"
        echo "$id" > $IDFILE
    done < <(sqlite3 $DBFILE "select id, msg from messages where id > $id_last order by id") >> $LOGFILE
    [[ ! $ARG =~ 1 ]] && sleep $SLEEP || break
done

For testing, copy the db to your working dir, adjust the paths & the sql command and start the script via bash script.sh -1 (oneshot). This could take an hour or so, depending on the size of the db and the pattern file. But you could put a msg id in $IDFILE, so that the scan starts at that id. You could also add some debug info like printing timestamps or the current msg id to stderr.
A 2nd run with -1 should run (much) faster. Then start it without an arg so that it runs forever (abort with CTRL-c). Test it on another console by using sqlite to insert a msg containing a bad phrase into the db copy. Of course grep takes its time (even though it's very fast); matching against thousands of patterns or regexes is heavy work for the CPU.

The script doesn't delete any messages (yet), it only writes to $LOGFILE. That could be automated, but automatic deletion is a bit critical due to possible false positives. Remember: When testing, always use the db copy.

And of course, constantly opening & closing the db is not efficient. Also, I couldn't test the script, because I don't have a sogs db or a big pattern file at hand.

system · November 26, 2024, 10:24pm

This topic was automatically closed 300 days after the last reply. New replies are no longer allowed.

hicksd8 · January 17, 2025, 10:32am

As per OP request. Wants to add update.

centosadmin · January 17, 2025, 12:29pm

This Linux bash script can be used to tell whether the file $filetocheck contains any of the strings (which can consist of possibly any special characters) defined in either of two files (file1, file2). Each string in these two files should be on its own line:

# repatternsfile should contain regular expressions
repatternsfile="/path/to/file1"
# literalstringsfile can contain phrases with special characters, one per line
literalstringsfile="/path/to/file2"

# filetocheck contains the data which we want to check for the presence of phrases listed in the files defined above
filetocheck="/dev/shm/file"
echo "I can populate that file with some test data" > "${filetocheck}"

# check the file for phrases listed in the files $repatternsfile and $literalstringsfile
#       GREP switches explained:
#    -a: treat binary files as text files.
#    -E: enable extended regular expressions (-F treats patterns as fixed strings instead).
#    -i: perform case-insensitive matching.
#    -o: show only the matching part of the line.
#    -m 1: stop reading after the first matching line (it may still yield several -o matches, hence head -n 1).
#    -f FILE: read the patterns to search for from FILE.
found=$(grep -a -E -i -o -m 1 -f "$repatternsfile" "$filetocheck" | head -n 1) # search using extended regexps and store the first match
if [[ ! "$found" ]]; then # no phrase from the first file matched, try the phrases from the other file
    found=$(grep -a -F -i -o -m 1 -f "$literalstringsfile" "$filetocheck" | head -n 1) # search using literal strings and store the first match
fi

if [[ "$found" != "" ]]; then
    echo -e "\nFound matching phrase: $found"
fi