Extract words starting with a pattern from a file

Pramod_009 · September 21, 2013, 12:40pm

Hi Guys..

I have a file and I want to extract all words that start with the pattern 'ABC_' or 'ADF_'.
For example,

ABC.txt
----

INSERT INTO ABC_DLKFJAL_FJKLD
SELECT DISTINCT S,B,C FROM ADF_DKF_KDFJ_IERU8 A, ABC_LKDJFREUE9_FJKDF B
WHERE A.FI=B.EI;
COMMIT;

Output :

ABS_DLKFJAL_FJKLD, ADF_DKF_KDFJ_IERU8, ABS_LKDJFREUE9_FJKDF

Hope the query is clear.
Thanks in Advance.
Pramod

elixir_sinari · September 21, 2013, 12:57pm

Try:

perl -lne 'push @w, /\bA(?:BC|DF)_\w+/g; END{ print join ", ", @w if @w}' ABC.txt
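For comparison, here is a rough sketch of the same idea with grep, paste, and sed instead of perl. It assumes a grep that supports -o and -w (GNU and BSD grep do; neither option is required by POSIX). The function name extract_words and the scratch file are made up for illustration.

```shell
#!/bin/bash

# Hypothetical helper: print every word starting with ABC_ or ADF_
# found in the file "$1", joined by ", " on a single line.
extract_words() {
        # -o prints only the matched text, one match per line;
        # -w accepts a match only if it forms a whole word.
        grep -owE 'A(BC|DF)_[[:alnum:]_]+' "$1" |
        paste -sd, - |          # join the lines with commas
        sed 's/,/, /g'          # add a space after each comma
}

# Sample input from the question, written to a scratch file for the demo.
sample=$(mktemp)
cat > "$sample" <<'EOF'
INSERT INTO ABC_DLKFJAL_FJKLD
SELECT DISTINCT S,B,C FROM ADF_DKF_KDFJ_IERU8 A, ABC_LKDJFREUE9_FJKDF B
WHERE A.FI=B.EI;
COMMIT;
EOF

extract_words "$sample"
rm -f "$sample"
```

With the sample input this should print ABC_DLKFJAL_FJKLD, ADF_DKF_KDFJ_IERU8, ABC_LKDJFREUE9_FJKDF on one line.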

Yoda · September 21, 2013, 1:15pm

Another approach in bash:

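#!/bin/bash

# Read each line into an array so that words are split on whitespace
# without glob expansion (an unquoted $line would expand a bare *).
while read -r -a words
do
        for word in "${words[@]}"
        do
                # Keep words that start with ABC_ or ADF_.
                if [[ "$word" =~ ^(ABC|ADF)_ ]]
                then
                        # Append, adding ", " only when str is non-empty.
                        str="${str:+$str, }$word"
                fi
        done
done < ABC.txt

echo "$str"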

Don_Cragun · September 21, 2013, 1:39pm

Note that you said that you wanted the output:

ABS_DLKFJAL_FJKLD, ADF_DKF_KDFJ_IERU8, ABS_LKDJFREUE9_FJKDF

but you said you wanted to match ABC_, not ABS_, and ABS_ does not appear in your sample input.

If the strings you want to match could contain characters that have special meanings in regular expressions or if you want varying strings with different invocations of your script, here is a way to do it using awk:

printf '%s\n' 'ABC_' 'ADF_' | awk '
FNR == NR {
        # Read patterns to be matched from standard input.
        p[++pc] = $0
        next
}
{       # For every field in the current line...
        for(i = 1; i <= NF;i++)
                # If this field has not already been printed...
                if(!($i in m))
                        # See if this field matches any of the given patterns..
                        for(j = 1; j <= pc; j++)
                                if(substr($i, 1, length(p[j])) == p[j]) {
                                        m[$i]
                                        printf("%s%s", mc ? ", " : "", $i)
                                        mc = 1
                                }
}
END {   printf("%s\n", mc ? "" : "No matches found.");
}' - ABC.txt

As shown, this script will only print one occurrence of each matched word. If you delete the lines that track already-printed words (the if(!($i in m)) test, the comment above it, and the m[$i] line), it will print every occurrence of each matched word. (Your sample input didn't contain duplicates of any words starting with the strings to be matched, but if you duplicate the input, you'll see the difference.)

As always, if you want to try this on a Solaris/SunOS system, use /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk instead of awk.
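The reason the awk script compares with substr() rather than a regular expression is that the given strings are then treated literally. A minimal bash sketch of the same literal-prefix test follows; the helper name starts_with and the sample prefixes are made up for illustration.

```shell
#!/bin/bash

# Hypothetical helper: succeeds if word "$1" starts with the literal
# string "$2".
starts_with() {
        # Quoting "$2" disables pattern interpretation, so characters
        # such as . * [ are compared literally; the unquoted trailing *
        # matches whatever follows the prefix.
        [[ $1 == "$2"* ]]
}

for prefix in 'ABC_' 'A.C_' 'A*'
do
        if starts_with 'ABC_DLKFJAL_FJKLD' "$prefix"
        then
                echo "'$prefix' matches"
        else
                echo "'$prefix' does not match"
        fi
done
```

Only 'ABC_' should be reported as matching here. Used as a regular expression, A.C_ would also have matched the word, which is the kind of surprise the literal comparison avoids.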

Pramod_009 · September 21, 2013, 1:56pm

Thanks Elixir and Don for the help, the code is working perfectly fine.
I apologise for my mistake, Don; it should be ABC_ or ADF_. But I got your point too.

disedorgue · September 21, 2013, 2:14pm

A sed version:

sed ':deb;${s/,\|\n/ /g;s/ \(ABC\|ADF\)/,\1/g;s/[^,]*\(,[^ ]*\)/\1/g;s/^,\| .*//g;b};N;bdeb' ABC.txt

Regards.