Manipulating words based on their contents

ryanfx · November 19, 2009, 1:47pm

Hello everyone. This is my first post here, so please tell me if my question could be formatted better so that my future posts are clearer.

I have a large text file that I need parsed in one specific way. I have done the rest of the processing and am only missing this last step.

I need to find the lines in a file that begin with 'Key:' and, on those lines, delete every word that does not contain the character '@', ignoring the first word (Key:).

So for example:

EDIT: Fixed example

before:

...
Key: hsfkfs,  sdf@fsdfsdfm,  sfsfdsdfk...334, tester@3, joe
Joe:  hp-0098, fhnjf@jhgsg,  shg@hjgppp
...

after:

...
Key:  sdf@fsdfsdfm, tester@3,
Joe:  hp-0098, fhnjf@jhgsg,  shg@hjgppp
...

I would prefer a stream-processing solution (a single awk or sed command) to for-looping through the file, but I do not know if that is possible. I have explored possibilities with many tools, but I am running out of ideas.

Any help is appreciated.

Scott · November 19, 2009, 2:16pm

sed "/^Key/s/\w[a-z0-9]*@[a-z0-9]*[ ,]*//g" file1

Key: hsfkfs,  sfsfdsdfk...334, joe
Joe:  hp-0098, fhnjf@jhgsg,  shg@hjgppp

ryanfx · November 19, 2009, 2:21pm

I'm terribly sorry, my example was actually reversed; it needs to be the other way around (my original text was correct). It needs to LEAVE all words with @ in them and delete everything else. I'm sorry for the error.

Is there a way to reverse results (much like grep -v)?

Scott · November 19, 2009, 2:27pm

No probs.

Here's some horrible awk, until I fix your sed!

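awk '
  /^Key/ {
    # Pass fields as printf arguments, not as the format string,
    # so a literal % in the data cannot be misread as a format.
    printf "%s", $1
    for( I = 2; I <= NF; I++ )
      if ( $I ~ /@/ ) printf " %s", $I
    print ""
    next
  }
  1
' file1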

Key: sdf@fsdfsdfm, tester@3,
Joe:  hp-0098, fhnjf@jhgsg,  shg@hjgppp

m1xram · November 19, 2009, 2:27pm

What is the purpose of the filter? It looks like a good way to collect emails for spamming.

ryanfx · November 19, 2009, 2:33pm

I gave the @ symbol as an arbitrary symbol; its meaning was simply an example. If I have to come to unix forums for help parsing large amounts of text to enhance my spamming capabilities, one would conclude I am a terrible spammer!

I also see your sed statements work off a-z and 0-9. How would one account for words that may also contain shift + number characters such as '#$%-', when their order within the word is unknown? Can you match special characters with a range like [!-&], much as [a-z] does for letters? Does one have to list them in correct ASCII order?

I appreciate all of your help!
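
On the bracket-expression question above, here is a minimal sketch rather than a definitive answer. It assumes a grep that supports -o (GNU and BSD grep do), and the sample line and the function names listed, strip_range and negated are made up for illustration. Inside [...] the order of individually listed characters does not matter; only a range such as a-z or !-& depends on ordering, which is why the locale is pinned to C there. A literal '-' is safest as the last character in the brackets.

```shell
# Hypothetical sample line; the punctuation is made up for illustration.
line='Key: a#b$c, x%y@z-1, plain'

# 1. Listed characters can appear in any order; keep a literal '-' last.
listed() { LC_ALL=C grep -o '[a-z0-9$#%-]*@[a-z0-9%#$-]*'; }

# 2. A range like !-& follows code point order, so pin the locale to C.
strip_range() { LC_ALL=C sed 's/[!-&]//g'; }

# 3. Often simpler: describe what a word may NOT contain.
negated() { grep -o '[^ ,]*@[^ ,]*'; }

printf '%s\n' "$line" | listed        # prints x%y@z-1
printf '%s\n' "$line" | negated       # prints x%y@z-1
printf '%s\n' 'a!b#c&d' | strip_range # prints abcd
```

The negated form in step 3 is usually the least fragile, since it does not require knowing in advance which special characters may show up inside a word.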

m1xram · November 19, 2009, 4:12pm

# Filter out non @ words on Key: records.

cat FILE | sed '/^Key:/s/ \+[^@ ]\+[@][^@ ]\+ \?//g'

# This is crude though, an AWK solution that splits and trims first by ':'
# and then by ',' would be much more exact. Will work on it.
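
The split-and-trim idea described in the comment above could look roughly like the following. This is only a sketch under assumptions: the function name keep_at_fields is hypothetical, values are assumed to be comma-separated, and the kept values are rejoined with ", " (so the original spacing and any trailing comma are not preserved).

```shell
# keep_at_fields is a hypothetical helper name, not something from the thread.
keep_at_fields() {
  awk '
    /^Key:/ {
      rest = $0
      sub(/^Key:[ \t]*/, "", rest)            # drop the label and blanks after it
      n = split(rest, item, /[ \t]*,[ \t]*/)  # split on commas, trimming blanks
      out = ""
      for (i = 1; i <= n; i++)
        if (item[i] ~ /@/)                    # keep only values containing @
          out = out (out == "" ? "" : ", ") item[i]
      print "Key: " out
      next
    }
    { print }                                 # every other line passes through
  ' "$@"
}
```

Usage would be something like `keep_at_fields FILE`, or the function can read from a pipe when no file is given.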

---------- Post updated at 02:12 PM ---------- Previous update was at 12:44 PM ----------

# Ok, how about a PERL solution?
# Save below program as SOMETHING.pl
chmod +x SOMETHING.pl

# Example:
cat FILE | ./SOMETHING.pl -v -p '@'
# -v, Reverse match.
# -p PATTERN, Default is '@'
----------------------------------------------------------

#!/usr/bin/perl 

use Getopt::Std;

my (@in, @word, @out);
my $pat;
my $reverse = 0;

getopts("vp:");
if (defined($opt_v)) {
    $reverse = 1;
}
if (defined($opt_p)) {
    $pat = $opt_p;
} else {
    $pat = '@';
}

while (my $line = <STDIN>) {
    chomp $line;
    if ($line =~ /^Key:/) {
        # Split at the first colon only, so later colons stay in the data.
        @in = split(/: */, $line, 2);
        # A bare "Key:" line yields an empty word list.
        @word = split(/, */, defined $in[1] ? $in[1] : '');
        @out = ();
        foreach my $item (@word) {
            # Keep matching words, or non-matching ones when -v is given.
            if ($reverse == 0 && $item =~ m/$pat/) {
                push(@out, $item);
            } elsif ($reverse == 1 && $item !~ m/$pat/) {
                push(@out, $item);
            }
        }
        print "$in[0]: ", join(", ", @out), "\n";
    }
    else {
        print "$line\n";
    }
}
exit 0;

steadyonabix · November 19, 2009, 5:40pm

This seems to work:

nawk '
  { ln = $0}
  /^Key:/{
    ln="Key: "
    for( i = 1; i <= NF; i++)
      if (match($i, /@/))
        ln = ln" "$i
  }
  { sub(/^ /, "", ln)
    print ln
  }
' infile

ryanfx · November 19, 2009, 7:26pm

Thank you everyone, this was great information!

Just out of curiosity, I think the first sed command would have actually worked if the action were negated (since it seemed to do the exact opposite of what was required). Does anyone have an idea of how that action would be properly negated? Or would you negate the regex instead and keep the action?

I would love to know just to further my own knowledge.
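
One possible answer, offered as a sketch: as far as I know, sed has no grep -v style switch for a substitution. The `!` modifier inverts an address (which lines a command runs on), not the regex inside s///. So the usual route is the second one: keep the delete action and rewrite the regex so that it describes a word without '@'. The version below assumes a sed that accepts -E (GNU and BSD sed do), and keep_at_words is a hypothetical name. It removes one unwanted word per pass and loops with `t` until nothing more matches, because a single s///g pass cannot reuse the space between two adjacent unwanted words.

```shell
# keep_at_words is a hypothetical helper name, not something from the thread.
keep_at_words() {
  # \1 = "Key:" plus everything kept so far (never ending in whitespace)
  # then: whitespace, one word with no '@', and \3 = the following blank or end of line
  sed -E -e ':again' \
         -e '/^Key:/s/^(Key:(.*[^[:space:]])?)[[:space:]]+[^@[:space:]]+([[:space:]]|$)/\1\3/' \
         -e 't again' "$@"
}

printf '%s\n' 'Key: hsfkfs,  sdf@fsdfsdfm,  sfsfdsdfk...334, tester@3, joe' \
              'Joe:  hp-0098, fhnjf@jhgsg,  shg@hjgppp' | keep_at_words
# prints:
# Key:  sdf@fsdfsdfm, tester@3,
# Joe:  hp-0098, fhnjf@jhgsg,  shg@hjgppp
```

Anchoring on `^(Key:` is what protects the first word, and lines that do not start with Key: never match the address, so they pass through unchanged.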