Find pattern in first field of file

krsnadasa · October 17, 2015, 12:24am

Hello all

I have two files.

1. Pattern.txt - It contains the patterns to be matched. There is a large number of them.

cat Pattern.txt

Ram
Shyam
Mohan
Jhon

I have another file which has the actual data; the fields in each record are delimited by one or more spaces.

2. Content.txt

cat Content.txt
1@GU00450012@Ram   @@@@ bla1  lba2.
2@GU11950004@David @@@@ uss  Ram 
3@GU11950004@Shyam @@@@ uss   rupa
etc etc

Now I need to find the patterns in Content.txt, but only in the first field. I tried using

grep -F -f pattern.txt content.txt

It returns rows like

2@GU11950004@David @@@@ uss  Ram

because it contains the pattern 'Ram' somewhere in the line.
It seems to work, but it looks for the pattern across the whole line. I need to restrict the search to the first field only.
I know I can store the patterns in an awk array using

NR==FNR

but I am not sure how to search for each of them in the first field of Content.txt only.

Looking for any help.

Thanks

cjcox · October 17, 2015, 1:23am

How about:

#!/bin/sh

#awkcode=`sed 's,\(.*\),$1 ~ /\1/ { print $0 },' <Pattern.txt`
awkcode=`sed 's,\(.*\),$1 ~ /@\1$/ { print $0 },' <Pattern.txt`

awk "
$awkcode
" <Content.txt

Notice that my awkcode line assumes the pattern to match is preceded by @ and must match up to the end of the first field. Use the commented-out awkcode line instead if the name should match anywhere in the first field.

Using the commented-out one, both of the following lines would match the pattern Masters:

8@XXXXXXXX@Masters @@@@ blah masters
8@XXXXXXXX@McMasters @@@@ blah mcmasters

Warning: my quick solution assumes that the patterns are simple and contain no characters that are special to awk regular expressions (such as / or .).
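To see what awk actually ends up running, it can help to print the generated program first. The following is only a sketch: the temporary directory and the two sample patterns are assumptions for illustration, and the sed command is the same as the uncommented awkcode line in the script.

```shell
#!/bin/sh
# Sketch: print the awk program that the sed command generates.
# The temporary directory and sample patterns are illustrative only.
tmpdir=$(mktemp -d)
printf '%s\n' Ram Shyam > "$tmpdir/Pattern.txt"

# sed turns every pattern line into one awk rule on the first field.
awkcode=$(sed 's,\(.*\),$1 ~ /@\1$/ { print $0 },' < "$tmpdir/Pattern.txt")
printf '%s\n' "$awkcode"

rm -rf "$tmpdir"
```

With these two patterns the generated program has one rule per pattern, namely `$1 ~ /@Ram$/ { print $0 }` and `$1 ~ /@Shyam$/ { print $0 }`. A pattern containing `/` would end the regex literal early, which is the kind of breakage the warning is about.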

Aia · October 17, 2015, 2:13am

An Awk version

awk '
FNR==NR {                     # true only while reading the first file, Pattern.txt
    s[$0]                     # load Pattern.txt file into array s
    next                      # move to process next line of Pattern.txt
}
{
    for (p in s){             # iterate each pattern
        if(match($1, p)){     # check pattern for match against first field
            print             # print record if match is found
            next              # stop pattern iteration for this record, match was found already
        }
    }
}' Pattern.txt Content.txt
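Since the pattern list is said to be large, a possible variation is to skip the per-pattern loop and do one exact array lookup per record. This is a sketch under an assumption that the thread does not confirm: the name is taken to be the last "@"-separated piece of the first field. The temporary files only reproduce the sample data from the question.

```shell
#!/bin/sh
# Sketch: exact lookup of the name in an awk array instead of a regex loop.
# Assumption: the name is the last "@"-separated piece of the first field.
tmpdir=$(mktemp -d)
printf '%s\n' Ram Shyam Mohan Jhon > "$tmpdir/Pattern.txt"
cat > "$tmpdir/Content.txt" <<'EOF'
1@GU00450012@Ram   @@@@ bla1  lba2.
2@GU11950004@David @@@@ uss  Ram
3@GU11950004@Shyam @@@@ uss   rupa
EOF

matches=$(awk '
FNR == NR { names[$0]; next }        # first file: remember each pattern
{
    n = split($1, part, "@")         # break the first field on "@"
    if (part[n] in names) print      # one array lookup per record
}' "$tmpdir/Pattern.txt" "$tmpdir/Content.txt")
printf '%s\n' "$matches"

rm -rf "$tmpdir"
```

On the sample data this prints records 1 and 3. Unlike match(), the lookup compares whole strings, so Ram would not match a first field ending in Ramesh, and regex metacharacters in a pattern are treated literally.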

Perl version
Save as search.pl and run as: perl search.pl Pattern.txt Content.txt

#!/usr/bin/perl

# search.pl
# Perl facilities to help avoiding errors
use strict;
use warnings;

# files names to obtain from command line
my $pattern_file = shift or die;
my $context_file = shift or die;

# open pattern file for read
open my $fh, '<', $pattern_file or die;
# load pattern file into an array
my @patterns =<$fh>;
# dismiss patterns file handle
close $fh;
# remove the newline at end of record
chomp(@patterns);

# open context file for read
open $fh, '<', $context_file or die;
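# iterate line by line through the context file
LINE: while(<$fh>) {
    # obtain the first field
    my ($field) = split;
    # skip blank lines, which have no first field
    next LINE unless defined $field;
    # search field for pattern; move to next line if match found
    # (a bare "next" would only advance the inner for loop)
    for my $p (@patterns) {
        $field =~ /$p/ and print and next LINE;
    }
}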
# dismiss context file handle
close $fh;

Klasform · October 17, 2015, 5:15pm

I have been trying to solve this with sed, inline:

sed -n -e '/\@{sed -e '1p' pattern.txt}/p' content.txt

I also tried curly braces in many other combinations, but I just can't get it working.
I like the idea of passing the result of one sed to another with this sub-{sed} convention. This is the code I found that put me in this direction:

sed -e '/<TEXT1>/{r File1' -e 'd}' File2

That particular example is not exactly what I need, but I tried to modify it to fit this problem.
No go!

---------- Post updated at 04:15 PM ---------- Previous update was at 04:01 PM ----------

Earlier I was able to pass it with xargs, but I would still prefer sed only.

sed -n '1p' < pattern.txt | xargs -I output sed -n '/\@output/p' content.txt

Been at it for a few hours.
thanks
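One way to get the "result of one sed inside another" effect is shell command substitution, since braces in a sed script group commands rather than run a nested program. The sketch below is only an illustration under assumptions: the temporary files copy the sample data from the question, and `first` and `hits` are made-up variable names.

```shell
#!/bin/sh
# Sketch: let the shell paste the inner sed's output into the outer sed script.
tmpdir=$(mktemp -d)
printf '%s\n' Ram Shyam > "$tmpdir/pattern.txt"
cat > "$tmpdir/content.txt" <<'EOF'
1@GU00450012@Ram   @@@@ bla1  lba2.
2@GU11950004@David @@@@ uss  Ram
3@GU11950004@Shyam @@@@ uss   rupa
EOF

# Inner sed: take the first pattern only, as in the xargs version.
first=$(sed -n '1p' "$tmpdir/pattern.txt")

# Outer sed: double quotes let the shell expand $first before sed runs.
hits=$(sed -n "/@$first/p" "$tmpdir/content.txt")
printf '%s\n' "$hits"

rm -rf "$tmpdir"
```

On the sample data this prints only record 1, because record 2 has Ram without a leading @. Like the xargs version it handles a single pattern, and any regex metacharacters in the pattern are interpreted by sed.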

RudiC · October 18, 2015, 6:43am

How about

sed  's/^/^[^ ]*/;s/$/ /' pattern | grep -f- content

krsnadasa · October 18, 2015, 10:38am

Many thanks to all

I tried Aia's awk version and it worked for me. However, if RudiC could explain his sed version, that would be very helpful.

Thanks

RudiC · October 18, 2015, 11:58am

Actually, it's a grep version. The sed command just makes sure grep is working on the first field by adding the needed regex parts (no spaces up to the name, trailing space).
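The two steps can be looked at separately. This is a sketch only: the temporary files copy the sample data from the question, `regexes` and `found` are made-up variable names, and it assumes a grep that reads the pattern list from standard input when given `-f -` (GNU grep documents this).

```shell
#!/bin/sh
# Sketch: show the regexes sed builds, then let grep use them.
tmpdir=$(mktemp -d)
printf '%s\n' Ram Shyam > "$tmpdir/pattern"
cat > "$tmpdir/content" <<'EOF'
1@GU00450012@Ram   @@@@ bla1  lba2.
2@GU11950004@David @@@@ uss  Ram
3@GU11950004@Shyam @@@@ uss   rupa
EOF

# Each name becomes: start of line, any run of non-spaces, the name, one space.
regexes=$(sed 's/^/^[^ ]*/;s/$/ /' "$tmpdir/pattern")
printf '%s\n' "$regexes"

# grep reads the generated regexes from stdin and applies them to the data.
found=$(printf '%s\n' "$regexes" | grep -f - "$tmpdir/content")
printf '%s\n' "$found"

rm -rf "$tmpdir"
```

The generated patterns are `^[^ ]*Ram ` and `^[^ ]*Shyam `, each ending in a space. Because `[^ ]*` cannot cross a space, the name has to sit at the end of the first field, so records 1 and 3 are printed and record 2 is not. A record consisting of the first field alone, with no space after it, would not match.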

Klasform · October 18, 2015, 4:08pm

I had the impression that the required result was only the @Ram line, but I see that every line containing Ram should print.

sed -n '1p' < a.txt | xargs -I output sed -n '/output/p' b.txt

So this small adjustment prints every line containing Ram, in any form.
Hope this helps.
I learned a whole lot from your problem, about /dev/stdout and the r (read) command in sed. Mostly, I'm now aware of them and am trying to use them in examples.

krsnadasa · October 19, 2015, 11:38am

Thank you ....