Match text to lines in a file, iterate backwards until text or text substring matches, print to file

hi all,

I'm trying to do this using shell/bash with sed/awk/grep.

I have two files, one containing one column, the other containing multiple columns (comma delimited).

file1.txt
abc12345
def12345
ghi54321
...
file2.txt
abc1,text1,texta
abc,text2,textb
def123,text3,textc
gh,text4,textd
...

I'm trying to take each line in file1 and match it against the first column of file2. Start with the original text; if there is no match, iterate backwards one character at a time (dropping characters from the end) until it matches the first column of file2. For each line of file1, loop through all of file2 and print every line whose first column matches the text or any leading substring of it. The output file3 will essentially contain the original text from file1 concatenated with the matching lines from file2.

output example:

file3.txt
abc12345,abc1,text1,texta 
abc12345,abc,text2,textb
def12345,def123,text3,textc
ghi54321,gh,text4,textd
...

any help would be much appreciated.

Our general policy is for everyone here to "try and write their own script first" and then post what they tried.

Sometimes our members forget and respond like a "script writing service", but that is not our policy.

So please post the code you tried to write "on your own" and any error messages you got.

Thanks!

It is better to cycle through the 1st column in file2 and look for a prefix (value*) match in file1.
In shell this can be done with a case statement.
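
For example, a minimal sketch of that prefix test (the values here are purely illustrative):

f1line=abc12345
f2col1=abc1
case $f1line in
("$f2col1"*) echo "prefix match" ;;
(*)          echo "no match" ;;
esac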

shogun1970

Apart from showing us your attempts at this problem, could you also indicate if the order of records in the output file is important?

The problem description seems to indicate that a direct match record, when available, should be listed first and then other matches should be displayed in the same order as they appear in file1.txt.

However, if the order of records in the output file is unimportant, the solution can be simplified a fair bit.
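
For instance (just a sketch of that simplification, assuming the order really doesn't matter), file1 could be read into an array once and file2 streamed through it, printing matches in whatever order file2 presents them:

awk -F, 'NR==FNR {want[$0]; next} {for (w in want) if (index(w, $1) == 1) print w "," $0}' file1.txt file2.txt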

A shell script that works as I described:

#!/bin/sh
# read from f1, print in this order
while IFS= read -r f1line
do
  # read from f2, find matches => print
  while IFS="," read -r f2col1 f2othercols
  do
    case $f1line in
    ("$f2col1"*)
      echo "$f1line,$f2col1,$f2othercols"
      ;;
    esac
  done < file2.txt
done < file1.txt
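
To get the result into file3.txt, just redirect the script's output when you run it, e.g. (assuming you save it as match.sh; the name is only for illustration):

sh match.sh > file3.txt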

The same idea in awk (file2 is read into an array variable first):

#!/bin/sh
awk -F"," '
{
  if (NR==FNR) {
    # read from f2 into associative array col1[]
    col1[$1]=($2 FS $3)
  } else {
    # read from f1, find matches => print
    for (c in col1)
      if (c == substr($0,1,length(c)))
        print $0 FS c FS col1[c]
  }
}
' file2.txt file1.txt
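
One thing to note about this variant: for (c in col1) visits the array keys in no particular order, so when a file1 line has more than one match (e.g. abc12345 matching both abc1 and abc), those output lines may appear in either order.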

In awk your proposed way (iterating backwards one character at a time) can be implemented with no big overhead:

#!/bin/sh
awk -F"," '
{
  if (NR==FNR) {
    # read from f2 into associative array col1[]
    col1[$1]=($2 FS $3)
  } else {
    # read from f1, find matches => print
    for (i=length($0); i>=1; i--)
      if ((c=substr($0,1,i)) in col1)
        print $0 FS c FS col1[c]
  }
}
' file2.txt file1.txt
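
If you instead wanted only the longest match for each file1 line (rather than every matching prefix, which is what your output example shows), replacing the for loop in the else branch with the version below would stop at the first hit, since it runs from the longest prefix down to the shortest:

    for (i=length($0); i>=1; i--)
      if ((c=substr($0,1,i)) in col1) {
        print $0 FS c FS col1[c]
        break
      }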

Try also

awk -F, 'FNR==NR {T[$0]; next} {for (t in T) if (t ~ $1) print t, $0}' OFS=, file[12].txt
abc12345,abc1,text1,texta
abc12345,abc,text2,textb
def12345,def123,text3,textc
ghi54321,gh,text4,textd

The if (t ~ $1) is an RE match that is "fuzzy" unless it is anchored.
It should be if (t ~ ("^" $1)); the ^ anchor means the string $1 must occur at the beginning of string t.
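
For example, the anchored version would look like this (and, if the first column could ever contain RE metacharacters, index(t, $1) == 1 is a purely literal alternative to the dynamic regex):

awk -F, 'FNR==NR {T[$0]; next} {for (t in T) if (t ~ ("^" $1)) print t, $0}' OFS=, file[12].txt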
