Read a list, find items in a file from the list, change each item

Hello,

I have some tab-delimited text data,
file: final_temp1

aname	val
NAME;r'(1,)	3.28584
r'(2,)<tab>
NAME;r'(3,)	6.13003
NAME;r'(4,)	4.18037
r'(5,)<tab>

You can see that the data is incomplete in some cases: there is a trailing tab after the first column for each incomplete row. I have added the <tab> notation above to make that clear.

I also have a list of the incomplete cases.
file: incomplete_case_list

r'(2,)	
r'(5,)

What I need to do is work through the list of incomplete cases, find the matching row in my file, and alter it. I need to add "NAME;" as a prefix to the first column value, followed by a tab, followed by the word "failed", so that the file ends up looking like this:

aname	val
NAME;r'(1,)	3.28584
NAME;r'(2,)	failed
NAME;r'(3,)	6.13003
NAME;r'(4,)	4.18037
NAME;r'(5,)	failed
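
Just to spell out the target format, the corrected row for one of the incomplete names can be produced with printf (a sketch only, with the name hard-coded):

printf 'NAME;%s\tfailed\n' "r'(2,)"
# prints: NAME;r'(2,)<tab>failed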

I thought I could just loop through the list of incomplete cases and make sed substitutions,

# loop through incomplete file list
while read line; do 

   # remove tab from end of line
   clean_line=$(echo  $line | sed "s/\t//1")

   # create new line
   new_line='NAME;'$clean_line'\t''failed'

   # find original line and replace with modified version
   sed "s/$line/$new_line/1" final_temp1 > final_temp2

   # overwrite original file with modified file to propagate changes forward
   mv final_temp2  final_temp1

done < incomplete_case_list

I am getting a sed error,

sed: -e expression #1, char 160: Invalid range end
sed: -e expression #1, char 168: Invalid range end
sed: -e expression #1, char 134: Invalid range end

I don't think this is from the first sed command (the one substituting the tab), but the error is not very clear to me. In my real files, the values in the name column can contain characters like commas, unmatched single quotes, parentheses, square brackets, and curly braces. I am wondering if sed is rejecting some of these characters. I tried putting double quotes around $line and $new_line in the second sed command, but that doesn't help.
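
If the issue is sed treating characters like [ and * in $line as regular-expression syntax, one workaround is to escape the pattern (and the & character in the replacement) before the substitution. A rough sketch, assuming GNU sed and that the names never contain backslashes, newlines, or the | character (esc_line and esc_new are just names made up for the sketch):

# escape the characters that are special in a basic regular expression
esc_line=$(printf '%s' "$line" | sed 's/[][\.*^$]/\\&/g')

# in the replacement only & needs escaping; the \t is left alone so GNU sed still turns it into a tab
esc_new=$(printf '%s' "$new_line" | sed 's/&/\\&/g')

# use | as the delimiter so any / in a name cannot break the expression
sed "s|$esc_line|$esc_new|" final_temp1 > final_temp2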

I tried replacing the sed line with awk,

awk -v var1="$line" -v var2="$new_line" '{gsub(var1, var2, $0); print}' final_temp1 > final_temp2

This gives me the error,

awk: cmd. line:1: (FILENAME=final_temp1 FNR=1) fatal: Invalid range end: /1-[10-(4-amino-2-methylquinolyl)decyl]-2-methyl-4-quinolylamine_4Np.mol/

That is one of the messy names from the actual data. Is there something in this string that needs to be handled differently? I frequently use both sed and awk with data like this and I have not seen this error before.
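
The first argument to gsub() is treated as a regular expression, so the [10-( in that name opens a bracket expression and 0-( looks like a backwards range, which is where "Invalid range end" comes from. If the match is really against the whole first field, a plain string comparison sidesteps the regex engine. A sketch, assuming $clean_line holds the bare name (tab already stripped, as in the loop above) and that the names contain no backslashes (awk -v interprets escape sequences):

awk -F'\t' -v OFS='\t' -v name="$clean_line" '
    $1 == name { $1 = "NAME;" $1; $2 = "failed" }   # literal string comparison, no regex
    { print }
' final_temp1 > final_temp2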

I am not sure if sed will find the pattern, because the line terminates with a tab and I am not sure that tab is being read into "line" during the while loop. I also don't know whether there is still an end-of-line character there or not. I suppose I could strip out all trailing whitespace characters first.
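
If it is just the trailing tab (and possibly a carriage return) on each line of the list that is the concern, bash parameter expansion can trim it without another call to sed. A sketch for inside the while loop:

line=${line%$'\r'}            # drop a trailing carriage return, if the file came from Windows
clean_line=${line%%$'\t'*}    # keep only the text before the first tab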

The repeated overwriting of the file is also expensive, but it is unlikely that there will ever be very many entries in incomplete_case_list.
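
One way to avoid rewriting final_temp1 once per entry would be a single pass that loads the whole list first and then walks the data file once. A sketch, assuming each entry in incomplete_case_list matches the entire first tab-separated field of the row to fix:

awk -F'\t' -v OFS='\t' '
    NR == FNR  { fail[$1] = 1; next }                 # first file: remember incomplete names
    $1 in fail { $1 = "NAME;" $1; $2 = "failed" }     # second file: rewrite matching rows
    { print }
' incomplete_case_list final_temp1 > final_temp2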

Are there any comments on what I am doing wrong here, or a better method altogether?

Thanks,

LMHmedchem

try this...

awk 'NF<2{$0="NAME;"$0"\tfailed"}1' incomplete_case_list
aname   val
NAME;r'(1,)     3.28584
NAME;r'(2,)     failed
NAME;r'(3,)     6.13003
NAME;r'(4,)     4.18037
NAME;r'(5,)     failed

Thank you for the suggestion. I don't see the name of the file that I am processing here, just the name of the file with the failed rows. Am I missing something?
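
For what it's worth, the one-liner keys on NF<2, i.e. rows that have no second field, so presumably it was meant to be run against the data file itself rather than the list. A guess at the intended invocation, with the trailing tab trimmed so it does not end up doubled:

awk 'NF<2 { sub(/[\t ]*$/, ""); $0 = "NAME;" $0 "\tfailed" } 1' final_temp1 > final_temp2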

The script below works and is pretty fast.

#!/bin/sh

# file with list of names with incomplete output
incomplete_case_list=$1
# file being processed (replace incomplete rows with modified data)
final_temp1=$2
# output file
final_temp2=$3

# read in fail file and create array of names
while read line; do 

   # read tab separated line into array
   unset FIELD;   IFS=$'\t' read -a FIELD <<< "$line"

   # add each name to array
   fail_list=("${fail_list[@]}" "${FIELD[0]}")

done < $incomplete_case_list

# flag to avoid second print if line was replaced
replaced='0'

# loop through all rows of file to check for fail names
# check the name for each row against all names in name array, look for match
while read line; do 

   # read tab separated line into array
   unset FIELD;   IFS=$'\t' read -a FIELD <<< "$line"

   # check current line against each element in array of fail names
   for fail_name in "${fail_list[@]}"
   do

      # check name field (0); if a match is found, print modified line
      if [ "${FIELD[0]}" == "$fail_name" ]; then

         # output modified row to next temp file
         echo -e 'NAME;'${FIELD[0]}'\t''failed' >> $final_temp2

         # set flag to indicate row has been replaced, don't print again
         replaced='1'
      fi
   done

   # if name was not found in the fail array, print original line
   if [ "$replaced" == '0' ]; then
      echo -e ${FIELD[0]}'\t'${FIELD[1]} >> $final_temp2
   fi
   # reinitialize flag
   replaced='0'

done < $final_temp1
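
The nested loop compares every row against every failed name. If the script really runs under bash (the arrays already require it) and bash is version 4 or newer, an associative array turns that inner loop into a single lookup. A sketch of the same logic:

declare -A fail_set

# remember each failed name as a key
while IFS=$'\t' read -r name _; do
   [ -n "$name" ] && fail_set["$name"]=1
done < "$incomplete_case_list"

# single pass over the data file; one lookup per row instead of an inner loop
while IFS=$'\t' read -r name value; do
   if [ -n "${fail_set[$name]}" ]; then
      printf 'NAME;%s\tfailed\n' "$name"
   else
      printf '%s\t%s\n' "$name" "$value"
   fi
done < "$final_temp1" > "$final_temp2"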

For lines that were printed unchanged, I was going to just echo $line,

echo -e $line >> final_temp2

This works, but I get space-delimited output and not tab. I thought that using echo -e would address that. It is almost like IFS=$'\t' read -a is converting the tabs to spaces when the line is read in. Is there a way to address that situation?
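
The read isn't converting anything; $line still contains the tabs. They get lost because the unquoted $line in echo -e $line is word-split by the shell (tab is in the default IFS) and echo then joins the resulting words with single spaces. Quoting the expansion keeps the tabs, and printf avoids echo -e portability quirks; a sketch:

# quoting preserves the embedded tabs
printf '%s\n' "$line" >> "$final_temp2"     # or: echo "$line" >> "$final_temp2"

# using   while IFS= read -r line   in the outer loop also keeps leading/trailing whitespace intact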

LMHmedchem