Remove line containing string and renumber

LMHmedchem · October 11, 2014, 7:17pm

Hello,

I have some files in a directory and a short list of strings. I want to loop through the files and remove lines containing the string and renumber.

There are some issues. The first is that the strings can contain troublesome characters like single quotes and parentheses. Here is one list of strings,

1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one
N-{(1E)-2-[4-(methylethyl)phenyl]-1-azaprop-1-enyl}-2-[(2-methylphenyl)amino]acetamide
1-acetyl-3-(5,6-dimethylisoindolin-2-yl)benzene
2-[(6-hydroxy-4,4-dimethyl-2-oxocyclohex-1(6)-enyl)(4-methylphenyl)methyl]-5,5-dimethylcyclohexane-1,3-dione
1,4,5-triphenyl-4-imidazoline-2-thione
1-(2-naphthylmethyl)-2-(naphthylmethyl)benzimidazole
1-(2-naphthyl)-2-({2-[(2-(2-naphthyl)-2-oxoethyl)piperidyl]ethyl}piperidyl)ethan-1-one_bromide_bromide
1-(2-hydroxyphenyl)-2,6-dimethyl-5-phenylhydropyrimidin-4-one
1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one
4-(1,3-dioxobenzo[c]azolidin-2-yl)-N-methyl-N-(1,2,2,6,6-pentamethyl(4-piperidyl))butanamide

It is very likely that the list will contain the same string more than once. I either need to clean that up or have the script allow for instances where the string is not found.
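As a minimal sketch of the clean-up option: if the strings are kept one per line in a file, `sort -u` drops the repeated entries. The file names `rmlist.raw` and `rmlist` are placeholders chosen for illustration, and the three entries are just a small excerpt of the list above.

```shell
#!/bin/bash
# Hypothetical file names; the list is assumed to hold one string per line.
cat > rmlist.raw <<'EOF'
1,4,5-triphenyl-4-imidazoline-2-thione
1-acetyl-3-(5,6-dimethylisoindolin-2-yl)benzene
1,4,5-triphenyl-4-imidazoline-2-thione
EOF

# sort -u keeps one copy of each distinct line (the result is in sorted order).
sort -u rmlist.raw > rmlist

# Count the entries that remain after de-duplication.
count=$(wc -l < rmlist)
echo "unique strings: $count"
```

Since the order of the strings does not matter for removal, the reordering done by `sort` should be harmless here.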

The other complexity is that the line numbering doesn't start until the 15th line of the file.

I was thinking of something like,

#!/bin/bash

REMOVE_LIST=(
             '1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one' \
             'N-{(1E)-2-[4-(methylethyl)phenyl]-1-azaprop-1-enyl}-2-[(2-methylphenyl)amino]acetamide' \
             '1-acetyl-3-(5,6-dimethylisoindolin-2-yl)benzene' \
             '2-[(6-hydroxy-4,4-dimethyl-2-oxocyclohex-1(6)-enyl)(4-methylphenyl)methyl]-5,5-dimethylcyclohexane-1,3-dione' \
             '1,4,5-triphenyl-4-imidazoline-2-thione' \
             '1-(2-naphthylmethyl)-2-(naphthylmethyl)benzimidazole' \
             '1-(2-naphthyl)-2-({2-[(2-(2-naphthyl)-2-oxoethyl)piperidyl]ethyl}piperidyl)ethan-1-one_bromide_bromide' \
             '1-(2-hydroxyphenyl)-2,6-dimethyl-5-phenylhydropyrimidin-4-one' \
             '1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one' \
             '4-(1,3-dioxobenzo[c]azolidin-2-yl)-N-methyl-N-(1,2,2,6,6-pentamethyl(4-piperidyl))butanamide'
            )

# collect list of files
FILE_LIST=($(ls  './'*'out.txt' ))

# loop on files
for FILE in ${FILE_LIST[@]}
do
   echo $FILE

   # loop on strings to remove
   for REMOVE_STRING in ${REMOVE_LIST[@]}
   do
      echo $REMOVE_STRING
      # remove string, change cp to mv when this is working
      grep -v "$REMOVE_STRING" $FILE > TEMP && mv TEMP $FILE'_tmp'
   done

done

This code works for the line removal but is rather inefficient since it has to make separate calls to grep for each item in the remove list and do that for every file. This does not have to be particularly fast, but I would prefer if it was not quite so moronic.

As for the line renumbering starting with the 15th line, I have no idea.

Suggestions would be appreciated.

---------- Post updated at 07:17 PM ---------- Previous update was at 06:22 PM ----------

This is part of one of the files. You can see that the numbering starts on the line following the second forder header line. If it helps, the numbers start on the first line that begins with a number. The forder field can have values from f0-f9. The number of columns and rows in the files varies. This example shows the first 8 columns and 10 data rows.

f0order	CVorder	Name	f0	RI_7	E99	E199	E299
NA	NA	NA	NA	R_r2	0.796	0.831	0.848
NA	NA	NA	NA	R_MeAE	88.54	80.06	76.27
NA	NA	NA	NA	R_MdAE	72.24	63.66	61.66
NA	NA	NA	NA	R_SE	104.44	96.49	92.37
NA	NA	NA	NA	T_r2	0.794	0.821	0.827
NA	NA	NA	NA	T_MeAE	108.38	105.79	99.11
NA	NA	NA	NA	T_MdAE	88.95	91.94	86.61
NA	NA	NA	NA	T_SE	107.44	105.46	104.84
NA	NA	NA	NA	V_r2	0.83	0.847	0.857
NA	NA	NA	NA	V_MeAE	108.36	103.86	97.23
NA	NA	NA	NA	V_MdAE	96.69	90.04	79.31
NA	NA	NA	NA	V_SE	102.58	103.24	102.13
f0order	CVorder	Name	f0	RI_7	E99	E199	E299
1	2	2-ethylpyridine	R	519	683	653	638
2	3	3-ethylpyridine	R	535	675	646	631
3	4	2,6-lutidine	R	506	632	614	608
4	5	2,5-lutidine	R	517	620	605	598
5	6	2,3-lutidine	R	518	612	598	592
6	7	3,4-lutidine	R	528	600	589	583
7	8	3,5-lutidine	R	532	569	560	559
8	9	2,4,6-collidine	R	544	585	586	590
9	10	4-(methylamino)pyridine	R	511	450	429	417
10	12	4-dimethylaminopyridine	R	533	500	487	481

The only thing I can think of at the moment would be to copy the first 14 lines to a temp file and then delete them. Then I would renumber the rest of the file and then cat the file back together.

LMHmedchem

Don_Cragun · October 11, 2014, 8:57pm

Do the lines in the remove list exist (unquoted) in a file? (If so, a single grep -Fvf string_file file would seem better for this problem than one grep invocation for each fixed string. But awk is probably better yet since it can do both the line removal and the renumbering.) Do any of these strings ever contain any whitespace characters?

Does the renumbering apply only to the 1st field in the lines to be renumbered? Or, does the 2nd field also need to be modified? (If so, how?)

You said that the number of rows and columns vary from file to file. Does the field to be matched also vary, or is it always the 3rd field? If it isn't always the 3rd field, is it always in a field with the string Name as the header in line 1 in that file? (An awk script will run faster if we know which field to match.)
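The single-grep idea mentioned above could look roughly like the following sketch. The file names `string_file` and `sample_out.txt` and their contents are made up for illustration.

```shell
#!/bin/bash
# Hypothetical sample files for illustration only.
printf '%s\n' '2,6-lutidine' \
    '1-(2-naphthylmethyl)-2-(naphthylmethyl)benzimidazole' > string_file
printf '1\t2\t2-ethylpyridine\tR\n2\t3\t2,6-lutidine\tR\n3\t4\t2,5-lutidine\tR\n' > sample_out.txt

# -F: treat each pattern as a fixed string, so [ ] ( ) { } are not regex syntax
# -v: print only the lines that match none of the patterns
# -f: read the patterns from string_file, one per line
grep -Fvf string_file sample_out.txt > sample_out.filtered

cat sample_out.filtered
```

One caveat with this approach: `grep -F` matches a pattern anywhere on the line, so a pattern that is a substring of a longer name would remove that line as well. Matching on a specific field, as awk can do, avoids that.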

LMHmedchem · October 11, 2014, 9:16pm

This is a script that currently works for this task. It is messy and inelegant, but I am posting it since I sometimes think that working code is a better explanation than a description given in prose, even where the code leaves a lot to be desired.

#!/bin/bash

REMOVE_LIST=(
             '1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one' \
             'N-{(1E)-2-[4-(methylethyl)phenyl]-1-azaprop-1-enyl}-2-[(2-methylphenyl)amino]acetamide' \
             '1-acetyl-3-(5,6-dimethylisoindolin-2-yl)benzene' \
             '2-[(6-hydroxy-4,4-dimethyl-2-oxocyclohex-1(6)-enyl)(4-methylphenyl)methyl]-5,5-dimethylcyclohexane-1,3-dione' \
             '1,4,5-triphenyl-4-imidazoline-2-thione' \
             '1-(2-naphthylmethyl)-2-(naphthylmethyl)benzimidazole' \
             '1-(2-naphthyl)-2-({2-[(2-(2-naphthyl)-2-oxoethyl)piperidyl]ethyl}piperidyl)ethan-1-one_bromide_bromide' \
             '1-(2-hydroxyphenyl)-2,6-dimethyl-5-phenylhydropyrimidin-4-one' \
             '1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one' \
             '4-(1,3-dioxobenzo[c]azolidin-2-yl)-N-methyl-N-(1,2,2,6,6-pentamethyl(4-piperidyl))butanamide'
            )

SET='A'
PARAM_SET='ON-0.25'
FOLD_LIST=(f0 f1 f2 f3 f4 f5 f6 f7 f8 f9)
AS_LIST=(V_mae V_se S_mae S_se)

# loop on fold list
for FOLD in ${FOLD_LIST[@]}
do
   # loop on as list
   for ANNEALING_SET in ${AS_LIST[@]}
   do

      # assign directory name
      FILE_DIR=$(ls -d './'$SET'/'$FOLD'/'$FOLD'_anneal/'$PARAM_SET'/'$ANNEALING_SET)
      # collect list of files
      FILE_LIST=($(ls $FILE_DIR'/'*'out.txt' ))

      # loop on files
      for FILE in ${FILE_LIST[@]}
      do
         echo $FILE

         # loop on strings to remove
         for REMOVE_STRING in ${REMOVE_LIST[@]}
         do
            echo $REMOVE_STRING
            # remove lines containing the string (fixed-string match via -F)
            grep -F -v "$REMOVE_STRING" $FILE > TEMP && mv TEMP $FILE
         done

         # re number data rows
         # copy first 14 lines to temp file
         sed 14q $FILE > './'$FILE_DIR'/'headers.txt
         # copy remaining lines to temp file
         sed -n '15,$p' $FILE > './'$FILE_DIR'/'data.txt
         # add new line numbers to data block
         nl './'$FILE_DIR'/'data.txt > './'$FILE_DIR'/'TEMP
         mv './'$FILE_DIR'/'TEMP  './'$FILE_DIR'/'data.txt
         # remove old numbering column
         cut './'$FILE_DIR'/'data.txt -f1,3- > './'$FILE_DIR'/'TEMP
         mv './'$FILE_DIR'/'TEMP './'$FILE_DIR'/'data.txt
         # recombine headers with data
         cat './'$FILE_DIR'/'headers.txt  './'$FILE_DIR'/'data.txt > $FILE
         # cleanup
         rm './'$FILE_DIR'/'headers.txt;  rm './'$FILE_DIR'/'data.txt

      done
   done
done

To answer your questions, the strings do not exist in a file, but that would be easy enough to create if it would be helpful.

Only the first column of the data section needs to be renumbered, the numbers in column 2 can remain as is.

The remove string will always refer to the 3rd field, and this will always be the name column.

The headers are on line 1 and then are duplicated on line 14.

There could be cases in the future where there are more than 14 rows before the data begins. In all cases, you could look for the second instance of the header row (some row that matches row 1) to know that the data starts on the next row.
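The "second instance of the header row" rule can be sketched with a short awk command that reports the line number where the data begins. The sample file below is invented and much shorter than a real one.

```shell
#!/bin/bash
# Hypothetical miniature file: header, one summary row, header again, then data.
printf 'f0order\tCVorder\tName\nNA\tNA\tNA\nf0order\tCVorder\tName\n1\t2\t2-ethylpyridine\n2\t3\t3-ethylpyridine\n' > sample_out.txt

# Remember line 1; when a later line is identical to it, the data starts
# on the following line, so print that line number and stop reading.
data_start=$(awk 'NR == 1 { h = $0; next } $0 == h { print NR + 1; exit }' sample_out.txt)

echo "data starts on line $data_start"
```

This assumes the repeated header is an exact copy of line 1, including whitespace.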

LMHmedchem

Don_Cragun · October 12, 2014, 3:51am

You could try something like this as a replacement:

#!/bin/bash
# Initialize variables:
AS_LIST=(V_mae V_se S_mae S_se)
FOLD_LIST=(f0 f1 f2 f3 f4 f5 f6 f7 f8 f9)
PARAM_SET='ON-0.25'
REMOVE_LIST=(
	'1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one'
	'N-{(1E)-2-[4-(methylethyl)phenyl]-1-azaprop-1-enyl}-2-[(2-methylphenyl)amino]acetamide'
	'1-acetyl-3-(5,6-dimethylisoindolin-2-yl)benzene'
	'2-[(6-hydroxy-4,4-dimethyl-2-oxocyclohex-1(6)-enyl)(4-methylphenyl)methyl]-5,5-dimethylcyclohexane-1,3-dione'
	'1,4,5-triphenyl-4-imidazoline-2-thione'
	'1-(2-naphthylmethyl)-2-(naphthylmethyl)benzimidazole'
	'1-(2-naphthyl)-2-({2-[(2-(2-naphthyl)-2-oxoethyl)piperidyl]ethyl}piperidyl)ethan-1-one_bromide_bromide'
	'1-(2-hydroxyphenyl)-2,6-dimethyl-5-phenylhydropyrimidin-4-one'
	'1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one'
	'4-(1,3-dioxobenzo[c]azolidin-2-yl)-N-methyl-N-(1,2,2,6,6-pentamethyl(4-piperidyl))butanamide'
)
SET='A'

# loop on fold list
for FOLD in ${FOLD_LIST[@]}
do
	# loop on as list
	for AS in ${AS_LIST[@]}
	do

		# assign directory name
		FILE_DIR="./$SET/$FOLD/${FOLD}_anneal/$PARAM_SET/$AS"
		# loop on files
		for FILE in "$FILE_DIR"/*out.txt
		do	echo "$FILE"
			printf '%s\n' "${REMOVE_LIST[@]}" | awk '
				BEGIN {	FS = OFS = "\t"
				}
				FNR == NR {
					# Gather remove list...
					rl[$0]
					next
				}
				FNR == 1 {
					# Get header from 2nd file.
					h = $0
					hc = 2
				}
				# Copy input lines until we have copied the
				# header line twice...
				hc {	if(h == $0) {
						# Decrement the # of times we
						# need to print the header...
						hc--
					}
					print
					next
				}
				# Skip lines with Name (field 3) in remove list.
				$3 in rl {
					next
				}
				{	# Renumber remaining lines.
					$1 = ++oc
				}
				1	# Print renumbered lines.
			' - "$FILE" > "$FILE"_ && mv "$FILE"_ "$FILE"
		done
	done
done

Scrutinizer · October 12, 2014, 2:02pm

Alternatively, if the numbering always starts on the 15th line, try:

awk 'NR==FNR{A[$1]; next} FNR==1{c=1; close(f); f=FILENAME ".new"} $3 in A{next} FNR>14{sub($1,c++)} {print>f}' rmlist *out.txt

where rmlist contains:

1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one
N-{(1E)-2-[4-(methylethyl)phenyl]-1-azaprop-1-enyl}-2-[(2-methylphenyl)amino]acetamide
1-acetyl-3-(5,6-dimethylisoindolin-2-yl)benzene
2-[(6-hydroxy-4,4-dimethyl-2-oxocyclohex-1(6)-enyl)(4-methylphenyl)methyl]-5,5-dimethylcyclohexane-1,3-dione
1,4,5-triphenyl-4-imidazoline-2-thione
1-(2-naphthylmethyl)-2-(naphthylmethyl)benzimidazole
1-(2-naphthyl)-2-({2-[(2-(2-naphthyl)-2-oxoethyl)piperidyl]ethyl}piperidyl)ethan-1-one_bromide_bromide
1-(2-hydroxyphenyl)-2,6-dimethyl-5-phenylhydropyrimidin-4-one
1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one
4-(1,3-dioxobenzo[c]azolidin-2-yl)-N-methyl-N-(1,2,2,6,6-pentamethyl(4-piperidyl))butanamide

Afterwards, the new files will have a .new extension.

Aia · October 12, 2014, 2:47pm

In case you are interested, here is a procedural Perl solution that can be expanded to do more processing if needed.

#!/usr/bin/perl

use strict;
use warnings;

# the first file will contain patterns to match; stop the process if no file is given
my $pattern_file = shift or die "A file with search patterns must be given\n";
open my $match_lines, '<', $pattern_file or die "Could not open $pattern_file: $!\n";

# index every line as a pattern to search
my %search = map { chomp; $_, 1 } <$match_lines>;
close $match_lines;

# other files to work on must be given. Process them one by one
while (@ARGV) {

	my $n; # counter to re-number lines 15 and above
	my $filename_in = shift; # get next filename to work on 
	my $filename_out = "new_" . $filename_in; # fabricate next output filename
	
	# create corresponding input and output file handles
	open my $cur_file_in, '<', "$filename_in" or die "Could not open $filename_in: $!\n";
	open my $cur_file_out, '>', "$filename_out" or die "Could not create $filename_out: $!\n";

	# feedback to stdout
	print "processing file $filename_in to $filename_out... ";

	# processing lines from current input file
	while (my $line = <$cur_file_in>) {
		
		# split the line into fields separated by one or more whitespace characters
		my @fields = split /\s+/, $line;

		# explicitly allow undefined values (e.g. a missing third field)
		no warnings 'uninitialized';

		# keep any lines where the pattern is not found in the third field
		if (not $search{$fields[2]}) {

			# kept lines after line 14 get a new sequence number (blank lines are left alone)
			if ($. > 14 and @fields) {
				$fields[0] = ++$n;
			}
			# write the result, re-joined with tabs to match the tab-separated input
			print $cur_file_out join("\t", @fields), "\n";
		}
	}
	$n = 0; # clear the renumbering counter for next file
	# ready to recycle
	close $cur_file_in;
	close $cur_file_out;
	
	# feedback to stdout
	print "[done]\n";
}

Usage

perl prog.pl rlist foo.*

LMHmedchem · October 13, 2014, 7:56pm

Well the script I posted works, but it takes more than 38 minutes to run on the directory tree I tested. The test directory has 2000 files spread over 40 directories with 2446 lines per file and a remove list of 28 strings.

The script posted by Don Cragun finished in,
real 1m50.953s
user 3m0.410s
sys 0m49.394s

I like the fact that this script looks for the second instance of the header row to know where to start the renumbering. There are versions of my data where that would be useful.

For the Perl script posted by Aia, these files are in a directory structure of 40 different directories. I don't know perl well enough to set up the looping to troll through all of that and test the script. It seems as if I would have to run my script and then call your script and pass $FILE_LIST along with the file with the patterns to remove. Is that right?

From the post by Scrutinizer, I also do not quite see how to loop through my directory structure and create lists of files to pass to the code.

Once again, it is amazing how much better a well formed script will perform.

LMHmedchem

Scrutinizer · October 13, 2014, 10:43pm

With the awk suggestion:

$ cat rmlist.awk

NR==FNR{
  A[$1]
  next
}

FNR==1 { 
  c=1
  close(f)
  f=FILENAME ".new"
} 

$3 in A {
  next
}
 
FNR>14 {
  sub($1,c++)
}

{
  print>f
}

You could try this example of how files that are in a directory hierarchy could be processed:

 find . -name '*out.txt' -exec awk -f rmlist.awk rmlist.dat {} +
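Because the awk script writes its output to files ending in `.new`, a follow-up step is needed if the results should replace the originals. The following is one possible sketch; the `demo` directory tree is invented for illustration, and the originals are overwritten, so it may be wise to try it on a copy first.

```shell
#!/bin/bash
# Hypothetical tree for illustration.
mkdir -p demo/A/f0
printf 'old content\n' > demo/A/f0/run1_out.txt
printf 'new content\n' > demo/A/f0/run1_out.txt.new

# For each *.new file, strip the .new suffix to get the original name
# and move the new file over the original.
find demo -name '*out.txt.new' -exec sh -c '
  for new in "$@"; do
    mv "$new" "${new%.new}"
  done
' sh {} +

cat demo/A/f0/run1_out.txt
```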

Aia · October 13, 2014, 11:15pm

I don't know how you collect the files you want to process. I wrote the script to accept any sequence of paths, given individually or as a glob. The only requirement is that the first argument is a file containing the patterns to search for.

LMHmedchem · October 22, 2014, 5:24pm

I have run into a problem with one of my patterns,
spiro[2,3-dihydrobenzothiazole-2,1'-cyclohexane]

The script doesn't like this because of the unmatched single quote. I have been putting all the patterns in single quotes to deal with special characters. I don't think I can escape this single quote because the strings are single quoted already.

This is the current list of patterns,

REMOVE_LIST=(
             '(2-iminochromen-3-yl)-N-(4-methoxyphenyl)carboxamide'
             '(4S,5R)-2,4,5-tri(4-pyridyl)-2-imidazoline'
             '[4-(tert-butyl)phenyl]-N-(4-piperidylphenyl)carboxamide'
             '{[(1E)-2-(4-phenylphenyl)-1-azaprop-1-enyl]amino}[(4-methylphenyl)amino]methane-1-thione'
             '1-(4-methylpentyl)-5-(pyrrol-2-ylmethylene)-1,3-dihydropyrimidine-2,4,6-trione'
             '1,2-bis(2,5-dimethyl-3-thienyl)ethane-1,2-dione'
             '1-methyl-2-(3-(2-quinolyl)(4-quinolyl))benzimidazole'
             '2-((1E,4E)-3,5-di(2-thienyl)-2,4-diazapenta-1,4-dienyl)thiophene'
             '2,3,3-trimethyl-3H-pyrrolo[3,2-h]quinoline'
             '2-[6-((4aS,9bR)-2,8-dimethylpiperidino[4,3-b]indolin-5-yl)-6-oxohexyl]benzo[c]azoline-1,3-dione'
             '2-piperidylphenol'
             '3-(2H,3H-benzo[3,4-e]1,4-dioxan-6-ylazamethylene)isoindolylamine'
             '3,4-diphenyl-2-(phenylazamethylene)-1,3-thiazoline'
             '3-[(2,4-dimethylphenyl)amino]-1-phenylpropan-1-one'
             '3-pyridyl_(2E)-3-phenylprop-2-enoate'
             '4-[(2S,6S)-2,6-bis(2-methylprop-2-enyl)-4-1,2,5,6-tetrahydropyridyl]hepta-1,6-dien-4-ylamine'
             '5-acetyl-1,3-diphenyl-2-pyrazoline'
             '5-dodecyl-1,3,5-triazaperhydroine-2-thione'
             '6,9-dimethyl-2,3,4,9-tetrahydro-4aH-carbazol-1-one'
             'bitolterol'
             'ethyl_2-[3-(ethoxycarbonyl)-2-pyridyl]pyridine-3-carboxylate'
             'methyl_3-[(3-imino-2H-benzo[c]azolidinylidene)azamethyl]benzoate'
             'N-[(1-butyl-2-oxobenzo[d]azolin-3-ylidene)azamethyl]-2-(phenylamino)acetamide'
             'spiro[2,3-dihydrobenzothiazole-2,1'-cyclohexane]'
            )

Any ideas about how to deal with this?

LMHmedchem

Don_Cragun · October 22, 2014, 5:57pm

Change the line:

             'spiro[2,3-dihydrobenzothiazole-2,1'-cyclohexane]'

to:

             "spiro[2,3-dihydrobenzothiazole-2,1'-cyclohexane]"

LMHmedchem · October 22, 2014, 6:19pm

I actually tried,

"spiro[2,3-dihydrobenzothiazole-2,1\'-cyclohexane]"

because I thought I would have to escape the single quote.

The escaped version did work, but is it generally better to use double quotes for a situation like this with a lot of potential special characters?

LMHmedchem

Don_Cragun · October 22, 2014, 6:50pm

You don't have to escape single quotes inside double quotes and you don't have to escape double quotes inside single quotes.
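A quick check of this in bash, using the pattern from the question: inside double quotes a backslash before a single quote is not removed, so the escaped form ends up one character longer than the plain form and would not be identical to the name as it appears in the data. The variable names are made up for this sketch.

```shell
#!/bin/bash
# Plain single quote inside double quotes: no escaping needed.
plain="spiro[2,3-dihydrobenzothiazole-2,1'-cyclohexane]"
# Backslash before the single quote: the backslash stays in the string.
escaped="spiro[2,3-dihydrobenzothiazole-2,1\'-cyclohexane]"

printf '%s\n' "$plain" "$escaped"
printf 'lengths: %d %d\n' "${#plain}" "${#escaped}"
```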

Choosing whether to use single quotes or double quotes depends on what the strings you're quoting contain.

If you want parameter expansion, command substitution, or arithmetic expansion in your strings, or if you have single quotes in your strings, use double quotes. If your strings contain double quotes, or contain dollar signs that the shell could take for parameter expansions, command substitutions, or arithmetic expansions but that you want to appear as literal (unexpanded, unsubstituted) character sequences, use single quotes. If your strings don't contain any quotes or dollar signs, single quoting may be slightly faster because the shell doesn't have to look for dollar signs as it evaluates the string.

If you have both single quotes and double quotes in your strings, or if you want to expand or substitute some, but not all, parameter expansions, command substitutions, and arithmetic expansions, concatenate differently quoted strings. For instance, if you want the string "'" (a single quote between two double quotes) you can use:

'"'"'"'"'

and to expand one variable and leave another unexpanded:

"$ExpandMe"'$DoNotExpandMe'
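The concatenation rules above can be tried out with a short sketch. The `'\''` form in the third assignment is an extra illustration that is not mentioned above: it closes the single-quoted string, adds a backslash-escaped single quote, and reopens the single-quoted string. The variable names `s1`, `s2`, and `s3` are made up.

```shell
#!/bin/bash
ExpandMe='expanded'

# Three quoted pieces joined together: '"' then "'" then '"'
s1='"'"'"'"'
# First piece is expanded, second piece is kept literally.
s2="$ExpandMe"'$DoNotExpandMe'
# A single quote placed between two single-quoted pieces.
s3='spiro[2,3-dihydrobenzothiazole-2,1'\''-cyclohexane]'

printf '%s\n' "$s1" "$s2" "$s3"
```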