Bash to print if keyword not in file

I am trying to create an output file, new, that contains only the S5-00580 lines from list that are not in analysis_log. My attempt is below.

The new file would be used in the aria2c command to download only new folders. The aria2c command already works for downloading all the files in list, but any lines that already exist in analysis_log can be skipped.

Also, all S5-00580-17-Medexome lines are matched because that text is in them, and I cannot figure out how to ignore lines that have an extra keyword in them (test, for example). Basically, I want to exclude all lines that do not end with Medexome.tar.bz2. Thank you :).

diff -u list analysis_log  | sed -nr 's/^+([^S5-].*)/\1/p' > new

list

http://xxx.xx.xxx.xxx/output/Home/Auto_user_S5-00580-19-Medexome_122_059/plugin_out/FileExporter_out.137/R_2016_12_09_14_01_11_user_S5-00580-19-Medexome.tar.bz2
http://xxx.xx.xxx.xxx/output/Home/Auto_user_S5-00580-18-Medexome_121_057/plugin_out/FileExporter_out.134/R_2016_12_09_11_18_52_user_S5-00580-18-Medexome.tar.bz2
http://xxx.xx.xxx.xxx/output/Home/Auto_S5-00580-17-Medexome_5224_9680c70_120_056/plugin_out/FileExporter_out.125/R_2016_12_07_12_25_50_S5-00580-17-Medexome_5224_9680c70.tar.bz2
http://xxx.xx.xxx.xxx/output/Home/Auto_user_S5-00580-17-Medexome_119_054/plugin_out/FileExporter_out.122/R_2016_12_05_13_30_48_user_S5-00580-17-Medexome.tar.bz2
http://xxx.xx.xxx.xxx/output/Home/Auto_user_S5-00580-16-Medexome_118_052/plugin_out/FileExporter_out.119/R_2016_12_05_10_45_37_user_S5-00580-16-Medexome.tar.bz2

analysis_log

R_2016_11_18_10_45_10_user_S5-00580-17-Medexome
R_2016_11_18_13_19_32_user_S5-00580-16-Medexome

# verify new files with list call
line_no=$(awk '{x++} END {print x}' /home/cmccabe/s5_files/downloads/new) # count new files and store as variable
if [[ -s /home/cmccabe/s5_files/downloads/new ]]; then
    echo "starting download of $line_no new S5 sequencing run"
else
    echo " no new files to analyze, goodbye "
    exit 1
fi

# download every URL listed in new
while read -r new; do
    echo "$new"
    aria2c -x8 -l /home/cmccabe/log.txt -c -d /home/cmccabe/Desktop/NGS/API --use-head=true --http-user "xxxx" --http-passwd xxxx "$new"
done < /home/cmccabe/s5_files/downloads/new
rm /home/cmccabe/s5_files/downloads/list
rm /home/cmccabe/s5_files/downloads/new

Desired output of new: only these two lines are printed, because their S5-00580 run names are not in analysis_log.

http://xxx.xx.xxx.xxx/output/Home/Auto_user_S5-00580-19-Medexome_122_059/plugin_out/FileExporter_out.137/R_2016_12_09_14_01_11_user_S5-00580-19-Medexome.tar.bz2
http://xxx.xx.xxx.xxx/output/Home/Auto_user_S5-00580-18-Medexome_121_057/plugin_out/FileExporter_out.134/R_2016_12_09_11_18_52_user_S5-00580-18-Medexome.tar.bz2

Hi, try:

awk -F_ 'NR==FNR{A[$NF]; next} {p=1; for(i in A) { if($0~i || $0!~"Medexome\.tar\.bz2") p=0}}p' analysis_log list > new

I ran it with my full paths:

awk -F_ 'NR==FNR{A[$NF]; next} {p=1; for(i in A) { if($0~i || $0!~"Medexome\.tar\.bz2") p=0}}p' /home/cmccabe/analysis_log /home/cmccabe/files/downloads/list > /home/cmccabe/files/downloads/new

awk: cmd. line:1: warning: escape sequence `\.' treated as plain `.'

A new file does get created with the lines that are not in analysis_log, so it appears to be working; I am just not sure what the message means (it seems like "Medexome\.tar\.bz2" is causing it). Thank you very much :).

---------- Post updated at 06:58 AM ---------- Previous update was at 06:54 AM ----------

If I remove the \ , I get no message... but are they needed? Thank you :).

So it is "Medexome.tar.bz2"

Hello cmccabe,

As the message says, it is only a warning, so the program's result will not be affected. Of course, since the message says that \. is being treated as a plain . anyway, you could change \. to . yourself; it shouldn't affect the code (though I didn't try it).

Thanks,
R. Singh


Thank you both :).

Hi, cmccabe, yes they are needed. My suggestion was slightly incorrect. Try this instead:

awk -F_ 'NR==FNR{A[$NF]; next} {p=1; for(i in A) { if($0~i || $0!~/Medexome\.tar\.bz2/) p=0}}p' analysis_log list

This is a "regex constant" expression.

The double quotes (regex string) would be possible as well, but the dots then would need an extra escape:

awk -F_ 'NR==FNR{A[$NF]; next} {p=1; for(i in A) { if($0~i || $0!~"Medexome\\.tar\\.bz2") p=0}}p' analysis_log list

\. is necessary, since a bare . means "any character" instead of a literal dot. The unescaped pattern would probably work too, but in theory it could give a false positive, for example by also matching a name like MedexomeXtarXbz2.
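
To make the difference concrete, here is a small sketch. The temporary file and the second test string (a made-up near miss with no literal dots) are assumptions for illustration only, not data from this thread.

```shell
# Hypothetical test data: one real-looking file name and one near miss.
sample=$(mktemp)
printf '%s\n' 'run_S5-00580-19-Medexome.tar.bz2' \
              'run_S5-00580-19-MedexomeXtarXbz2' > "$sample"

echo "unescaped dots (any character), both lines match:"
awk '/Medexome.tar.bz2/' "$sample"

echo "regex constant with escaped dots, only the real name matches:"
awk '/Medexome\.tar\.bz2/' "$sample"

echo "regex string, backslashes doubled so one survives string parsing:"
awk '$0 ~ "Medexome\\.tar\\.bz2"' "$sample"
```

The last two commands should select the same line; the regex constant form simply needs one less level of escaping.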


Thank you very much for your help :).

For plain string matches, consider the index function:

if (index($0,i) || ! index($0,"Medexome.tar.bz2")) p=0
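
Put together, the index variant might look like the sketch below. It runs in a scratch directory on abbreviated copies of the sample data from the question (the URL paths are shortened here for readability, and the variable names seen, name and keep are my own choice). Treat it as one possible layout under those assumptions rather than a tested drop-in for the real files.

```shell
# Work in a scratch directory so real files are not overwritten.
cd "$(mktemp -d)" || exit 1

# Abbreviated copies of the sample data from the question (paths shortened).
cat > analysis_log <<'EOF'
R_2016_11_18_10_45_10_user_S5-00580-17-Medexome
R_2016_11_18_13_19_32_user_S5-00580-16-Medexome
EOF

cat > list <<'EOF'
http://xxx.xx.xxx.xxx/out/R_2016_12_09_14_01_11_user_S5-00580-19-Medexome.tar.bz2
http://xxx.xx.xxx.xxx/out/R_2016_12_09_11_18_52_user_S5-00580-18-Medexome.tar.bz2
http://xxx.xx.xxx.xxx/out/R_2016_12_07_12_25_50_S5-00580-17-Medexome_5224_9680c70.tar.bz2
http://xxx.xx.xxx.xxx/out/R_2016_12_05_13_30_48_user_S5-00580-17-Medexome.tar.bz2
http://xxx.xx.xxx.xxx/out/R_2016_12_05_10_45_37_user_S5-00580-16-Medexome.tar.bz2
EOF

awk -F_ '
    NR == FNR { seen[$NF]; next }             # analysis_log: remember the last "_" field
    {
        keep = index($0, "Medexome.tar.bz2")  # 0 when that text is absent
        for (name in seen)
            if (index($0, name)) keep = 0     # run name is already in analysis_log
    }
    keep                                      # print lines that are still marked
' analysis_log list > new

cat new
```

Because index compares plain strings, neither the dots in Medexome.tar.bz2 nor any characters in the run names need escaping. On this sample, only the -19- and -18- lines end up in new.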