sed to remove all lines in file that do not have the .vcf.gz extension

I am trying to use sed to remove all lines in a file that are not vcf.gz. The sed below runs but returns all the files with vcf.gz in them, rather than just the ones that end in only that extension. Thank you :).

file

/output/Home/Auto_user_S5-00580-5-Medexome_66_030/plugin_out/variantCaller_out.40/IonXpress_007/TSVC_variants_IonXpress_007.vcf.gz
/output/Home/Auto_user_S5-00580-5-Medexome_66_030/plugin_out/variantCaller_out.40/IonXpress_007/TSVC_variants_IonXpress_007.vcf.gz.tbi
/output/Home/Auto_user_S5-00580-5-Medexome_66_030/plugin_out/variantCaller_out.40/IonXpress_007/TSVC_variants_IonXpress_007.genome.vcf.gz
/output/Home/Auto_user_S5-00580-5-Medexome_66_030/plugin_out/variantCaller_out.40/IonXpress_007/TSVC_variants_IonXpress_007.genome.vcf.gz.tbi
/output/Home/Auto_user_S5-00580-5-Medexome_66_030/plugin_out/variantCaller_out.40/IonXpress_008/TSVC_variants_IonXpress_008.vcf.gz
/output/Home/Auto_user_S5-00580-5-Medexome_66_030/plugin_out/variantCaller_out.40/IonXpress_008/TSVC_variants_IonXpress_008.vcf.gz.tbi
/output/Home/Auto_user_S5-00580-5-Medexome_66_030/plugin_out/variantCaller_out.40/IonXpress_008/TSVC_variants_IonXpress_008.genome.vcf.gz
/output/Home/Auto_user_S5-00580-5-Medexome_66_030/plugin_out/variantCaller_out.40/IonXpress_008/TSVC_variants_IonXpress_008.genome.vcf.gz.tbi

desired output

/output/Home/Auto_user_S5-00580-5-Medexome_66_030/plugin_out/variantCaller_out.40/IonXpress_007/TSVC_variants_IonXpress_007.vcf.gz
/output/Home/Auto_user_S5-00580-5-Medexome_66_030/plugin_out/variantCaller_out.40/IonXpress_008/TSVC_variants_IonXpress_008.vcf.gz

sed

sed -i '/.vcf.gz/!d' file

sed '/.vcf.gz$/!d' file
/output/Home/Auto_user_S5-00580-5-Medexome_66_030/plugin_out/variantCaller_out.40/IonXpress_007/TSVC_variants_IonXpress_007.vcf.gz
/output/Home/Auto_user_S5-00580-5-Medexome_66_030/plugin_out/variantCaller_out.40/IonXpress_007/TSVC_variants_IonXpress_007.genome.vcf.gz
/output/Home/Auto_user_S5-00580-5-Medexome_66_030/plugin_out/variantCaller_out.40/IonXpress_008/TSVC_variants_IonXpress_008.vcf.gz
/output/Home/Auto_user_S5-00580-5-Medexome_66_030/plugin_out/variantCaller_out.40/IonXpress_008/TSVC_variants_IonXpress_008.genome.vcf.gz

Again, your spec is incorrect; above is the output that the spec, as written, yields from your input.

Hello cmccabe,

The desired output you have shown does not look like it keeps only the records that have vcf.gz at the end; if that were the rule, two more records (lines 3 and 7 of your input) would also appear in your output. If you do want the output as I describe it, you could try the following with sed.

sed -n '/.vcf.gz$/p'   Input_file

Also, the sed -i option writes the output back into the input file itself, so please be careful when using it.
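
A minimal sketch of a more cautious way to use -i, assuming a sed that accepts a backup suffix attached directly to -i (GNU and BSD sed both do). The temporary directory and the shortened sample paths are made up for the demonstration.

```shell
# Hypothetical scratch area and shortened paths, for illustration only.
tmpdir=$(mktemp -d)
list="$tmpdir/file"
printf '%s\n' \
  'a/TSVC_variants_IonXpress_007.vcf.gz' \
  'a/TSVC_variants_IonXpress_007.vcf.gz.tbi' > "$list"

# -i.bak rewrites "$list" in place and keeps the original as "$list.bak".
sed -i.bak '/\.vcf\.gz$/!d' "$list"

kept=$(cat "$list")                    # filtered content
backup_lines=$(wc -l < "$list.bak")    # line count of the untouched backup
```

If the result is not what was wanted, the original can be restored by copying file.bak back over file.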

Thanks,
R. Singh

I suppose a sideways thought on this would be "How are you creating the list?"

If it is a find then you could add a bit that says -name "*.vcf.gz" as in:-

find /output/Home -name "*.vcf.gz"
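
If the list does come from find, note that a plain -name test on the .vcf.gz suffix would still pick up the .genome.vcf.gz files. A sketch of excluding them with a second, negated -name test follows; the temporary directory stands in for /output/Home and the file names are shortened, both purely for illustration.

```shell
# Hypothetical directory tree standing in for /output/Home.
root=$(mktemp -d)
mkdir -p "$root/IonXpress_007"
touch "$root/IonXpress_007/TSVC_variants_IonXpress_007.vcf.gz" \
      "$root/IonXpress_007/TSVC_variants_IonXpress_007.vcf.gz.tbi" \
      "$root/IonXpress_007/TSVC_variants_IonXpress_007.genome.vcf.gz" \
      "$root/IonXpress_007/TSVC_variants_IonXpress_007.genome.vcf.gz.tbi"

# Keep names ending in .vcf.gz, but not those ending in .genome.vcf.gz.
found=$(find "$root" -type f -name '*.vcf.gz' ! -name '*.genome.vcf.gz')
```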

I hope that this helps, or at least doesn't get in the way.

Robin

I cannot seem to remove the .genome.vcf.gz lines from the output. Thank you :).

file

/output/Home/Auto_user_S5-00580-5-Medexome_66_030/plugin_out/variantCaller_out.40/IonXpress_007/TSVC_variants_IonXpress_007.vcf.gz
IonXpress_007
/output/Home/Auto_user_S5-00580-5-Medexome_66_030/plugin_out/variantCaller_out.40/IonXpress_007/TSVC_variants_IonXpress_007.vcf.gz.tbi
IonXpress_007
/output/Home/Auto_user_S5-00580-5-Medexome_66_030/plugin_out/variantCaller_out.40/IonXpress_007/TSVC_variants_IonXpress_007.genome.vcf.gz
/output/Home/Auto_user_S5-00580-5-Medexome_66_030/plugin_out/variantCaller_out.40/IonXpress_008/TSVC_variants_IonXpress_008.vcf.gz
IonXpress_007
/output/Home/Auto_user_S5-00580-5-Medexome_66_030/plugin_out/variantCaller_out.40/IonXpress_008/TSVC_variants_IonXpress_008.vcf.gz.tbi
IonXpress_007
/output/Home/Auto_user_S5-00580-5-Medexome_66_030/plugin_out/variantCaller_out.40/IonXpress_008/TSVC_variants_IonXpress_008.genome.vcf.gz

output

/output/Home/Auto_user_S5-00580-5-Medexome_66_030/plugin_out/variantCaller_out.40/IonXpress_007/TSVC_variants_IonXpress_007.vcf.gz
/output/Home/Auto_user_S5-00580-5-Medexome_66_030/plugin_out/variantCaller_out.40/IonXpress_007/TSVC_variants_IonXpress_007.genome.vcf.gz
/output/Home/Auto_user_S5-00580-5-Medexome_66_030/plugin_out/variantCaller_out.40/IonXpress_008/TSVC_variants_IonXpress_008.vcf.gz
/output/Home/Auto_user_S5-00580-5-Medexome_66_030/plugin_out/variantCaller_out.40/IonXpress_008/TSVC_variants_IonXpress_008.genome.vcf.gz

desired output

/output/Home/Auto_user_S5-00580-5-Medexome_66_030/plugin_out/variantCaller_out.40/IonXpress_007/TSVC_variants_IonXpress_007.vcf.gz
/output/Home/Auto_user_S5-00580-5-Medexome_66_030/plugin_out/variantCaller_out.40/IonXpress_008/TSVC_variants_IonXpress_008.vcf.gz

Hello cmccabe,

If the wanted file names always end in a digit followed by .vcf.gz (e.g. _007.vcf.gz or _008.vcf.gz), then the following may help.

sed -n '/[0-9].vcf.gz$/p'   Input_file

OR

awk '($0 ~ /[0-9].vcf.gz$/)'   Input_file
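
One caveat with the patterns in this thread: an unescaped . in a regular expression matches any character, not only a literal dot. It makes no difference on the sample data, but a small sketch shows where the two forms part ways. The third sample line is an invented name, included only to expose the difference.

```shell
# Shortened sample paths; the last one is hypothetical and has no real dots.
sample='/x/TSVC_variants_IonXpress_007.vcf.gz
/x/TSVC_variants_IonXpress_007.genome.vcf.gz
/x/TSVC_variants_IonXpress_007_vcf_gz'

# Unescaped dots: "." matches any character, so the "_vcf_gz" line passes too.
loose=$(printf '%s\n' "$sample" | sed -n '/[0-9].vcf.gz$/p')

# Escaped dots: only a literal ".vcf.gz" preceded by a digit matches.
strict=$(printf '%s\n' "$sample" | sed -n '/[0-9]\.vcf\.gz$/p')
```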

Thanks,
R. Singh

Please read your post #1 carefully. WHERE did you specify THAT?
Everyone who answered ran in the wrong direction first!

sed  '/.vcf.gz$/!d;/genome/d' file
/output/Home/Auto_user_S5-00580-5-Medexome_66_030/plugin_out/variantCaller_out.40/IonXpress_007/TSVC_variants_IonXpress_007.vcf.gz
/output/Home/Auto_user_S5-00580-5-Medexome_66_030/plugin_out/variantCaller_out.40/IonXpress_008/TSVC_variants_IonXpress_008.vcf.gz

Yes, either that or, if you want to filter out specifically the files *genome.vcf.gz , then:

sed '/vcf.gz$/!d;/genome.vcf.gz$/d' /path/to/input
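
To see the two steps at work, here is a small sketch on input shaped like the second sample, with the dots escaped and each command passed through its own -e. The temporary file and the shortened paths are assumptions made for the demonstration.

```shell
# Hypothetical input file with shortened paths, mimicking the second sample.
input=$(mktemp)
printf '%s\n' \
  '/x/IonXpress_007/TSVC_variants_IonXpress_007.vcf.gz' \
  'IonXpress_007' \
  '/x/IonXpress_007/TSVC_variants_IonXpress_007.vcf.gz.tbi' \
  '/x/IonXpress_007/TSVC_variants_IonXpress_007.genome.vcf.gz' \
  '/x/IonXpress_008/TSVC_variants_IonXpress_008.vcf.gz' \
  '/x/IonXpress_008/TSVC_variants_IonXpress_008.genome.vcf.gz' > "$input"

# Step 1: delete every line that does NOT end in .vcf.gz
#         (drops the bare IonXpress_007 line and the .tbi line).
# Step 2: of what is left, delete the lines ending in .genome.vcf.gz.
result=$(sed -e '/\.vcf\.gz$/!d' -e '/\.genome\.vcf\.gz$/d' "$input")
```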

I hope this helps.

bakunin

Thank you all :). I apologize for not being more clear in my post and will ensure that I am in the future, thanks again.