Python script for extracting data using two files

Hello,
I have two files.
File 1 is a list of IDs of interest:

Ex1
Ex2
Ex3

File 2 is the original file, with over 8000 columns and 20 million rows. It is gzip-compressed (.gz):

Ex1 xx xx xx xx ....
Ex2 xx xx xx xx ....
Ex2 xx xx xx xx ....

Now I need to extract the rows of File 2 for all the IDs of interest listed in File 1. I have a script that should do that:

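import argparse
import gzip

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--file', action='store', dest='file', help='FILE2')
    parser.add_argument('--IDs', action='store', dest='ids', help='FILE1')
    parser.add_argument('--header', action='store_true', dest='header',
                        help='set this flag if FILE1 has a header line')
    args = parser.parse_args()

    # Read the IDs of interest into a set for fast membership tests.
    idfile = open(args.ids, 'r')
    if args.header:
        next(idfile)  # skip the header line of the ID file
    ids = set(s.rstrip() for s in idfile)
    idfile.close()

    file = gzip.open(args.file, 'rb')
    # The output name drops the last 7 characters of the input file name.
    oname = args.file[:-7] + 'result.txt'
    o = open(oname, 'w')
    o.write(next(file))  # copy the first line of the data file as the header
    for l in file:
        # The ID is the first tab-separated field of each row.
        tmp = l.split('\t', 1)
        if tmp[0].rstrip() in ids:
            o.write(l)
    o.close()
    file.close()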

But I get the error below, which I don't understand, because this script was run on the same file before and it worked. I'm not sure what is going on here. Can anyone help?

File "extract.py", line 24, in <module>
    for l in file:
  File "/usr/lib64/python2.7/gzip.py", line 450, in readline
    c = self.read(readsize)
  File "/usr/lib64/python2.7/gzip.py", line 256, in read
    self._read(readsize)
  File "/usr/lib64/python2.7/gzip.py", line 307, in _read
    uncompress = self.decompress.decompress(buf)
zlib.error: Error -3 while decompressing: invalid block type

Is it possible that your gzipped file is corrupt?

I don't think so, as this worked before. Is there any way I could find out whether the file is corrupt?

Try to gunzip it from the command line.
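
If you would rather check from Python, a minimal sketch along these lines should do the same job: it decompresses the whole file in chunks, throws the data away, and reports the first error it hits. The function name gzip_is_readable and the chunk size are my own choices for illustration, not anything from your script.

```python
import gzip
import zlib


def gzip_is_readable(path, chunk_size=1024 * 1024):
    """Return (True, None) if the file decompresses to the end,
    otherwise (False, error message)."""
    try:
        f = gzip.open(path, 'rb')
        try:
            # Read and discard the data; only the errors matter here.
            while f.read(chunk_size):
                pass
        finally:
            f.close()
    except (IOError, EOFError, zlib.error) as exc:
        # zlib.error: damaged compressed data (as in the traceback above)
        # EOFError / IOError: truncated file or not a gzip file at all
        return False, str(exc)
    return True, None
```

On the command line, gzip -t FILE2.gz runs the same kind of integrity test without writing the decompressed data to disk.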


Yes, you were correct. I tried to gunzip it and it gave an error. The problem has been sorted out now. Thank you.