Appending a column in xlsx file using Python

I'm using LibreOffice (an older version), not Microsoft Excel. So what would be the appropriate way of calling E4:G4:F4 -> V4?

What do you mean by point 2? Will it then be
dpos = get_text_data(txt_filename)

You form 'E4' first and get its cell value.
Then you form 'G4' and get its cell value.
And then do the same for 'F4'.
Then you form "Scores" namedtuple using the cell values of 'E4', 'G4' and 'F4'.
Then you check if that "Scores" namedtuple is a key in dictionary "dpos".
If it is, then get its value from "dpos" and paste in cell 'V4'; otherwise paste the "Unknown" string in 'V4'.
So, given:

pos_col_no = 'E'
alt_col_no = 'G'
ref_col_no = 'F'
row_no = 4

how do you form 'E4'?
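
One possible answer, as a minimal sketch: joining the column letter and the row number as strings gives the cell reference. The `sheet` dictionary below is only a stand-in for a real worksheet (an assumption for illustration), and the code is written with the `print()` function so it runs under Python 3.

```python
from collections import namedtuple

Scores = namedtuple("Scores", ["POS", "ALT", "REF"])

# Stand-in for a worksheet: maps cell references to values (sample data).
sheet = {"E4": "73", "F4": "A", "G4": "G"}

# Lookup table keyed by Scores; the value is the score.
dpos = {Scores(POS="73", ALT="G", REF="A"): "11"}

pos_col_no = 'E'
alt_col_no = 'G'
ref_col_no = 'F'
row_no = 4

# A column letter plus the row number as a string gives the cell reference.
pos_ref = pos_col_no + str(row_no)   # 'E4'
alt_ref = alt_col_no + str(row_no)   # 'G4'
ref_ref = ref_col_no + str(row_no)   # 'F4'

key = Scores(POS=sheet[pos_ref], ALT=sheet[alt_ref], REF=sheet[ref_ref])

# dict.get returns the fallback string when the key is absent.
score = dpos.get(key, "Unknown")
print(pos_ref, key, score)
```

With a real worksheet, the dictionary lookups would be replaced by reads of each cell's value.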

That is closely related to point 1.
Were you able to implement point 1 first?
The "function call" mentioned in point 1 looks the same as "function signature" mentioned in point 2.

Have a look at this tutorial to see if it helps:
1.7. Print Function, Part I - Hands-on Python Tutorial for Python 3.1
1.11. Defining Functions of your Own - Hands-on Python Tutorial for Python 3.1

Or looking at it another way; in the following statement:

dpos = get_text_data(txt_filename)     

(1) what is the name of the function? and
(2) what is the name of the parameter?

Now, if you remove the parameter, then what will the above statement look like? (Have a look at the tutorial; it tells you what a function parameter is.)
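
To make the two terms concrete, here is a sketch of what a function like `get_text_data` could look like. The body is an assumption (only the call is shown here), modelled on a tab-separated file with a header line and the columns POS, REF, ALT, score; the header names and the temporary file are made up for the demo.

```python
import os
import tempfile
from collections import namedtuple

Scores = namedtuple("Scores", ["POS", "ALT", "REF"])

def get_text_data(txt_filename):
    # "get_text_data" is the function name; "txt_filename" is its parameter.
    dpos = {}
    with open(txt_filename) as handle:
        next(handle, None)                      # skip the header line
        for line in handle:
            x = line.rstrip("\n").split("\t")
            if len(x) < 4:
                continue                        # ignore short or blank lines
            dpos[Scores(POS=x[0], ALT=x[2], REF=x[1])] = x[3]
    return dpos

# Demo with a temporary file holding two sample data lines.
path = os.path.join(tempfile.mkdtemp(), "scores.txt")
with open(path, "w") as handle:
    handle.write("POS\tREF\tALT\tSCORE\n73\tA\tG\t11\n114\tC\tT\t1\n")

dpos = get_text_data(path)    # the argument is bound to txt_filename
print(dpos[Scores(POS="73", ALT="G", REF="A")])
```

The value passed in the call (here `path`) is the argument; inside the function it is known by the parameter name `txt_filename`.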


Thank you. It's just that this language is quite new to me and I am happy to learn it (just a bit slow!). Based on all your inputs, I have modified the code a bit more. It now prints the correct score based on the 3 columns on the command line, but
a) it does not save it to the Excel sheet, and
b) it does not read beyond row 4.

So the code now is

#!/usr/bin/python

import sys
import os
from openpyxl import load_workbook
from datetime import datetime
from pandas import read_table
import csv
from collections import namedtuple
import csv


# Variables
sheet_directory = r'/home/test'

def process_xl_sheets():
    dict_pos = {}
    Scores = namedtuple("Scores", ["POS", "ALT", "REF"])
    first_line = True
    with open('/home/test/scores.txt') as txt_filename:
        for line in txt_filename:
            if first_line:
                first_line = False
                continue
            line = line.rstrip('\n')
            x = line.split('\t')
            cpos = Scores(POS=x[0], ALT=x[2], REF=x[1])
            dict_pos[cpos] = x[3]
            #print dict_pos ##prints

    for sheet_root, sheet_dirs, sheet_files in os.walk(sheet_directory):
        for sheet_file in sheet_files:
            if sheet_file.endswith('.xlsx'):
                #print(sheet_file) ##prints
                sheet_xl_file = os.path.join(sheet_root, sheet_file)
                wb = load_workbook(sheet_xl_file, data_only=True)
                ws = wb.get_sheet_by_name('raw_data')
                pos_col_no = 'E'
                alt_col_no = 'G'
                ref_col_no = 'F'
                score_col_no = 'V'
                NG = score_col_no
                row_no = 4
                # cell = ws[pos_col_no  + str(row_no)]
                cell_pos = ws[pos_col_no + str(row_no)]
                cell_ref = ws[ref_col_no + str(row_no)]
                cell_alt = ws[alt_col_no + str(row_no)]
                NG = ws[score_col_no + str(row_no)] = 'Unknown_' + datetime.now().strftime("%B")
                cpos = Scores(POS=cell_pos.value, ALT=cell_alt.value, REF=cell_ref.value)
                print cpos ##prints only row E4:G4:F4 and not all
                #print dict_pos ##prints all the scores
                if (dict_pos.has_key(cpos)):
                        NG = dict_pos[cpos]
                        print dict_pos[cpos] ##prints the actual score
                else:
                        NG = 'Unknown_' + datetime.now().strftime("%B") + datetime.now().strftime("%Y")
                        row_no += 1
                        cpos = Scores(POS=cell_pos.value, ALT=cell_alt.value, REF=cell_ref.value)
    wb.save(sheet_xl_file)

process_xl_sheets()

and the command line results are

/usr/bin/python2.7 /home/test/new_dict.py

##reads only row 4 
Scores(POS=u'73', ALT=u'G', REF=u'A')

##prints all values from 'score.txt' 
{Scores(POS='2171', ALT='C', REF='T'): '5', Scores(POS='73', ALT='G', REF='A'): '11', Scores(POS='114', ALT='T', REF='C'): '1', Scores(POS='2080', ALT='C', REF='T'): '4', Scores(POS='1189', ALT='C', REF='T'): '1'}
 
##returns the value from score.txt against row 4 of xlsx but does not save it to excel 
11 

 Process finished with exit code 0

b) You did not tell Python to go beyond row 4! That's why it did not read beyond row 4.

The earlier iterations of this program had the "while" statement in them, which you removed. That "while" statement was for going beyond row 4.

What you want to do is something like the following pseudo-code:

 Go to row 4.
 Get the values of E4, G4, F4 and cpos.
 If cpos exists as a key in dict_pos, then print the value otherwise print the "unknown" string in cell V4.

 Go to row 5.
 Get the values of E5, G5, F5 and cpos.
 If cpos exists as a key in dict_pos, then print the value otherwise print the "unknown" string in cell V5.

 Go to row 6.
 Get the values of E6, G6, F6 and cpos.
 If cpos exists as a key in dict_pos, then print the value otherwise print the "unknown" string in cell V6.

 Go to row 7.
 Get the values of E7, G7, F7 and cpos.
 If cpos exists as a key in dict_pos, then print the value otherwise print the "unknown" string in cell V7.

 ...
 Keep doing this until you reach a row whose E, G and F columns are empty.
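
The pseudo-code above maps onto a single loop. A minimal sketch, assuming a plain dictionary in place of the worksheet and sample cell contents; with openpyxl, the reads and writes would go through the worksheet's cells instead.

```python
from collections import namedtuple

Scores = namedtuple("Scores", ["POS", "ALT", "REF"])

# Stand-in for a worksheet (sample data): cell reference -> value.
sheet = {
    "E4": "73",  "F4": "A", "G4": "G",
    "E5": "114", "F5": "C", "G5": "T",
    "E6": "263", "F6": "A", "G6": "G",
}
dict_pos = {
    Scores(POS="73", ALT="G", REF="A"): "11",
    Scores(POS="114", ALT="T", REF="C"): "1",
}

row_no = 4
while True:
    # Re-read the three cells for the current row on every pass.
    pos = sheet.get("E" + str(row_no))
    alt = sheet.get("G" + str(row_no))
    ref = sheet.get("F" + str(row_no))
    if pos is None and alt is None and ref is None:
        break                                   # E, G and F are all empty: stop
    cpos = Scores(POS=pos, ALT=alt, REF=ref)
    sheet["V" + str(row_no)] = dict_pos.get(cpos, "Unknown")
    row_no += 1                                 # go to the next row

print(sheet["V4"], sheet["V5"], sheet["V6"])
```

The three repeated blocks of the pseudo-code collapse into one loop body because only the row number differs between them.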
 

I did use the while statement before, but that just printed the value of the score continuously without stopping. This is what I did:

            NG = ws[score_col_no + str(row_no)] = 'Unknown_' + datetime.now().strftime("%B")
            cpos = Scores(POS=cell_pos.value, ALT=cell_alt.value, REF=cell_ref.value)
            print cpos ##prints only row E4:G4:F4 and not all
            while cpos:
                if (dict_pos.has_key(cpos)):
                    NG = dict_pos[cpos]
                    print dict_pos[cpos] ##prints the actual score
                else:
                    NG = 'Unknown_' + datetime.now().strftime("%B") + datetime.now().strftime("%Y")
                    row_no += 1
                    cpos = Scores(POS=cell_pos.value, ALT=cell_alt.value, REF=cell_ref.value)
            wb.save(sheet_xl_file)

process_xl_sheets()

No value was saved to the Excel sheet, and the code did not stop:

/usr/bin/python2.7 /home/test/new_dict.py 
 ##reads only row 4  
Scores(POS=u'73', ALT=u'G', REF=u'A')  
##prints all values from 'score.txt'  
{Scores(POS='2171', ALT='C', REF='T'): '5', Scores(POS='73', ALT='G', REF='A'): '11', Scores(POS='114', ALT='T', REF='C'): '1', Scores(POS='2080', ALT='C', REF='T'): '4', Scores(POS='1189', ALT='C', REF='T'): '1'}   
##returns the value from score.txt against row 4 of xlsx but does not save it to excel  11
11
11
11
11
11
11
11
11
.... 



That's because the "while" statement went into what is called an "infinite loop".
That's because "cpos" is always True every time it is checked.
[ It means that "cpos" always has a value every time it is checked and that value is Scores(POS='73', ALT='G', REF='A'). A namedtuple with three fields is a non-empty tuple, so Python treats it as true whatever its fields contain. ]

You want "cpos" to change in every iteration of the "while" loop.
And you want the row_no to increment in every iteration of the "while" loop.

You are incrementing row_no and changing cpos within the else branch.
You need to increment row_no and change cpos within the body of the while statement, not within the else branch.
In Python, the indentation of the statement decides what is done "within" what.
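
Both points can be checked in isolation. The sketch below uses made-up data and no spreadsheet: first it shows that a `Scores` tuple is truthy even when all of its fields are empty, then it shows how indentation places the increment in the loop body rather than in the else branch.

```python
from collections import namedtuple

Scores = namedtuple("Scores", ["POS", "ALT", "REF"])

empty_row = Scores(POS=None, ALT=None, REF=None)

# A namedtuple with three fields is a non-empty tuple, so it is truthy
# even when every field is None.
print(bool(empty_row))    # True
print(any(empty_row))     # False: no field holds a truthy value

# Indentation decides what belongs to the else branch and what belongs
# to the loop body.
rows = ["hit", "miss", "hit"]
i = 0
visited = []
while i < len(rows):
    if rows[i] == "hit":
        visited.append("found")
    else:
        visited.append("unknown")
    i += 1    # aligned with if/else: runs on every iteration
print(visited)
```

If `i += 1` were indented one level further, it would run only when the else branch runs, and the loop would never get past the first "hit".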

---------- Post updated at 02:12 PM ---------- Previous update was at 10:33 AM ----------

Here's how your code is being executed; I have added line numbers at the left to help track the path of execution:

 1    NG = ws[score_col_no + str(row_no)] = 'Unknown_' + datetime.now().strftime("%B")
 2    cpos = Scores(POS=cell_pos.value, ALT=cell_alt.value, REF=cell_ref.value)
 3    print cpos ##prints only row E4:G4:F4 and not all
 4    while cpos:
 5        if (dict_pos.has_key(cpos)):
 6            NG = dict_pos[cpos]
 7            print dict_pos[cpos] ##prints the actual score
 8        else:
 9            NG = 'Unknown_' + datetime.now().strftime("%B") + datetime.now().strftime("%Y")
10            row_no += 1
11            cpos = Scores(POS=cell_pos.value, ALT=cell_alt.value, REF=cell_ref.value)
12    wb.save(sheet_xl_file)
  
Line 1 => Set both NG and cell V4 to 'Unknown_June' (a chained assignment gives the same value to every target).
Line 2 => Set the value of cpos to Scores(POS='73', ALT='G', REF='A')
Line 3 => Print the value of cpos
  
Line 4 => Is cpos True? Yes it is.
Line 5 => Does dict_pos have the key cpos = Scores(POS='73', ALT='G', REF='A')? Yes it does.
Line 6 => Set the value of NG to dict_pos[cpos] = 11
Line 7 => Print the value of dict_pos[cpos] = 11
Line 8 => Disregard the "else" branch and all 3 statements within it (lines 9, 10, 11) because "if" branch was processed. The control goes back to Line 4.
  
Line 4 => Is cpos True? Yes it is.
Line 5 => Does dict_pos have the key cpos = Scores(POS='73', ALT='G', REF='A')? Yes it does.
Line 6 => Set the value of NG to dict_pos[cpos] = 11
Line 7 => Print the value of dict_pos[cpos] = 11
Line 8 => Disregard the "else" branch and all 3 statements within it (lines 9, 10, 11) because "if" branch was processed. The control goes back to Line 4.
  
Line 4 => Is cpos True? Yes it is.
Line 5 => Does dict_pos have the key cpos = Scores(POS='73', ALT='G', REF='A')? Yes it does.
Line 6 => Set the value of NG to dict_pos[cpos] = 11
Line 7 => Print the value of dict_pos[cpos] = 11
Line 8 => Disregard the "else" branch and all 3 statements within it (lines 9, 10, 11) because "if" branch was processed. The control goes back to Line 4.
  
Line 4 => Is cpos True? Yes it is.
Line 5 => Does dict_pos have the key cpos = Scores(POS='73', ALT='G', REF='A')? Yes it does.
Line 6 => Set the value of NG to dict_pos[cpos] = 11
Line 7 => Print the value of dict_pos[cpos] = 11
Line 8 => Disregard the "else" branch and all 3 statements within it (lines 9, 10, 11) because "if" branch was processed. The control goes back to Line 4.
  
Line 4 => Is cpos True? Yes it is.
Line 5 => Does dict_pos have the key cpos = Scores(POS='73', ALT='G', REF='A')? Yes it does.
Line 6 => Set the value of NG to dict_pos[cpos] = 11
Line 7 => Print the value of dict_pos[cpos] = 11
Line 8 => Disregard the "else" branch and all 3 statements within it (lines 9, 10, 11) because "if" branch was processed. The control goes back to Line 4.
  
...
...
...
And so on, ad infinitum...
...
...

You have to give lines 10 and 11 a chance to execute.


I'm not quite sure what is going wrong, I have tried this as well

                 while cpos:
                    if (dict_pos.has_key(cpos)):
                        NG = dict_pos[cpos]
                        print dict_pos[cpos]
                    else:
                        NG = 'Unknown_' + datetime.now().strftime("%B") + datetime.now().strftime("%Y")
                    row_no += 1
                    cpos = Scores(POS=cell_pos.value, ALT=cell_alt.value, REF=cell_ref.value)
                wb.save(sheet_xl_file)

it still goes into an infinite loop.
Also, why does it not save '11' in the Excel workbook for the first row, and instead writes 'unknown', even though it shows the correct score on the terminal?

Print the value of "cpos" inside the "while" loop.
It is the same every time.
That's the reason for the infinite loop.
You need the value of "cpos" to change in every iteration of the loop.


Isn't the value of cpos inside the while loop?

                while cpos:
                    if (dict_pos.has_key(cpos)):
                        NG = dict_pos[cpos]
                        #print dict_pos[cpos]
                    else:
                        NG = 'Unknown_' + datetime.now().strftime("%B") + datetime.now().strftime("%Y")
                    row_no += 1
                    cpos = Scores(POS=cell_pos.value, ALT=cell_alt.value, REF=cell_ref.value)
                    print cpos
                wb.save(sheet_xl_file)

Yes, "cpos" is inside the while loop. But is it changing every time the "row_no" changes?

Do this:
1) Print "row_no" and "cpos". (You are printing cpos, so print row_no as well.)

2) Then compare the values of "row_no", "cpos" printed by your Python program with the data in your Excel spreadsheet.

For row_no = 4, the Python "cpos" should reflect Excel's cells E4, G4, F4
For row_no = 5, the Python "cpos" should reflect Excel's cells E5, G5, F5
For row_no = 6, the Python "cpos" should reflect Excel's cells E6, G6, F6
...

Try it and see if that's what is actually happening.
If not, then that is wrong and the cause for the infinite loop.

                while cpos:
                    if (dict_pos.has_key(cpos)):
                        NG = dict_pos[cpos]
                        print dict_pos[cpos]
                    else:
                        NG = 'Unknown_' + datetime.now().strftime("%B") + datetime.now().strftime("%Y")
                    row_no += 1
                    print row_no
                    cpos = Scores(POS=cell_pos.value, ALT=cell_alt.value, REF=cell_ref.value)
                    #print cpos
                wb.save(sheet_xl_file)

I'm not able to print the row number either, as

 print dict_pos[cpos]

goes into a loop. I don't know how to proceed.

Don't print "dict_pos[cpos]". It was commented out in your earlier code anyway. Remove it or comment it out.
Simply print row_no and cpos.
Uncomment the "#print cpos" line.

When I do that, the values of cpos go into a loop:

Scores(POS=u'73', ALT=u'G', REF=u'A')
2921689
Scores(POS=u'73', ALT=u'G', REF=u'A')
2921690
Scores(POS=u'73', ALT=u'G', REF=u'A')
2921691
Scores(POS=u'73', ALT=u'G', REF=u'A')


Ok, that's more descriptive now.
So the numbers you see: 2921689, 2921690, 2921691 are the row numbers of your spreadsheet.
The "Scores" value is the Python namedtuple value corresponding to (73, 'G', 'A').

Now in your Excel spreadsheet, (73, 'G', 'A') are the values of (E4, G4, F4), most likely <== can you confirm that?

So what's happening is that you are going to row_no = 5, but cpos is still equal to "Scores(pos=E4, alt=G4, ref=F4)".
When you are in row_no = 5, you need to recalculate cpos by picking up values from E5, G5, F5.

Same thing with row_no = 6, 7, 8, .... 2921689, 2921690, ...

In case the output is being printed too fast, change the condition "while cpos:" to "while row_no <= 10:".

---------- Post updated at 08:16 AM ---------- Previous update was at 08:10 AM ----------

Here is what your output should look like:

Scores(POS='73', ALT='Blah', REF='Blah')
5
Scores(POS='114', ALT='Blah', REF='Blah')
6
Scores(POS='263', ALT='Blah', REF='Blah')
7
Scores(POS='309', ALT='Blah', REF='Blah')
8
Scores(POS='497', ALT='Blah', REF='Blah')
9
Scores(POS='513', ALT='Blah', REF='Blah')
10
Scores(POS='750', ALT='Blah', REF='Blah')
11

This is based on the "S12.xlsx" you posted earlier in this thread.


Correct, those values are of (E4, G4, F4). But with "row_no += 1", shouldn't it pick up the values from the next row? I thought this while loop meant 'go to row 4; if the row matches the score from the dictionary, print the value, or else print unknown; then go on to the next row and do the same'. How would I recalculate the values in a loop for all rows?

                while row_no <=10:
                    if (dict_pos.has_key(cpos)):
                        NG = dict_pos[cpos]
                        #print dict_pos[cpos]
                    else:
                        NG = 'Unknown_' + datetime.now().strftime("%B") + datetime.now().strftime("%Y")
                    row_no += 1
                    print row_no
                    cpos = Scores(POS=cell_pos.value, ALT=cell_alt.value, REF=cell_ref.value)
                    print cpos
                wb.save(sheet_xl_file)
/usr/bin/python2.7 annotate_new_dict.py
5
Scores(POS=u'73', ALT=u'G', REF=u'A')
6
Scores(POS=u'73', ALT=u'G', REF=u'A')
7
Scores(POS=u'73', ALT=u'G', REF=u'A')
8
Scores(POS=u'73', ALT=u'G', REF=u'A')
9
Scores(POS=u'73', ALT=u'G', REF=u'A')
10
Scores(POS=u'73', ALT=u'G', REF=u'A')
11
Scores(POS=u'73', ALT=u'G', REF=u'A')

Process finished with exit code 0

No, it should not.
It will not.
No programming language will do that for you.
Golden rule of programming => a programming language will only do what you tell it to do. It will not do anything on its own.
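
A small sketch of that rule, using only plain strings (no spreadsheet involved; the variable name `address` is made up for illustration): a value built from `row_no` is computed once and does not update itself when `row_no` changes later.

```python
row_no = 4
address = "E" + str(row_no)    # computed once, while row_no is 4

row_no += 1                    # row_no is now 5 ...
print(address)                 # ... but this still prints E4

address = "E" + str(row_no)    # only an explicit recalculation updates it
print(address)                 # prints E5
```

The same holds for cell_pos, cell_alt, cell_ref and cpos: each keeps the row 4 contents until the statements that compute them are executed again.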

The "while" loop is only for repeating a "set of actions" while a "condition" is true.

The repetitive "set of actions" is:
1) going to next row
2) determining pos value, alt value, ref value for the row we reached
3) constructing the cpos value from the 3 values determined in the step above
4) checking if this cpos value is in the dictionary dict_pos
5) setting the value of the "score" cell in the row we are now on

Since 1) through 5) are repetitive, we work on the entire spreadsheet, but one row at a time.
Inside a "while" loop, we are only working on one row - the current row.

The "condition" is: either pos value is non-empty or alt value is non-empty or ref value is non-empty.

As you can see, the structure of a "while" loop is very generic (repeat a set of actions while a condition is true), so it can be used in a wide variety of situations in any programming language.

Old proverb: "How to eat an elephant? One bite at a time." :)

As I mentioned earlier, inside a loop, you only have to calculate the pos, alt and ref values for one row - the current row, the row you are on.

Now, for row_no = 4, you had a few statements that calculated cell_pos, cell_alt and cell_ref. Then you calculated "cpos" from those cell_pos, cell_alt and cell_ref values.

You need those statements inside the "while" loop.
So once the row_no is incremented inside the loop, you:
1) determine the cell_pos, cell_alt and cell_ref
2) next you calculate the cpos using cell_pos, cell_alt and cell_ref
This will be your "cpos" - only for the current row.
When printed, you will see different values of cpos - printed one at a time from within your "while" loop.

(The output that I showed in my earlier post was not printed in one shot - it was printed during each iteration of the loop - one line printed per iteration.)
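
The two steps can also be wrapped in a small helper that is called once per iteration. This is a sketch only: `sheet` is a plain dictionary standing in for the worksheet, and the helper name `key_for_row` is made up for illustration.

```python
from collections import namedtuple

Scores = namedtuple("Scores", ["POS", "ALT", "REF"])

# Plain dictionary standing in for the worksheet (not an openpyxl object).
sheet = {
    "E4": "73",  "G4": "G", "F4": "A",
    "E5": "114", "G5": "T", "F5": "C",
    "E6": "263", "G6": "G", "F6": "A",
}

def key_for_row(sheet, row_no):
    """Build the Scores key from columns E, G and F of one row."""
    pos = sheet.get("E" + str(row_no))
    alt = sheet.get("G" + str(row_no))
    ref = sheet.get("F" + str(row_no))
    return Scores(POS=pos, ALT=alt, REF=ref)

seen = []
row_no = 4
while row_no <= 6:                     # bounded on purpose, for inspection
    cpos = key_for_row(sheet, row_no)  # recalculated for the current row
    print(row_no, cpos)
    seen.append(cpos.POS)
    row_no += 1
```

Because the helper takes `row_no` as a parameter, the key cannot go stale: every call reads the cells of the row it is given.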


okay, I'm getting there now slowly! But why isn't it saving the results to the excel sheet ?

     for sheet_root, sheet_dirs, sheet_files in os.walk(sheet_directory):
        for sheet_file in sheet_files:
            if sheet_file.endswith('.xlsx'):
                #print(sheet_file)
                sheet_xl_file = os.path.join(sheet_root, sheet_file)
                wb = load_workbook(sheet_xl_file, data_only=True)
                ws = wb.get_sheet_by_name('Unannotated')
                pos_col_no = 'E'
                alt_col_no = 'G'
                ref_col_no = 'F'
                score_col_no = 'V'
                row_no = 4
                NG = ws[score_col_no + str(row_no)]
                #print row_no
                cell_pos = ws[pos_col_no + str(row_no)]
                cell_ref = ws[ref_col_no + str(row_no)]
                cell_alt = ws[alt_col_no + str(row_no)]
                cpos = Scores(POS=cell_pos.value, ALT=cell_alt.value, REF=cell_ref.value)
                print cpos  ##prints only row E4:G4:F4
                #while cpos:
                while row_no <=10:
                    if (dict_pos.has_key(cpos)):
                        NG = dict_pos[cpos]
                        print dict_pos[cpos]
                    else:
                        NG = 'Unknown_' + datetime.now().strftime("%B") + datetime.now().strftime("%Y")
                    row_no += 1
                    #print row_no
                    cell_pos = ws[pos_col_no + str(row_no)]
                    cell_ref = ws[ref_col_no + str(row_no)]
                    cell_alt = ws[alt_col_no + str(row_no)]
                    cpos = Scores(POS=cell_pos.value, ALT=cell_alt.value, REF=cell_ref.value)
                    print cpos
                    if (dict_pos.has_key(cpos)):
                        NG = dict_pos[cpos]
                    else:
                        NG = 'Unknown_' + datetime.now().strftime("%B") + datetime.now().strftime("%Y")
                        print NG
                wb.save(sheet_xl_file)
process_xl_sheets()
/usr/bin/python2.7 annotate_new_dict.py
Scores(POS=u'73', ALT=u'G', REF=u'A')
11
Scores(POS=u'114', ALT=u'T', REF=u'C')
1
Scores(POS=u'263', ALT=u'G', REF=u'A')
Unknown_July2017
Scores(POS=u'309', ALT=u'T', REF=u'C')
Unknown_July2017
Scores(POS=u'497', ALT=u'T', REF=u'C')
Unknown_July2017
Scores(POS=u'513', ALT=u'G', REF=u'GCA')
Unknown_July2017
Scores(POS=u'750', ALT=u'G', REF=u'A')
1
Scores(POS=u'1189', ALT=u'C', REF=u'T')

Process finished with exit code 0

Yes, much better.
Now you are able to capture the values of E,G,F columns of every row you are going through.

As for saving the results, here are a few things you need to know about "openpyxl" module.
It has classes like "Workbook", "Worksheet", "Cell" etc. in it.

If "ws" is a "worksheet" object, then:

ws['V4']

is the cell object 'V4' in the worksheet.

The cell object has many attributes like "value", "fill", "comment" etc.

If you want to make changes to a cell, you set its attributes.
So, to set the cell's value, you set its "value" attribute. For example:

ws['V4'].value = 'Hello, World!'

will set the value of the cell 'V4' to 'Hello, World!'

Another example; the following:

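ws['V4'].fill = PatternFill(fgColor="FF0000", fill_type="solid")

will fill the cell 'V4' with red. (PatternFill is imported from openpyxl.styles; for a solid fill, the visible color is the foreground color, fgColor.)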

Now in your code, the following statement:

NG = ws[score_col_no + str(row_no)]

assigns the cell object 'V4' (not the cell's value) to the variable NG.

However, this line:

NG = dict_pos[cpos]

rebinds the variable NG to the string from the dictionary. NG no longer refers to the cell, and the cell's value is never set.

Another thing => you need to recalculate "NG" inside the loop every time "row_no" increases.
Also => remove the second "if .. else" statement inside the "while" loop.
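
The difference between rebinding a name and setting an attribute can be seen without openpyxl. In this sketch, the `Cell` class is a minimal stand-in written for illustration (it is not the openpyxl class), and `sheet` is a plain dictionary.

```python
class Cell(object):
    """Minimal stand-in for a spreadsheet cell (not the openpyxl class)."""
    def __init__(self, value=None):
        self.value = value

sheet = {"V4": Cell(), "V5": Cell()}   # hypothetical worksheet stand-in

NG = sheet["V4"]           # NG now refers to the cell object
NG = "11"                  # rebinds the name NG to a string; cell untouched
print(sheet["V4"].value)   # None

NG = sheet["V4"]           # fetch the cell object again
NG.value = "11"            # sets the attribute on the cell itself
print(sheet["V4"].value)   # 11

# Inside a loop, look the cell up again for each row.
for row_no in (4, 5):
    NG = sheet["V" + str(row_no)]
    NG.value = "row %d" % row_no
print(sheet["V5"].value)   # row 5
```

The loop at the end mirrors the advice to recalculate NG whenever row_no changes: without the lookup inside the loop, every write would land in the same cell.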


Got it! So I have changed it to

if (dict_pos.has_key(cpos)):
    NG.value = dict_pos[cpos]
else:
    NG.value = 'Unknown_' + datetime.now().strftime("%B") + datetime.now().strftime("%Y")
    print NG

One last question: right now the code reads rows that don't have values and ends up in a loop. For now I have limited the rows with 'while row_no <=10:', but different Excel sheets have different numbers of rows (some have 50, some have 100). How can I amend this?

Awesome!
So the final step is just some tweaking in the "while <condition>" clause.
I had suggested the condition: "while row_no <= 10" so that you could see the "cpos" value for the first few rows (7 rows actually: row_no 4 through 10).
Otherwise, the "infinite loop" was printing lines too fast and it can be difficult to understand what's happening in that case.

The "infinite loop" problem still remains. The "row_no <= 10" was just a temporary patch.

Now, if you look at post # 36 in this thread, I had mentioned the working of a "while" statement.
The "while" statement repeats a set of actions while a condition is true.

So ask yourself this: in any worksheet, until when should I keep looking at the values of E, G, F cells and form the key and check against "dict_pos"?
If you were to update your worksheet manually, then up to what point would you go? When would you stop?
The answer to that question will decide what you want to put in the "while" condition.

If you ask me, I would do it manually until one of the cells E, G or F is empty. The moment one or more of them is empty, I would stop.
But I don't know what your requirements are.
Maybe you want to go until all the cells E, G, and F are empty.
So your condition will differ from mine.
Either way, you will have to work with using cell values in the while condition.
Have a look at post # 4 of this thread where I posted the first program.
In that program, I test the cell value in the while condition.
Use it to form your "while" condition.
Post your attempt or ask a question if you find it difficult to continue.
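
The two candidate conditions differ only in "or" versus "and". A sketch on made-up row data, with hypothetical helper names (`is_empty`, `stop_when_any_empty`, `stop_when_all_empty`, `count_rows`), shows how many rows each one would process:

```python
def is_empty(value):
    """Treat None and blank strings as empty."""
    return value is None or str(value).strip() == ""

def stop_when_any_empty(pos, alt, ref):
    return is_empty(pos) or is_empty(alt) or is_empty(ref)

def stop_when_all_empty(pos, alt, ref):
    return is_empty(pos) and is_empty(alt) and is_empty(ref)

# Hypothetical rows of (E, G, F) values; the last row is entirely blank.
rows = [("73", "G", "A"), ("114", None, "C"), (None, None, None)]

def count_rows(rows, should_stop):
    """Count the rows processed before the stop condition is met."""
    count = 0
    while count < len(rows) and not should_stop(*rows[count]):
        count += 1
    return count

print(count_rows(rows, stop_when_any_empty))  # 1
print(count_rows(rows, stop_when_all_empty))  # 2
```

openpyxl worksheets also expose a `max_row` attribute that could serve as an upper bound for the loop, though it may count rows that are formatted but hold no data, so testing the cell values is still worthwhile.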
