Python & python & sqlalchemy: how to pass data from mysql to python?

visitor123 · April 16, 2024, 2:03pm

Hello,
is there experience user who can help here? I tried to make selection from mysql. I prepared the query and tested it in phpmyadmin (the sql querry was logged by mysql to general log file, so I just copied the query and repeated the result. So phpmyadmin had no problem to display the sid, stop_name results.)

So this is what I tried:

def process_chunk(chunk, engine):
    chunk['route_id'] = chunk['route_id'].apply(lambda x: x.replace('-CISR-', ''))
    route_ids = chunk['route_id'].unique()    
    query = "SELECT DISTINCT trips.sid, stop_names.stop_name FROM trips LEFT JOIN stop_names ON trips.sid = stop_names.sid WHERE trips.route_id IN %s"    
    chunk_result = pd.read_sql_query(query, engine, params=[tuple(route_ids)])
    for index, row in chunk_result.iterrows():
        needle = row['stop_name']

The needle should contain name of bus stop in a town like "Praha Vinohrady, ...." but it contains value 'stop_name' instead.
Originally I tried it without mysql alchemy and engine, but directly with connection.

I cannot find out what I am doing wrong. The screenshot is from Visual Studio Code. I have no idea how the stop_name value gets there instead the result from mysql.

screenshot

try:
    engine = create_engine('mysql+pymysql://', creator=lambda: mydb)
    print("Engine created")
except Exception as e:
    print("Error:", e)

bendingrodriguez · April 16, 2024, 4:57pm

Hi @visitor123,

route_ids has to be a list or tuple of strings or numbers. Is that the case? Also, params normally has to be a tuple or a dict, and not a list - but that is driver dependant (note that a tuple containing one element only has to be written as (elem,), i.e. with a trailing comma).

visitor123 · April 17, 2024, 12:14pm

Thank you for your response.
Lot's of things changed.
My original goal was to use sqlalchemy on pandas' dataframe. But when I try to execute the query (or session.statement) I got error.

raise TypeError("Query must be a string unless using sqlalchemy.")
TypeError: Query must be a string unless using sqlalchemy.

This is current problem. But it does not make sense to use sqlalchemy on pandas dataframe if it uses text and not ORM objects. I could just run any SQL querry. That sucks.

        session.statement = session.query(Trip.sid, StopName.stop_name).select_from(Trip).join(StopName, Trip.sid == StopName.sid).filter(Trip.route_id.in_(route_ids)).distinct() # probably invalid query, have to be fixed
        chunk_result = pd.read_sql(session.statement, session.bind)

Here is complete code:


from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String, DECIMAL, ForeignKey, text
from sqlalchemy.orm import sessionmaker, relationship
from sqlalchemy.ext.declarative import declarative_base
import pandas as pd
import zipfile
import io
import re

Base = declarative_base()

class Stop(Base):
    __tablename__ = 'stops'

    sid = Column(Integer, primary_key=True)
    stop_lat = Column(DECIMAL(10, 8))
    stop_lon = Column(DECIMAL(11, 8))
    dopravce = Column(String(255))
    id = Column(Integer)

class StopName(Base):
    __tablename__ = 'stop_names'

    sid = Column(Integer, primary_key=True, autoincrement=True)
    stop_name = Column(String(255), unique=True)

class Trip(Base):
    __tablename__ = 'trips'

    route_id = Column(String(255))
    trip_id_1 = Column(Integer, primary_key=True)
    trip_id_2 = Column(Integer, primary_key=True)
    trip_id_3 = Column(Integer, primary_key=True)
    direction_id = Column(Integer)
    sid = Column(Integer, ForeignKey('stops.sid'))

    stop = relationship("Stop")

password=""
db="trafic"
try:
    engine = create_engine(f"mysql+pymysql://admin:{password}@localhost/{db}")
    print("Engine OK.")
except Exception as e:
    print("Engine error:", e)

try:
    connection = engine.connect()
    print("Conn. OK.")
except Exception as e:
    print("Conn. FAIL.", e)

Base.metadata.create_all(engine)

def process_and_insert_data(session, data):
    for chunk in data:
        chunk['route_id'] = chunk['route_id'].str.replace('-CISR-', '')
        route_ids = chunk['route_id'].unique()
        
        session.statement = session.query(Trip.sid, StopName.stop_name).select_from(Trip).join(StopName, Trip.sid == StopName.sid).filter(Trip.route_id.in_(route_ids)).distinct()
        chunk_result = pd.read_sql(session.statement, session.bind)
        
        for row in chunk_result:
            needle = row[1]
            for index, route_long_name in chunk['route_long_name'].items():
                pattern = re.compile(r'\b' + re.escape(needle) + r'\b')
                if re.search(pattern, route_long_name):
                    chunk.at[index, 'route_long_name'] = re.sub(pattern, f"#{row[0]}", route_long_name)
                else:
                    city_name = needle.split(',')[0].strip()  
                    city_pattern = re.compile(r'\b' + re.escape(city_name) + r'\b')
                    if re.search(city_pattern, route_long_name):
                        chunk.at[index, 'route_long_name'] = re.sub(city_pattern, f"#{row[0]}", route_long_name)
        
        chunk.to_sql(name='routes', con=engine, if_exists='append', index=False)

zip_path = "/mnt/ramdisk/JDF_merged_GTFS.zip"
chunksize = 10

with zipfile.ZipFile(zip_path, 'r') as zip_file:
    with zip_file.open('routes.txt', 'r') as routes_file:
        data = pd.read_csv(io.TextIOWrapper(routes_file, 'utf-8'), chunksize=chunksize)
        Session = sessionmaker(bind=engine)
        with Session() as session:
            process_and_insert_data(session, data)

engine.dispose()

bendingrodriguez · April 17, 2024, 3:36pm

Hi @visitor123,

the pandas doc says that the 1st param of read_sql() has to be a str or SQLAlchemy Selectable. So you have to use either a SQL string as before, or a selectable.

Session = sessionmaker(bind=engine)
with Session() as session:

That looks a bit strange,

from sqlalchemy.orm import Session
...
engine = ...
...
with Session(engine) as session:

should be the simpliest form, see e.g. Session Basics — SQLAlchemy 2.0 Documentation.

Also, the code within the for row loop seems a bit cumbersome, but I didn't take a closer look at it.

visitor123 · April 17, 2024, 4:14pm

In the manual I see some example like this:

from sqlalchemy import join

j = user_table.join(address_table,
                user_table.c.id == address_table.c.user_id)
stmt = select([user_table]).select_from(j)

there is the select object but it is not defined.

It seem like join is needed to be imported (1st char is small).

from sqlalchemy import create_engine, MetaData, Table, Column, join, Integer, String, DECIMAL, ForeignKey, text

But the error, select object is not defined:

        selectable_query = select([Trip.sid, StopName.stop_name]).select_from(Trip).join(StopName, Trip.sid == StopName.sid).filter(Trip.route_id.in_(route_ids)).distinct()
        chunk_result = pd.read_sql(selectable_query, session.bind)

visitor123 · April 17, 2024, 7:11pm

In Visual Studio Code in .venv I have upgrade the python and pandas.

pip install --upgrade pandas sqlalchemy

Now it works with session.statement.selectable

But this does not work in terminal.

As a user I tried:
$ pip install --upgrade pandas sqlalchemy
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: pandas in /home/user/.local/lib/python3.10/site-packages (2.2.2)
Requirement already satisfied: sqlalchemy in /home/user/.local/lib/python3.10/site-packages (2.0.29)
Requirement already satisfied: numpy>=1.22.4 in /home/user/.local/lib/python3.10/site-packages (from pandas) (1.26.4)
Requirement already satisfied: python-dateutil>=2.8.2 in /home/user/.local/lib/python3.10/site-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /usr/lib/python3/dist-packages (from pandas) (2022.1)
Requirement already satisfied: tzdata>=2022.7 in /home/user/.local/lib/python3.10/site-packages (from pandas) (2024.1)
Requirement already satisfied: typing-extensions>=4.6.0 in /home/user/.local/lib/python3.10/site-packages (from sqlalchemy) (4.10.0)
Requirement already satisfied: greenlet!=0.4.17 in /usr/lib/python3/dist-packages (from sqlalchemy) (1.1.2)
Requirement already satisfied: six>=1.5 in /usr/lib/python3/dist-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)

Also I would like to try this (from the manual):
sqlalchemy.sql.expression.select(columns=None, whereclause=None, from_obj=None, distinct=False, having=None, correlate=True, prefixes=None, suffixes=None, **kwargs)¶
With distinct=True.
But I got error select is undefined.
The code from ChatGPT:

selectable_query = select([Trip.sid, StopName.stop_name], distinct=True).select_from(Trip).join(StopName, Trip.sid == StopName.sid).filter(Trip.route_id.in_(route_ids)).distinct()

bendingrodriguez · April 18, 2024, 6:14am

upper or lower case is not a condition for whether a function (or any other object) needs to be imported or not. All objects except builtin objects/functions/keywords must be imported, from the standard lib (like os, csv etc) or a locally installed lib (like pandas).

So, in order to use select and join, you have to import them:

from sqlalchemy import select, join
# or, more specified (preferable)
from sqlalchemy.sql.expression import select, join
...
foo = select(bar)

Or you import the complete lib, but then the object path must be specified on each call:

import sqlalchemy
...
foo = sqlalchemy.select(bar)
# or, more specified (preferable)
foo = sqlalchemy.sql.expression.select(bar)

Both methods do not differ in their performance, it is just a question of coding style - so there is no general 'better' or 'worse' method.

Note that libs and modules are not the same things, see e.g. What is a Python module? How is it different from libraries?.

visitor123 · April 18, 2024, 8:02am

Thank you. I'm reading your response.
When I run in normal non root environment this command
pip install --upgrade sqlalchemy
I got this:

Defaulting to user installation because normal site-packages is not writeable
sqlalchemy needs upgrade.

I've update the sqlalchemy using root. Never mind. The main thing now it runs already. However. I am not sure with the correct solution for my query. So here I will send one more time the query on ChatGPT and translation from my native language... I includes the answer, it just mixed it in one post.

visitor123 · April 18, 2024, 9:25am

To add the join attribute to the Trip table and perform a connection with the stop_names table through the relationship on the sid column, we can use the relationship property. Here is the updated code:

python

from sqlalchemy.orm import relationship

# Definition of the model for the trips table
class Trip(Base):
    __tablename__ = 'trips'

    route_id = Column(String(255))
    trip_id_1 = Column(Integer, primary_key=True)
    trip_id_2 = Column(Integer, primary_key=True)
    trip_id_3 = Column(Integer, primary_key=True)
    direction_id = Column(Integer)
    sid = Column(Integer, ForeignKey('stops.sid'))

    # Adding the join attribute
    stop_name = relationship("StopName", primaryjoin='Trip.sid == StopName.sid', uselist=False)

In this update, we added the stop_name attribute to the Trip class, which creates a relationship between the Trip and StopName tables using the sid key. Setting uselist=False ensures that the stop_name attribute will be a one-to-one relationship (i.e., one Trip record will correspond to one record in the StopName table).

However, it seems you encountered an error: AttributeError: type object 'Trip' has no attribute 'join'

python

selectable_query = select([Trip.sid, StopName.stop_name], from_obj=Trip.join(StopName, Trip.sid == StopName.sid).where(Trip.route_id.in_(route_ids), distinct=True))

It is important to understand that first, the route_id should be searched in the routes table, with only one row returned for each route_id, a total number of rows, which will be set to a test chunksize of 10, later chunksize 1000. So during the test, 10 rows and 10 different route_ids should be returned, then a left join should be performed... select stop_name from stop_names where it is possible that the resulting city names will be repeated. So I will have 10 different routes for 10 different trips (different bus lines) where the stop name will be for example "Ostrava, Bus Station" (10 identical results in the final output) must be allowed.

I understand that you need to execute a combination of select and join with the distinct condition to obtain unique combinations from the Trip and StopName tables based on route_id. Here is the updated code you can use:

python

from sqlalchemy import distinct

# Definition of the model for the trips table
class Trip(Base):
    __tablename__ = 'trips'

    route_id = Column(String(255))
    trip_id_1 = Column(Integer, primary_key=True)
    trip_id_2 = Column(Integer, primary_key=True)
    trip_id_3 = Column(Integer, primary_key=True)
    direction_id = Column(Integer)
    sid = Column(Integer, ForeignKey('stops.sid'))

    # Adding the join attribute
    stop_name = relationship("StopName", primaryjoin='Trip.sid == StopName.sid', uselist=False)

# Creating a query with a combination of select, join, and distinct
selectable_query = select([Trip.route_id, StopName.stop_name]).select_from(Trip).join(StopName, Trip.sid == StopName.sid).where(Trip.route_id.in_(route_ids)).distinct()

This code creates a query with a combination of select and join, where the results are filtered by route_id and are also unique due to the use of distinct(). I hope this resolves your issue.

PLEASE NOTE: THIS CODE CAN BE COMPLETLY WRONG, IT'S LLM GENERATED.

bendingrodriguez · April 18, 2024, 12:01pm

Hi @visitor123,

I would be careful with LLM generated code, especially if you want to learn a programming language and don't have that much experience yet. Tutorials (e.g. https://realpython.com) or, as in times past, books are much more suitable for this.

It looks that you are complicating things a bit. To execute 'normal' DB queries such as select, insert or update, normal SQL code, i.e. a plain string, is sufficient.

sqlaclchemy or peewee are ORMs, which can be used to map DB tables, queries, relationships etc. to objects in the respective programming language, like class <tablename>(Base) or Column(Integer) in Python. Libs like python3-mysqldb (Debian/Ubuntu) or mysql-connector-python (pip) are more easy to use resp. less complex.

You probably don't need pandas either. It's a huge lib that is primarily intended for numerical data analysis and processing, providing its own data structures, like DataFrame.

Note that you can return selected data as a list/tuple, i.e. a sequence, or as a dict, i.e. a mapping. In the former case you have to know the position of a column that you want to access, in the latter form you have to know its name:

# assuming that select() only returns one row
# via list/tuple
row = db.select(sql); print(row[0])  # 1st column
# via dict
row = db.select(sql); print(row['mycolumn'])

That type can be set in the DB connection resp. cursor initialization. Which type you use is more or less a matter of personal preferences and readability.

I would therefore recommend that you first learn some basics, then you can move on to DB queries, see e.g. here: Python and MySQL Database: A Practical Introduction – Real Python.

visitor123 · April 18, 2024, 12:57pm

Thank you. I sticked to pandas. I gonna do analysis. The pandas is that LLM selected. I originally used pymysql but it was not able to work with dataframes. I want to learn this. But I learn in different way. This is something I am sure and am aware of very well, confirmed many years ago. My mind just doesn't remember such huge about of information, so the way the LLM does it is just fine. Except I need to correct it many times but still this learning is much more effective for me because if I would work more then 14 days on a project I would completely lost a will to learn anything. Well, LLM is great for this. In fact, I wouldn't be linux user if I would not use it. Only because of it I am writing to this site. The change ways 04/2023 when I first started learning linux few tries years earlier. The great think on LLM is it can give code immediately so one can do the job now, and not one week later.

The bad thing of LLM is that it really doesn't know how to implement sqlalchemy correctly.

bendingrodriguez · April 19, 2024, 4:27pm

pymysql returns rows as lists/tuples or dicts, as said. Of course you can use that data structures with pandas in both directions:

# tuple to df
row = ("a", "b")
df = DataFrame(row)
# and vice versa (a bit ugly)
print(tuple(df.to_numpy().flatten()) == row)
# dict to df
row = {"col0": "a", "col1": "b"}
df = DataFrame(row.items())
# and vice versa (not so ugly)
print(dict(df.values) == row)

Well, a LLM doesn't understand anything, obviously. It just has to make sure that its non-deterministic output looks plausible.

visitor123 · April 23, 2024, 9:35pm

I am close to finish my program. I was making it OOP because the things I needed to gather - I need to be perfectly sure I collect the right information. So it takes some time. In meantime I was a little bit bored because it took so long to complete my objects.