Select record with MAX value in MySQL

Hi there,

I'm having trouble selecting the record that contains the biggest value for each group. I mean, this is my table:

mysql> SELECT * FROM b;
+----+------+-------+
| id | user | value |
+----+------+-------+
|  1 |    1 |   100 |
|  3 |    1 |   150 |
|  5 |    1 |   300 |
|  6 |    2 |    10 |
|  7 |    2 |    80 |
|  8 |    2 |   880 |
+----+------+-------+

I want to SELECT one line per user: the one with the biggest value (here, the rows with id 5 and 8).
I tried:

mysql> SELECT id, user, MAX(value) FROM b GROUP BY user;
+----+------+------------+
| id | user | MAX(value) |
+----+------+------------+
|  1 |    1 |        300 |
|  6 |    2 |        880 |
+----+------+------------+

But as you can see, it is wrong: it does return the biggest value per user, but not the corresponding record (the id comes from an arbitrary row of the group).
How can I do that?
Thanks for your help
Santiago

This is an expected behavior (see this article).

You could use something like this:

select b.id, b.user, b.value 
from b, (
  select user, max(value) value 
  from b
  group by user
  ) m
where b.value = m.value
and b.user = m.user;  
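
For readability, the same idea can be written with explicit JOIN syntax (just a sketch, equivalent to the comma join above). Note that if two rows share the same maximum value for a user, both are returned:

select b.id, b.user, b.value
from b
join (
  select user, max(value) value   -- one row per user, holding its maximum
  from b
  group by user
) m on b.user = m.user
   and b.value = m.value;         -- keep only the row(s) that carry that maximum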

Thanks radoulov,
It takes a while to process the nested query with 350,000 records, but it looks like this is the only way.

How did you define the indexes on the table?
Could you also post the explain [plan] of the statement?

Thanks radoulov for your interest.
As you can imagine, my real problem is far more complex than a table called b with just 2 fields.

In the real world, I have a big table describing a huge number of uploaded files. All files were archived (tgz) and then uploaded, so I have a directory loaded with tons of tgz files. Some of those tgz files actually contain the same file, just uploaded at two different times. I will keep all files to be able to recover old versions, but I'm first interested in the list of files (not the tgz files, but the files they contain) in their latest state (mod, owner, group). I hope this is clear enough.

This is my real table:

mysql> DESCRIBE uploaded;
+------------+------------------+------+-----+---------+----------------------------------------------+
| Field      | Type             | Null | Key | Default | Extra                                        |
+------------+------------------+------+-----+---------+----------------------------------------------+
| id         | int(10) unsigned | NO   | PRI | NULL    | auto_increment                               |
| tgz_path   | varchar(512)     | NO   |     | NULL    | path to a tgz file                           |
| file_mod   | varchar(9)       | YES  |     | NULL    | mod of the file inside the tgz               |
| file_owner | varchar(32)      | YES  |     | NULL    | owner of the file inside the tgz             |
| file_group | varchar(32)      | YES  |     | NULL    | group of the file inside the tgz             |
| file_time  | datetime         | YES  |     | NULL    | modification time of the file inside the tgz |
| file_name  | varchar(512)     | NO   |     | NULL    | path of the file inside the tgz              |
+------------+------------------+------+-----+---------+----------------------------------------------+

And this is a sample of what it contains:

+----+---------------------+-----------+------------+------------+---------------------+--------------------------+
| id | tgz_path            | file_mod  | file_owner | file_group | file_time           | file_name                |
+----+---------------------+-----------+------------+------------+---------------------+--------------------------+
|  1 | /home/ftp/18027.tgz | rw-r----- | root       | root       | 2010-06-28 10:11:51 | /home/santiago/doc.txt   |
|  2 | /home/ftp/8400.tgz  | rw-r--r-- | santiago   | santiago   | 2010-06-29 17:37:29 | /home/santiago/doc.txt   |
|  3 | /home/ftp/18695.tgz | rw-rw-r-- | root       | root       | 2010-06-27 11:42:56 | /usr/local/bin/script.sh |
|  4 | /home/ftp/27750.tgz | rwxr-x--- | foo        | foo        | 2010-06-29 00:53:32 | /usr/local/bin/script.sh |
|  5 | /home/ftp/4333.tgz  | rwxrwxr-x | foo        | foo        | 2010-06-28 09:17:14 | /usr/local/bin/script.sh |
+----+---------------------+-----------+------------+------------+---------------------+--------------------------+

As you can see, files have changed mod and owner over time. The latest record per file is the one I'm after (here, ids 2 and 4).

Then I have two other tables: files and versions

mysql> DESCRIBE files;
+------------+------------------+------+-----+---------+----------------+
| Field      | Type             | Null | Key | Default | Extra          |
+------------+------------------+------+-----+---------+----------------+
| id         | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| file_mod   | varchar(9)       | NO   |     | NULL    |                |
| file_owner | varchar(32)      | NO   |     | NULL    |                |
| file_group | varchar(32)      | NO   |     | NULL    |                |
| file_time  | datetime         | NO   |     | NULL    |                |
| file_name  | varchar(512)     | NO   |     | NULL    |                |
+------------+------------------+------+-----+---------+----------------+
mysql> DESCRIBE versions;
+-----------+------------------+------+-----+---------+----------------+
| Field     | Type             | Null | Key | Default | Extra          |
+-----------+------------------+------+-----+---------+----------------+
| id        | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| file_id   | int(10) unsigned | NO   |     | NULL    |                |
| file_time | datetime         | NO   |     | NULL    |                |
| path      | varchar(512)     | NO   |     | NULL    |                |
+-----------+------------------+------+-----+---------+----------------+

I want to save in the table files the latest record per file. I first INSERT the new records:

mysql> INSERT IGNORE INTO files (file_mod, file_owner, file_group, file_time, file_name)
SELECT file_mod, file_owner, file_group, file_time, file_name
FROM uploaded
GROUP BY file_name;

and then UPDATE those records with the latest information. That's where my query comes in:

mysql> UPDATE files JOIN (
    SELECT uploaded.file_mod, uploaded.file_owner, uploaded.file_group, uploaded.file_time, uploaded.file_name
    FROM uploaded
    JOIN (SELECT MAX(file_time) AS file_time, file_name
          FROM uploaded
          GROUP BY file_name) AS max_uploaded ON uploaded.file_time = max_uploaded.file_time AND uploaded.file_name = max_uploaded.file_name
    GROUP BY uploaded.file_time, uploaded.file_name) AS uploaded ON files.file_name = uploaded.file_name
SET files.is_folder  = uploaded.is_folder,
    files.file_mod   = uploaded.file_mod,
    files.file_owner = uploaded.file_owner,
    files.file_group = uploaded.file_group,
    files.file_time  = uploaded.file_time
WHERE uploaded.file_time > files.file_time;

That's it! Ask me if it's too messy.
Now you're right: adding some well-chosen indexes will probably increase the speed of the query dramatically. What do you think?

From what I see, I'd put an index at least on uploaded.file_name and uploaded.file_time, and maybe files.file_name. You could run an EXPLAIN on the SELECT part, as that should show the biggest bottlenecks (probably the JOIN ON).
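
A sketch of what those indexes could look like (the 255-character prefixes are an assumption to stay under the index key-length limit; with a single-byte charset the full columns could be indexed):

CREATE INDEX ix_uploaded_name_time ON uploaded (file_name(255), file_time);
CREATE INDEX ix_files_name         ON files    (file_name(255));

Comparing the EXPLAIN output before and after should show whether the derived-table JOIN actually uses them.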

Thanks for the advice.
You seem pretty comfortable with MySQL.
Could you answer this question?
If I create a unique index based on two columns:

ALTER TABLE t ADD UNIQUE INDEX unique_key (f1, f2)

Does it consequently index column f1 and column f2, or do I have to do it explicitly:

ALTER TABLE t ADD UNIQUE INDEX unique_key (f1, f2), ADD INDEX (f1), ADD INDEX (f2)

It's been a while since I've spoken MySQL... but I'd do it this way

CREATE UNIQUE INDEX IX_uploaded_name_time ON uploaded ( file_name, file_time )

But the ALTER TABLE syntax (the first one) should get you the same result. If not, you can always drop and re-create the index.
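
For what it's worth, a composite index on (f1, f2) can also serve lookups on f1 alone (the leftmost prefix of the index), but generally not lookups on f2 alone, so an extra single-column index is usually only needed on f2, and only if you filter on f2 without f1. A quick way to check (a sketch on a hypothetical table t):

ALTER TABLE t ADD UNIQUE INDEX unique_key (f1, f2);
EXPLAIN SELECT * FROM t WHERE f1 = 1;             -- can use unique_key (leftmost prefix)
EXPLAIN SELECT * FROM t WHERE f1 = 1 AND f2 = 2;  -- can use unique_key
EXPLAIN SELECT * FROM t WHERE f2 = 2;             -- generally cannot; may need ADD INDEX (f2)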

It would be interesting, though, to see a comparison of the execution timing and EXPLAIN plan with and without the indices.

And if you want to optimize further, here's an article on how to use the MySQL profiler.
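
The basic usage looks roughly like this (a sketch, assuming a MySQL build with profiling support):

SET profiling = 1;           -- enable profiling for the current session
-- run the statement you want to analyse here
SHOW PROFILES;               -- list the profiled statements with their query ids
SHOW PROFILE FOR QUERY 1;    -- per-stage timings for statement number 1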

Well,
first of all, I think, you should rethink the relational design.

Thanks radoulov for this note, but could you be more precise or at least give an example?

OK, for example, you could try to normalize the data structures, at least to first normal form:

You should take the owners and the groups out into separate structures, so you'll end up with:

users
  uid pk
  username

groups
  gid pk
  group

May one archive contain more than one file? So:

paths
  path_id pk
  path

And so on... The permission mask is also a candidate, because of the repeating data.

Then you could create the files table with all attributes referenced by their respective ids.

Why do you want separate tables for files and versions? You could have all the information in one table: filename, path_id, uid, gid, mod_time, etc.

If you need to know the most recent versions at any time, you could create a view that selects them.
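
A minimal sketch of such a view, assuming a single table holding every uploaded version (the file_versions name and its columns are placeholders); a correlated subquery is used because older MySQL versions don't allow a derived table inside a view:

create view latest_versions as
select v.*
from file_versions v
where v.file_time = (
  select max(file_time)           -- most recent upload of this particular file
  from file_versions
  where file_name = v.file_name
);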

Just my 0.02 €.

Thanks radoulov, I got your point now.
I decided on purpose to allow the table uploaded to contain a lot of redundant data.
Why?
Data is meant to pass through the table uploaded, not to stay in it. Once a line is added to the table, it goes through various processes and is then deleted. It would take ages to explain it all, but here are the basics:

  • proftpd adds a row each time a file is uploaded
  • some processes work line by line, like the one that extracts information from the tgz, because they involve working with the filesystem
  • some processes work as updates of numerous lines at the same time, like those that retrieve file_id, file_owner_id, file_mod_id, etc.
  • and there's more...
  • until the entry is deleted

So basically, there is a process that updates lines one by one with as few operations as I can manage. Actually I just parse tar -tzvvf "$path".
Then the other processes can work much faster by adding the unknown file_owner and file_mod values and then updating the whole table at once with the corresponding ids.
This view might help you understand:

+----+---------------------+-----------+--------------+------------+----------------+------------+----------------+---------------------+--------------------------+---------+--------+
| id | tgz_path            | file_mod  | file_mod_id  | file_owner | file_owner_id  | file_group | file_group_id  | file_time           | file_name                | file_id | status |
+----+---------------------+-----------+--------------+------------+----------------+------------+----------------+---------------------+--------------------------+---------+--------+
|  1 | /home/ftp/18027.tgz | rw-r----- |           56 | root       |             34 | root       |             34 | 2010-06-28 10:11:51 | /home/santiago/doc.txt   |   94516 |      2 |
|  2 | /home/ftp/8400.tgz  | rw-r--r-- |           12 | santiago   |              3 | santiago   |              3 | 2010-06-29 17:37:29 | /home/santiago/doc.txt   |   94516 |      2 |
|  3 | /home/ftp/18695.tgz | rw-rw-r-- |              | root       |                | root       |                | 2010-06-27 11:42:56 | /usr/local/bin/script.sh |         |      1 |
|  4 | /home/ftp/27750.tgz | rwxr-x--- |              | foo        |                | foo        |                | 2010-06-29 00:53:32 | /usr/local/bin/script.sh |         |      1 |
|  5 | /home/ftp/4333.tgz  |           |              |            |                |            |                |                     |                          |         |      0 |
+----+---------------------+-----------+--------------+------------+----------------+------------+----------------+---------------------+--------------------------+---------+--------+
  • line 5 has just been added by proftpd; it has gone through no processes yet (status 0)
  • lines 3 and 4 have gone through the process that reads info from the tgz files (status 1)
  • lines 1 and 2 have gone through the process that retrieves the ids (status 2)
  • the next process adds the record to the table versions and deletes it from uploaded
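
Each pass can then pick up its pending rows by status, roughly like this (a sketch; the status values follow the list above, but the queries themselves are assumptions):

-- the tgz reader picks up the rows proftpd just inserted
SELECT id, tgz_path FROM uploaded WHERE status = 0;
-- parse tar -tzvvf on each tgz_path, then for each row, e.g. id 5:
UPDATE uploaded
SET file_mod = 'rwxrwxr-x', file_owner = 'foo', file_group = 'foo',
    file_time = '2010-06-28 09:17:14', file_name = '/usr/local/bin/script.sh',
    status = 1
WHERE id = 5;
-- the id-resolving pass later handles every row with status = 1 in one go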

To finish with your notes: I decided to separate the tables files and versions for exactly the reason you mentioned. If it was all in one table, the field file_path (255 chars) would be repeated very often (because a lot of files exist in many versions).
The whole system is a backup facility. The table files records the address of each file with its latest mod/owner/group information (because one should not be able to retrieve an old version merely because he once had access to the file). The table versions records where each version of each file is located; there is no need to repeat file_path, file_mod and file_owner there.

Your remarks are a great help to me; don't hesitate to follow up on what I said.

I understand,
why don't you just update the files table at the end of each file processing? A single record update vs scheduled mass update.

P.S. If I understand correctly, the update of the latest time in the files table is the slowest part, right?

Well, my idea is that if I retrieved the ids one by one, I would run millions of queries:

for each record; do
    UPDATE uploaded with tar -tzvvf "$path" WHERE id = $id
    INSERT IGNORE INTO mods (file_mod) VALUES (the current mod)
    INSERT IGNORE INTO owner (file_owner) VALUES (the current owner)
    INSERT IGNORE INTO files (file_path) VALUES (the current path)
    UPDATE uploaded JOIN mods, owner, files SET mod_id = ..., owner_id = ... WHERE id = $id
done

This is not a scripting language, it's just an explanation of what it does.
This means I would run 5 queries for each record.
Instead, I do that:

for each record; do
    UPDATE uploaded with tar -tzvvf "$path" WHERE id = $id
done
# once it's all done
INSERT IGNORE INTO mods (file_mod) SELECT DISTINCT file_mod FROM uploaded
INSERT IGNORE INTO owner (file_owner) SELECT DISTINCT file_owner FROM uploaded
INSERT IGNORE INTO files (file_path) SELECT DISTINCT file_path FROM uploaded
UPDATE uploaded JOIN mods, owner, files SET mod_id = ..., owner_id = ...

Which means I run a fifth as many queries per record, plus 4 at the end.
Does it make sense?
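
Spelled out, that final mass update might look roughly like this (a sketch; the mods and owner tables and their id columns are assumptions based on the pseudocode above, the uploaded columns follow the earlier sample):

UPDATE uploaded u
JOIN mods  m ON m.file_mod   = u.file_mod
JOIN owner o ON o.file_owner = u.file_owner
JOIN files f ON f.file_name  = u.file_name
SET u.file_mod_id   = m.id,
    u.file_owner_id = o.id,
    u.file_id       = f.id,
    u.status        = 2
WHERE u.status = 1;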

I guess I don't understand ...
I mean, at this stage:

change it to:

The last process:
o INSERT or UPDATE the table files (with latest mod/owner) that was my problem and the reason why I started this thread
o INSERT into the table versions
o DELETE from table uploaded

I guessed right :)
My idea was: if you have this information earlier (status 1 and 2),
just propagate it as soon as you can.

Currently, I run 2 queries per record and 24 general queries.

Are you telling me I should process lines one by one for every operation, which means running 26 queries per record?

With an average of 50 files received per minute (so about 250 records every 5 minutes), do you confirm I'd better run
6,500 queries every 5 minutes (26 × 250)
rather than
524 queries every 5 minutes (2 × 250 + 24)?

No,
or better still, it depends.

I thought that you were processing the data record by record and at the end you were mass updating the modification time in the files table.

If the modification time update cannot be placed in those one-by-one queries, it's probably better to avoid the approach I suggested.

But as I said it depends.

Initially the problem was the wrong result (because of the partial GROUP BY handling that MySQL supports); after that, the statement was fixed.

Now it's about performance.

If the elapsed time of the massive update (single statement -> many rows) is critical, even running more transactions spread over time (row by row, 100 statements for 100 rows) could make sense.

Hope this helps.

Time is indeed an issue, given the fast-growing number of uploaded files.
I isolated one operation: retrieving the file id for 300,000 records.
With a massive update, it takes about 75 seconds.
With a one-by-one update, I started clocking more than an hour ago and it's still not finished.

Anyway, thank you very much for helping me go deeper in the structure of my database. It was very interesting.

Cheers
Santiago

---------- Post updated at 15:11 ---------- Previous update was at 15:06 ----------

Usually there's a way to reward people who help with bits.
I can't find the button anymore. Do you know where it is?

---------- Post updated at 15:24 ---------- Previous update was at 15:11 ----------

I stopped the query because I needed my CPU ;o)
Extrapolating from the number of records already processed, I estimate that the one-by-one update takes around 2,000 times longer than the massive update.

---------- Post updated at 15:26 ---------- Previous update was at 15:24 ----------

I forgot to mention that I think the incredible difference is certainly due to the well-chosen indexes you told me to create.