Hi, everyone.
I have a file called data.txt that looks like this:
list1 A
list1 B
list1 C
list2 A
list2 B
list2 F
list2 H
list3 A
list3 B
list3 D
list4 A
list4 B
list5 H
list5 F
col 1 represents a virtual folder
col 2 represents the name of a book
"list2 A" means book A is added to list2
I want to find books that are in virtual folders that contain book A. the books found should be sorted in descending order by their appearing frequency. and A is not included in the result.
Re-opened as per OP request (not a homework assignment ...).
---------- Post updated at 03:08 PM ---------- Previous update was at 02:29 PM ----------
In a private conversation the OP explained that this is not a homework and that the original data is inside MySQL database.
Assuming the following data:
mysql> desc t;
+-------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+-------+
| vf | varchar(10) | YES | | NULL | |
| b | varchar(10) | YES | | NULL | |
+-------+-------------+------+-----+---------+-------+
2 rows in set (0.00 sec)
mysql> select * from t;
+-------+------+
| vf | b |
+-------+------+
| list1 | A |
| list1 | B |
| list1 | C |
| list2 | A |
| list2 | B |
| list2 | F |
| list2 | H |
| list3 | A |
| list3 | B |
| list3 | D |
| list4 | A |
| list4 | B |
| list5 | H |
| list5 | F |
+-------+------+
14 rows in set (0.00 sec)
If I'm not missing something, this should give you the desired output:
select a.b from t a, t b
where b.b = 'A'
and b.vf = a.vf
and a.b != 'A'
group by a.b
order by count(1) desc;
mysql> select a.b from t a, t b
-> where b.b = 'A'
-> and b.vf = a.vf
-> and a.b != 'A'
-> group by a.b
-> order by count(1) desc;
+------+
| b |
+------+
| B |
| C |
| F |
| H |
| D |
+------+
5 rows in set (0.00 sec)
OK! I think I didn't put it right in the private conversation.
I can get the result I want using SQL, but it is not efficient. you use "!=" in your select statement, that would kill the performance and it is definitely not possible to put this code into production use. I did say "we have a large table", I am sorry I didn't tell you how large it is. it is 50 million and it is growing!
I bet AWK can do much well here than SQL.
Thank you for your reply!
You didn't test your SQL with a table as large as 50 million rows. well you don't need to, but trust me, the time consumed for this kind of SQL to be executed is intolerable. I have tested a SQL like this "SELECT b, count(*) total FROM t GROUP BY vf ORDER BY total DESC" against a 50 million table, it took me 19 seconds.
The performance of such statements depends on many factors (the most important being the indexes).
Just for fun, this is an awk/sort/cut solution:
awk 'END { for (B in b) print b, B }
NR == FNR { $2 == "A" && vf[$1]; next }
$2 != "A" && $1 in vf { b[$2]++ }
BEGIN { ARGV[ARGC++] = ARGV[ARGC-1] }
' infile |
sort -nr |
cut -d\ -f2
You could add a final alphabetical ordering, if you wish.
Hi, radoulov
Today I have taken a look at an awk manual. I learned something, but I still can't quite understand the script you offered. Could you please explain it a little bit for me.
Why do you put "ARGV[ARGC++] = ARGV[ARGC-1]" in "BEGIN"?
And this line "NR == FNR { $2 == "A" && vf[$1]; next }". since there's only one input file, "NR == FNR" should always be true, right? so why it is necessary here? and what does "vf[$1]" mean(I guess it is collecting "virtual folders", but why it is written that way?)?
---------- Post updated 05-14-10 at 01:53 AM ---------- Previous update was 05-13-10 at 11:06 PM ----------
Cool! I have figured it out myself!
"ARGV[ARGC++] = ARGV[ARGC-1]" makes AWK process the input file twice.
"NR == FNR { $2 == "A" && vf[$1]; next }" is like "NR == FNR { if ($2 == "A") vf[$1]; next }" and "next" is like "continue" in the C programming language. "NR == FNR" means that only when the first time the input file is processed will "$1" be put in the "vf" array.