awk problem

kevintse · May 13, 2010, 8:18am

Hi, everyone.
I have a file called data.txt that looks like this:

list1 A
list1 B
list1 C
list2 A
list2 B
list2 F
list2 H
list3 A
list3 B
list3 D
list4 A
list4 B
list5 H
list5 F

col 1 represents a virtual folder
col 2 represents the name of a book
"list2 A" means book A is added to list2

I want to find books that are in virtual folders that contain book A. the books found should be sorted in descending order by their appearing frequency. and A is not included in the result.

Thank you!

radoulov · May 13, 2010, 9:08am

Re-opened as per OP request (not a homework assignment ...).

---------- Post updated at 03:08 PM ---------- Previous update was at 02:29 PM ----------

In a private conversation the OP explained that this is not a homework and that the original data is inside MySQL database.

Assuming the following data:

mysql> desc t;
+-------+-------------+------+-----+---------+-------+
| Field | Type        | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+-------+
| vf    | varchar(10) | YES  |     | NULL    |       |
| b     | varchar(10) | YES  |     | NULL    |       |
+-------+-------------+------+-----+---------+-------+
2 rows in set (0.00 sec)

mysql> select * from t;
+-------+------+
| vf    | b    |
+-------+------+
| list1 | A    |
| list1 | B    |
| list1 | C    |
| list2 | A    |
| list2 | B    |
| list2 | F    |
| list2 | H    |
| list3 | A    |
| list3 | B    |
| list3 | D    |
| list4 | A    |
| list4 | B    |
| list5 | H    |
| list5 | F    |
+-------+------+
14 rows in set (0.00 sec)

If I'm not missing something, this should give you the desired output:

select a.b from t a, t b
where b.b = 'A'
and b.vf = a.vf
and a.b != 'A'
group by a.b
order by count(1) desc;

mysql> select a.b from t a, t b
    -> where b.b = 'A'
    -> and b.vf = a.vf
    -> and a.b != 'A'
    -> group by a.b
    -> order by count(1) desc;
+------+
| b    |
+------+
| B    |
| C    |
| F    |
| H    |
| D    |
+------+
5 rows in set (0.00 sec)

kevintse · May 13, 2010, 9:17am

OK! I think I didn't put it right in the private conversation.
I can get the result I want using SQL, but it is not efficient. you use "!=" in your select statement, that would kill the performance and it is definitely not possible to put this code into production use. I did say "we have a large table", I am sorry I didn't tell you how large it is. it is 50 million and it is growing!

I bet AWK can do much well here than SQL.
Thank you for your reply!

radoulov · May 13, 2010, 9:19am

Why do you think that the != operator would impact the performance ... ?

It has nothing to do with it ....

If you index the table correctly, the SQL solution will be the fastest.

Just my 0.02� ...

kevintse · May 13, 2010, 9:30am

You didn't test your SQL with a table as large as 50 million rows. well you don't need to, but trust me, the time consumed for this kind of SQL to be executed is intolerable. I have tested a SQL like this "SELECT b, count(*) total FROM t GROUP BY vf ORDER BY total DESC" against a 50 million table, it took me 19 seconds.

radoulov · May 13, 2010, 9:36am

The performance of such statements depends on many factors (the most important being the indexes).

Just for fun, this is an awk/sort/cut solution:

awk 'END { for (B in b) print b, B }
NR == FNR { $2 == "A" && vf[$1]; next }
$2 != "A" && $1 in vf { b[$2]++ }
BEGIN { ARGV[ARGC++] = ARGV[ARGC-1] }
' infile |
  sort -nr |
    cut -d\  -f2

You could add a final alphabetical ordering, if you wish.

Use gawk, nawk or /usr/xpg4/bin/awk on Solaris.

Anyway, you've been warned ...

kevintse · May 14, 2010, 2:53am

radoulov:

Just for fun, this is an awk/sort/cut solution:
awk 'END { for (B in b) print b, B }
NR == FNR { $2 == "A" && vf[$1]; next }
$2 != "A" && $1 in vf { b[$2]++ }
BEGIN { ARGV[ARGC++] = ARGV[ARGC-1] }
' infile |
  sort -nr |
   cut -d\  -f2  
You could add a final alphabetical ordering, if you wish.

Hi, radoulov
Today I have taken a look at an awk manual. I learned something, but I still can't quite understand the script you offered. Could you please explain it a little bit for me.

Why do you put "ARGV[ARGC++] = ARGV[ARGC-1]" in "BEGIN"?
And this line "NR == FNR { $2 == "A" && vf[$1]; next }". since there's only one input file, "NR == FNR" should always be true, right? so why it is necessary here? and what does "vf[$1]" mean(I guess it is collecting "virtual folders", but why it is written that way?)?

---------- Post updated 05-14-10 at 01:53 AM ---------- Previous update was 05-13-10 at 11:06 PM ----------

Cool! I have figured it out myself!

"ARGV[ARGC++] = ARGV[ARGC-1]" makes AWK process the input file twice.
"NR == FNR { $2 == "A" && vf[$1]; next }" is like "NR == FNR { if ($2 == "A") vf[$1]; next }" and "next" is like "continue" in the C programming language. "NR == FNR" means that only when the first time the input file is processed will "$1" be put in the "vf" array.

DONE!
Thank you again for the code.

radoulov · May 14, 2010, 2:54am

Correct