Selecting random, weighted records from MySQL
Your web application has become big and evil enough to display marketing messages. You want to show your users a message whenever they login. However the Marketing department gave you 3 different messages and wants you to show them at a 70%-20%-10% ratio, meaning:70% of users will see message 1, 20% will see message 2 and the remainder (10%) will see message 3.
You’ve also been tasked to create the back-end of the message distribution and want to create the simplest query that addresses the Marketing requirements. And off you go:
mysql> describe foo; +------------+---------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +------------+---------+------+-----+---------+-------+ | id | int(11) | NO | | NULL | | | grp | int(11) | YES | | NULL | | | likelihood | int(11) | YES | | NULL | | | message | text | YES | | NULL | | +------------+---------+------+-----+---------+-------+ 4 rows in set (0.00 sec)
This is your table that holds the messages. They are grouped by ‘grp’ and the ‘likelihood’ is expressed in numbers from 0-100 (0=message is not shown at all, 100=message is shown all the time).
You add some test records:
mysql> select * from foo order by id; +----+------+------------+---------------------+ | id | grp | likelihood | message | +----+------+------------+---------------------+ | 1 | 1 | 70 | Marketing Message 1 | | 2 | 1 | 20 | Marketing Message 2 | | 3 | 1 | 10 | Marketing Message 3 | | 4 | 2 | 90 | Sale Offer 1 | | 5 | 2 | 10 | Sale Offer 2 | +----+------+------------+---------------------+ 5 rows in set (0.00 sec)
And you craft the query that selects a weighted, random message from your table:
mysql> select message from foo where grp=2 order by likelihood * rand() desc limit 1; +--------------+ | message | +--------------+ | Sale Offer 1 | +--------------+ 1 row in set (0.00 sec)
You implement it and get a raise for being a smart guy. And they lived happily ever after …
November 7th, 2008 at 1:15 am
good thanks a lot
November 18th, 2008 at 11:34 am
This approach has a problem in that the output doesn’t match the expected frequencies. For example, you 90/10 split will actually occur about 95%/5%. Once you get three or more candidate rows, the numbers skew even more. I don’t know enough statistics to explain, just using empirical data. For example, try (1, 2, 5, 12), you would expect the first to occur 5%, but it’s like 0.5% or less.
If someone is paying for ad time, they are going to be peeved if the frequencies don’t come out right.
I’ve seen complex and statistically correct solutions involving summing all the weights and picking a random value in between, but I’m hoping there’s a way to fix the weighting formula (e.g. rand() / likelihood ASC – gives better results).
November 19th, 2008 at 4:11 am
Formula is:
ORDER BY -LOG(1.0 – RAND()) / likelihood
For safety use likelihood > 0.0 in WHERE clause.
See http://use.perl.org/~bart/journal/33630 for excellent analysis by someone way smarter than me.
It works on all distributions I’ve tested.
November 19th, 2008 at 2:00 pm
Buddy – thanks for the update. I’ll reference your comment in the original text.
Cheers – Tobias
March 28th, 2014 at 11:11 pm
[…] got this code from a brilliant comment here, and it’s a simple formula that allows you to randomly select from a database. The difference […]
August 29th, 2015 at 8:08 pm
kredyty pozabankowe
Selecting random, weighted records from MySQL | kahunaburger
April 12th, 2016 at 6:45 pm
I would like to point out the fact that Stack Exchange has a question which is pointing to this topic, but they “fixed” the correct solution which is ORDER BY -LOG(1.0 – RAND()) / likelihood by replacing the -LOG(1.0 – RAND()) with -LOG(RAND()) because some brilliant “contributor” figured out 1.0 – RAND() is the same as RAND(). That is WRONG. The CORRECT solution is 1.0 – RAND() otherwise you introduce the possibility of running LOG(0) since RAND() produces a value between 0 (inclusively) and 1 (exclusively), hence the need for 1.0 – RAND(). I tried correcting the SO answer but the “moderators” are preventing me from doing so. So anyone coming here from SO (as I did), make sure you use the correct solution.