

Work Log

Finally seem to have found another bottleneck.  In the calculation of the Pearson correlation, getItemsNumber was called each time, and it calculated the total number of distinct items by doing a full database scan (which was rather heavy).  Cached the number in the dataModel (since the number of items does not change).  Now the cross validation takes only 1,500 ms, compared to the original 150,000 ms (after optimizing a number of different things: neighborhood construction, correlation caching, etc.).  That's a nice 100x speedup.  I am pretty happy about that.  So no need to optimize this part of cross validation any longer.
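
A minimal sketch of the caching idea, in Python rather than the project's Java.  The class and method names (CachingDataModel, SlowModel, get_items_number) are made up for illustration and are not the Taste API; the only assumption carried over from the note is that the set of items does not change during a run.

```python
class CachingDataModel:
    """Wraps a data model and caches its distinct-item count.

    Assumption: the item set never changes during a run, so the
    count only has to be computed once.
    """

    def __init__(self, delegate):
        self._delegate = delegate
        self._items_number = None  # not computed yet

    def get_items_number(self):
        # Fall through to the expensive scan only on the first call.
        if self._items_number is None:
            self._items_number = self._delegate.get_items_number()
        return self._items_number


class SlowModel:
    """Hypothetical stand-in for a DB-backed model; counts its full scans."""

    def __init__(self, preferences):
        self.preferences = preferences  # list of (user_id, item_id, value)
        self.scans = 0

    def get_items_number(self):
        self.scans += 1
        return len({item for _, item, _ in self.preferences})
```

If items could be added mid-run, the cached value would have to be reset whenever that happens.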



Open Source Funding Model

Recently I was thinking about possibilities for funding an open source project.  One approach I came up with, which could be an addition to the standard funding models, is (let's call it) delayware.  I did a quick search on Google to see if somebody has already thought of that, and that does not seem to be the case (or I am using the wrong keywords to search).

The essence of the idea is the following:

1) Paying customers get the newest version of the product

2) Non-paying or non-contributing customers get the previous version of the product.



copying bookmarks from to diigo
keywords: diigo importing exporting migrating bookmarks

My new favourite application for bookmarking is diigo.  So I decided to copy my bookmarks from

Go to:
make sure to check: include my tags, include my notes

Then go to:
you may want to add an optional tag e.g. delicious
and you can choose the privacy level as well.

Worked like a charm.

To take full advantage of diigo, make sure to download the toolbar:


Work Log

This whole week I am working on preparing the presentation.

FON Admin

fon ip

user name (username): admin

password: admin


status page

Work Log



Optimized correlation caching

before cache miss 0.04% -> now 0.02%

Nonetheless it takes 80% of total time

TODO: This is the place for optimization
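
A rough sketch of a correlation cache that also tracks its own miss rate, which makes before/after numbers like the ones above easy to collect.  It assumes the correlation is symmetric (true for Pearson), so (a, b) and (b, a) share one entry.  The names CorrelationCache and correlation_fn are hypothetical, not taken from the project code.

```python
class CorrelationCache:
    """Caches a symmetric user-user correlation and counts hits and misses."""

    def __init__(self, correlation_fn):
        self._fn = correlation_fn  # hypothetical: (user_a, user_b) -> float
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def get(self, user_a, user_b):
        # Normalize the key so both argument orders hit the same entry.
        key = (user_a, user_b) if user_a <= user_b else (user_b, user_a)
        if key in self._cache:
            self.hits += 1
        else:
            self.misses += 1
            self._cache[key] = self._fn(*key)
        return self._cache[key]

    def invalidate_user(self, user):
        # Drop every entry involving this user, e.g. after a new rating.
        for key in [k for k in self._cache if user in k]:
            del self._cache[key]

    def miss_rate(self):
        total = self.hits + self.misses
        return self.misses / total if total else 0.0
```

With a miss rate this low, the remaining cost is probably in the lookups themselves or in the few misses that do happen, so profiling inside the correlation function would be the next thing to try.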


After the optimizations, constructing the user neighborhood now only takes 10% of the time, compared to 80% before.

80% of time is now taken by calculating user correlation

 After correlation optimizations



Investigating why some parts take too long

system running...
Apr 9, 2008 1:12:17 AM <init>
INFO: log running..
Apr 9, 2008 1:12:17 AM <init>
INFO: {db_file=/home/neil/tmp/ml1m.csv, AL_Method=random, userNum=1, usrNeighborhoodSize=100, itemNum=100, alItemNum=3}
Apr 9, 2008 1:12:17 AM <init>
INFO: Creating db indexes ...
Apr 9, 2008 1:12:17 AM <init>
INFO: loading data ...
Apr 9, 2008 1:12:58 AM <init>
INFO: loaded data
Apr 9, 2008 1:12:59 AM evaluate
INFO: performing evaluation
Apr 9, 2008 1:12:59 AM com.planetj.taste.impl.neighborhood.CachingALUserNeighborhood init
INFO: Cache cleared.
Apr 9, 2008 1:14:06 AM getTestUsers
INFO: Selected userID: 4257
Apr 9, 2008 1:14:06 AM evaluate
INFO: TestUsers: [User[id:4257]]
Apr 9, 2008 1:14:06 AM evaluate
INFO: AL Type: random
Apr 9, 2008 1:14:06 AM com.planetj.taste.impl.neighborhood.CachingALUserNeighborhood init
INFO: Cache cleared.
Apr 9, 2008 1:14:06 AM com.planetj.taste.impl.recommender.VOIUserBasedRecommender refresh
INFO: Refreshing neighborhood, correlation, ...
Apr 9, 2008 1:14:06 AM com.planetj.taste.impl.neighborhood.CachingALUserNeighborhood init
INFO: Cache cleared.
Apr 9, 2008 1:14:06 AM com.planetj.taste.impl.neighborhood.CachingALUserNeighborhood getUserNeighborhood
INFO: cache is not being used
Apr 9, 2008 1:14:06 AM com.planetj.taste.impl.model.jdbc.H2JDBCDataModel getRelatedUsers
INFO: Starting to retrieve user neighborhood ...
Apr 9, 2008 1:14:06 AM com.planetj.taste.impl.model.jdbc.H2JDBCDataModel getRelatedUsers
FINE: SELECT DISTINCT user_id FROM taste_preferences WHERE  (item_id=1059 OR item_id=1088 OR item_id=1094 OR item_id=11 OR item_id=1100 OR item_id=1101 OR item_id=1197 OR item_id=1247 OR item_id=1265 OR item_id=1290 OR item_id=1307 OR item_id=1339 OR item_id=1393 OR item_id=140 OR item_id=1409 OR item_id=1441 OR item_id=151 OR item_id=1513 OR item_id=1541 OR item_id=1556 OR item_id=1629 OR item_id=1658 OR item_id=1674 OR item_id=168 OR item_id=1680 OR item_id=1721 OR item_id=1735 OR item_id=1777 OR item_id=1801 OR item_id=1834 OR item_id=1888 OR item_id=1894 OR item_id=1923 OR item_id=1931 OR item_id=1947 OR item_id=195 OR item_id=1968 OR item_id=207 OR item_id=2100 OR item_id=2145 OR item_id=2146 OR item_id=2166 OR item_id=222 OR item_id=224 OR item_id=2268 OR item_id=2297 OR item_id=2355 OR item_id=236 OR item_id=237 OR item_id=2396 OR item_id=24 OR item_id=2406 OR item_id=2424 OR item_id=2485 OR item_id=249 OR item_id=2490 OR item_id=2491 OR item_id=2496 OR item_id=2497 OR item_id=25 OR item_id=2550 OR item_id=2558 OR item_id=2570 OR item_id=2572 OR item_id=2581 OR item_id=2598 OR item_id=2599 OR item_id=2605 OR item_id=2671 OR item_id=2676 OR item_id=2683 OR item_id=2688 OR item_id=2706 OR item_id=2716 OR item_id=2724 OR item_id=2763 OR item_id=2802 OR item_id=2806 OR item_id=2826 OR item_id=2827 OR item_id=2841 OR item_id=2858 OR item_id=2861 OR item_id=2872 OR item_id=2881 OR item_id=289 OR item_id=2906 OR item_id=2908 OR item_id=2919 OR item_id=293 OR item_id=2941 OR item_id=2942 OR item_id=2961 OR item_id=3004 OR item_id=3082 OR item_id=3113 OR item_id=3147 OR item_id=3186 OR item_id=3225 OR item_id=3244 OR item_id=3259 OR item_id=3261 OR item_id=3269 OR item_id=3301 OR item_id=339 OR item_id=3394 OR item_id=3436 OR item_id=351 OR item_id=353 OR item_id=356 OR item_id=357 OR item_id=3584 OR item_id=361 OR item_id=3668 OR item_id=3699 OR item_id=377 OR item_id=378 OR item_id=3791 OR item_id=3809 OR item_id=3835 OR item_id=3844 OR item_id=39 OR item_id=509 OR 
item_id=539 OR item_id=553 OR item_id=555 OR item_id=587 OR item_id=605 OR item_id=64 OR item_id=708 OR item_id=736 OR item_id=74 OR item_id=804 OR item_id=838 OR item_id=852 OR item_id=898 OR item_id=912 OR item_id=914 OR item_id=969)
Apr 9, 2008 1:14:39 AM com.planetj.taste.impl.model.jdbc.H2JDBCDataModel getRelatedUsers
FINE: query was executed
Apr 9, 2008 1:14:44 AM com.planetj.taste.impl.model.jdbc.H2JDBCDataModel getRelatedUsers
INFO: number of users retrieved: 5885
Apr 9, 2008 1:14:44 AM com.planetj.taste.impl.neighborhood.CachingALUserNeighborhood getUserNeighborhood
INFO: caching: 4257
Apr 9, 2008 1:14:44 AM evaluate
INFO: Simulating offline AL
Apr 9, 2008 1:14:44 AM evaluate
INFO: Obtained AL Preference: GenericPreference[user: User[id:4257], item:Item[id:224], value:2.0]
Apr 9, 2008 1:14:44 AM com.planetj.taste.impl.model.jdbc.H2JDBCDataModel setPreference
FINE: Setting preference for user '4257', item '224', value 2.0
Apr 9, 2008 1:14:44 AM com.planetj.taste.impl.model.jdbc.H2JDBCDataModel setPreference
FINE: Executing SQL update: MERGE INTO taste_preferences(user_id,item_id,preference) key(user_id,item_id) VALUES (?, ?, ?)
Apr 9, 2008 1:14:45 AM evaluate
INFO: User's Stats:
Apr 9, 2008 1:14:45 AM evaluate
INFO: It took ms: 1146
Apr 9, 2008 1:14:45 AM evaluate
INFO: Simulating offline AL
Apr 9, 2008 1:14:45 AM evaluate
INFO: Obtained AL Preference: GenericPreference[user: User[id:4257], item:Item[id:555], value:3.0]
Apr 9, 2008 1:14:45 AM com.planetj.taste.impl.model.jdbc.H2JDBCDataModel setPreference
FINE: Setting preference for user '4257', item '555', value 3.0
Apr 9, 2008 1:14:45 AM com.planetj.taste.impl.model.jdbc.H2JDBCDataModel setPreference
FINE: Executing SQL update: MERGE INTO taste_preferences(user_id,item_id,preference) key(user_id,item_id) VALUES (?, ?, ?)
Apr 9, 2008 1:15:34 AM evaluate
INFO: User's Stats:
[{MAE=4.0}, {MAE=0.8246347}]
Apr 9, 2008 1:15:34 AM evaluate
INFO: It took ms: 48771
Apr 9, 2008 1:15:34 AM evaluate
INFO: Simulating offline AL
Apr 9, 2008 1:15:34 AM evaluate
INFO: Obtained AL Preference: GenericPreference[user: User[id:4257], item:Item[id:2827], value:2.0]
Apr 9, 2008 1:15:34 AM com.planetj.taste.impl.model.jdbc.H2JDBCDataModel setPreference
FINE: Setting preference for user '4257', item '2827', value 2.0
Apr 9, 2008 1:15:34 AM com.planetj.taste.impl.model.jdbc.H2JDBCDataModel setPreference
FINE: Executing SQL update: MERGE INTO taste_preferences(user_id,item_id,preference) key(user_id,item_id) VALUES (?, ?, ?)
Apr 9, 2008 1:16:53 AM evaluate
INFO: User's Stats:
[{MAE=4.0}, {MAE=0.8246347}, {MAE=0.7886371}]
Apr 9, 2008 1:16:53 AM evaluate
INFO: It took ms: 79100
Apr 9, 2008 1:16:53 AM evaluate
INFO: Users' Stats:
4.0    0.8246347    0.7886371   


Nice Tool

Nice tool: diigo, which is great for bookmarking and annotating websites.

Collaborative Filtering

Interesting comment:

I do not think that the next step in collaborative filtering is to find ways to improve accuracy according to some metric. I think this game got old circa 2000. I am rather looking forward to people coming up with drastically new problems and insights.

DB Optimization

Optimizing the program from the db end



If necessary, can speed up the correlation for influence AL by using the mean-free form of the formula.  Quoted note: "Please note that the equation above can be replaced by an equivalent formula which avoids using the means and is therefore much faster to calculate."
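
The formula itself did not survive in these notes.  The standard mean-free (single-pass) form of the Pearson correlation, which is presumably the one meant, is r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2)), where Sx, Sy are the sums of the ratings, Sxx, Syy the sums of squares and Sxy the sum of products over the n co-rated items.  A sketch in Python (the function name is made up, and returning 0 for a zero denominator is my own convention, not necessarily what the library does):

```python
import math


def pearson_single_pass(xs, ys):
    """Pearson correlation from running sums, without computing the means first."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    sx = sy = sxx = syy = sxy = 0.0
    for x, y in zip(xs, ys):
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
    numerator = n * sxy - sx * sy
    denominator_sq = (n * sxx - sx * sx) * (n * syy - sy * sy)
    if denominator_sq <= 0.0:
        # No variance in one of the vectors (or fewer than 2 points):
        # treat the correlation as 0 (assumed convention).
        return 0.0
    return numerator / math.sqrt(denominator_sq)
```

One caveat: this form subtracts large, nearly equal numbers, so it can lose precision for long vectors of large values.  For ratings on a small scale this is unlikely to matter.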


Additional DBs

See if the experiments can also be run on Netflix and Jester.


Correctness Verification

In the empirical results the MAE sometimes does not decrease as much as expected, or at later stages sometimes actually increases.  Need to verify that there are no bugs in the program and that it functions as expected.

1) Examine the correlation measure

2) How the rating is estimated

3) etc.



Should retrieve the neighborhood based only on the items rated by the user (since in real settings we don't have access to the test items, and using test items may skew the neighborhood).



need to get the neighborhood only once in the beginning (users that rated one of the test_items)
don't take the currently rated items into account when retrieving the hood

as more items are rated by the user, all we need to do is to update the correlation for the users that have the newly rated item.
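
A sketch of how that incremental update could look, assuming the correlation is kept as running sums per neighbor (the mean-free Pearson form).  When the active user rates a new item, only the neighbors who also rated that item get touched.  RunningPearson, rate_item and the raters_by_item index are hypothetical names for illustration.

```python
import math


class RunningPearson:
    """Running sums for one (active user, neighbor) pair."""

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.syy = self.sxy = 0.0

    def add(self, x, y):
        # x: active user's rating, y: neighbor's rating of the same item.
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.syy += y * y
        self.sxy += x * y

    def value(self):
        den_sq = (self.n * self.sxx - self.sx ** 2) * (self.n * self.syy - self.sy ** 2)
        if den_sq <= 0.0:
            return 0.0  # assumed convention when there is no variance
        return (self.n * self.sxy - self.sx * self.sy) / math.sqrt(den_sq)


def rate_item(running, raters_by_item, item, value):
    """Fold a new rating by the active user into the affected neighbors only.

    running:        dict neighbor_id -> RunningPearson
    raters_by_item: dict item_id -> {neighbor_id: rating} (inverted index)
    Returns the sorted ids of the neighbors whose sums changed.
    """
    touched = []
    for neighbor, rating in raters_by_item.get(item, {}).items():
        running.setdefault(neighbor, RunningPearson()).add(value, rating)
        touched.append(neighbor)
    return sorted(touched)
```

With a single co-rated item the value comes out as 0, since there is no variance yet.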



We only need to keep the records of the users that are in the neighborhood; all the other users could be deleted from the db.  To make it even more efficient should do it in steps (minimizes memory etc).


1) delete users with correlation == 0

2) delete users that did not rate any of the test items


each subsequent step takes less time (since the db gets smaller)
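
The two pruning steps could be sketched roughly like this.  SQLite (via Python's sqlite3) is only a stand-in for the H2 database here; the table and column names come from the log above, while prune_users, zero_corr_users and keep_user are hypothetical.  The list of zero-correlation users is assumed to be computed elsewhere, and the active user is explicitly protected from deletion.

```python
import sqlite3


def prune_users(conn, test_items, zero_corr_users, keep_user):
    """Shrink taste_preferences in two passes; returns rows deleted per pass."""
    if not test_items:
        raise ValueError("need at least one test item")
    deleted = []

    # Step 1: drop users whose correlation with the active user is 0.
    if zero_corr_users:
        marks = ",".join("?" * len(zero_corr_users))
        cur = conn.execute(
            f"DELETE FROM taste_preferences WHERE user_id IN ({marks})",
            list(zero_corr_users),
        )
        deleted.append(cur.rowcount)
    else:
        deleted.append(0)

    # Step 2: drop users that did not rate any of the test items,
    # but never the active user (keep_user).
    marks = ",".join("?" * len(test_items))
    cur = conn.execute(
        "DELETE FROM taste_preferences WHERE user_id != ? AND user_id NOT IN ("
        "SELECT DISTINCT user_id FROM taste_preferences "
        f"WHERE item_id IN ({marks}))",
        [keep_user, *test_items],
    )
    deleted.append(cur.rowcount)
    conn.commit()
    return deleted
```

Running step 1 first means the subquery in step 2 already works on a smaller table.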

when only one item has been rated, don't need to get the neighborhood and calculate MAE since it is max anyway (since the correlation is 0)

Maybe run with Netflix too :)



At each iteration calculate MAE for all of the items (pick next item randomly)




May add an enhancement that uses only the top n neighbors.  Might implement it using GROUP BY and COUNT (the neighbor that has the most items in common should be selected).

Could select the best based on two criteria:

1) the users have rated the most items in common


2) the users have rated the most test items (if this criterion is used for all of the methods, it is a fair one)
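
A sketch of criterion 1 using GROUP BY and COUNT, again with SQLite standing in for H2.  The table and column names are the ones from the log; top_n_neighbors is a made-up name, and the query assumes (user_id, item_id) is unique, which the MERGE key in the log suggests.

```python
import sqlite3


def top_n_neighbors(conn, user_id, n):
    """Return [(neighbor_id, common_items)] for the n users sharing the most rated items."""
    sql = """
        SELECT other.user_id, COUNT(*) AS common
        FROM taste_preferences AS mine
        JOIN taste_preferences AS other
          ON other.item_id = mine.item_id
         AND other.user_id != mine.user_id
        WHERE mine.user_id = ?
        GROUP BY other.user_id
        ORDER BY common DESC, other.user_id ASC
        LIMIT ?
    """
    return conn.execute(sql, (user_id, n)).fetchall()
```

Criterion 2 would be the same shape of query, counting only rows whose item_id is in the test set instead of joining against the active user's ratings.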


X -

unexpected results for Random (MAE does not decrease)

Double check that correlation is calculated correctly

double check MAE calculation etc.
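
For the MAE double check, a tiny reference implementation to compare the program's numbers against on hand-picked cases.  The function name is hypothetical; the definition is just the standard mean of absolute differences.

```python
def mean_absolute_error(predicted, actual):
    """Mean of |predicted - actual| over paired ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)
```

On a 1 to 5 rating scale the worst possible value is 4.0, which is consistent with the first MAE in the log above.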


