Yelp Recommender System¶

Clara De Paolis Kaluza¶

CS6220-03 Final Project¶

Fall 2016¶

The Data: Exploratory data analysis and pre-processing¶

The Yelp Dataset Challenge dataset provides information on a subset of its

users
businesses
reviews and ratings
tips
check-ins

It includes data from 10 cities worldwide, but the size of the data from each city differs widely

e.g. there are 36,500 businesses in the Arizona data, but only 530 in the Ontario data

To build a recommendation system, we first must clean the data and establish a clear goal for recommendations

Data summary¶

The Yelp Dataset Challenge data offers a rich collection of data about businesses and users on Yelp. These data include information about users, businesses, reviews, user ratings, and other information collected by Yelp, such as user "tips" for businesses and counts of user "check-ins" to businesses.

The dataset includes businesses and users from ten cities in North America and Europe. These include Edinburgh, U.K.; Karlsruhe, Germany; Montreal and Waterloo, Canada; and six cities in the U.S. The number of businesses, users, and reviews in each city is not uniform. For example, the dataset includes 36,500 businesses in Arizona, but only 530 in Ontario. As a starting point for this project, one city will be chosen to build the recommender system. The city will be chosen to maximize available data. The selection criterion is yet to be determined, but possible choices are the maximum number of businesses, the maximum number of reviews and tips, or the maximum number of check-ins.

User data includes information about each user's account, such as the account creation date, the number of reviews written by the user, the average rating given by the user, the user's "friends," "compliments" from other users, and votes from other users on reviews written by the user. In addition, each review and tip is linked to the user who wrote it. Business data includes the business category, location, hours of operation, average rating, number of reviews, and the count of user check-ins by date (though the specific user ID for each check-in is not provided). Every review is also linked to the business for which it was written. All these data will be used to construct features that capture business-business and user-user similarities.

Filtering and selecting the dataset¶

There are many types of business categories in the dataset. These categories are included as tags in the data, and a business can have multiple tags.

In [2]:

# Total number of businesses
business_df['business_id'].count()

Out[2]:

Business Category¶

To get an idea of the types of businesses included in the dataset, we can look at the "Categories" tag in the Businesses table.

In [4]:

    
categories, num_categories, top_categories = common_categories(business_df)

print("There are {} unique business categories in the dataset".format(num_categories))
print("The top categories by count are")
print('\n'.join('\t{}:{}'.format(*k) for i,k in enumerate(top_categories)))

There are 1017 unique business categories in the dataset
The top categories by count are
	Restaurants:26729
	Shopping:12444
	Food:10143
	Beauty & Spas:7490
	Health & Medical:6106
	Home Services:5866
	Nightlife:5507
	Automotive:4888
	Bars:4727
	Local Services:4041
	Active Life:3455
	Fashion:3395
	Event Planning & Services:3237
	Fast Food:3154
	Pizza:2881
	Mexican:2705
	Hotels & Travel:2673
	Sandwiches:2666
	American (Traditional):2608
	Arts & Entertainment:2447
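The helper `common_categories` is defined elsewhere in the notebook and is not shown here. As a rough sketch of what such a count might look like, assuming each business record is a dict whose `categories` key holds a list of tag strings (a hypothetical layout, as is the name `count_categories`):

```python
from collections import Counter

def count_categories(businesses, top_n=20):
    """Count how often each category tag appears across businesses.

    `businesses` is assumed to be an iterable of dicts with a
    'categories' list; this layout is an assumption for illustration.
    """
    counts = Counter()
    for business in businesses:
        # A business with several tags contributes once to each tag.
        counts.update(set(business.get('categories') or []))
    return counts, len(counts), counts.most_common(top_n)

sample = [
    {'business_id': 'a', 'categories': ['Restaurants', 'Pizza']},
    {'business_id': 'b', 'categories': ['Restaurants', 'Bars', 'Nightlife']},
    {'business_id': 'c', 'categories': ['Shopping']},
]
counts, num_categories, top_categories = count_categories(sample, top_n=2)
print("There are {} unique business categories".format(num_categories))
```

Because a business can carry several tags, the per-category counts sum to more than the number of businesses.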

Location¶

Next, we explore the data to select the best city with which to work. Because many cities are actually larger metropolitan areas that include several cities, we divide the data by state and look at the counts by state. We would like to choose a location that has the most businesses, and restaurants in particular, but also has the most reviews for those businesses (restaurants).

In [6]:

# Business counts
business_location_counts, top_restaurant_counts = data_location(business_df, business_df)
print("Number of all business and restaurant by state")
print('\n'.join('\t{}:{}'.format(*k) for i,k in enumerate(top_restaurant_counts[:12])))

# Reviews counts
review_location_counts, top_restaurant_review_counts = data_location(review_df, business_df)
print("Number of reviews for all business and restaurant by state")
print('\n'.join('\t{}:{}'.format(*k) for i,k in enumerate(top_restaurant_review_counts[:12])))

# Check-in counts
checkin_location_counts, top_restaurant_checkin_counts = data_location(checkin_df, business_df)
print("Number of check-ins for all business and restaurant by state")
print('\n'.join('\t{}:{}'.format(*k) for i,k in enumerate(top_restaurant_checkin_counts[:12])))

# Tip counts
tips_location_counts, top_restaurant_tips_counts = data_location(tip_df, business_df)
print("Number of tips for all business and restaurant by state")
print('\n'.join('\t{}:{}'.format(*k) for i,k in enumerate(top_restaurant_tips_counts[:12])))

Number of all business and restaurant by state
	AZ:{'all': 36500, 'restaurants': 9427}
	NV:{'all': 23591, 'restaurants': 5912}
	QC:{'all': 5591, 'restaurants': 3385}
	NC:{'all': 6835, 'restaurants': 2421}
	PA:{'all': 4086, 'restaurants': 1671}
	EDH:{'all': 3297, 'restaurants': 1232}
	WI:{'all': 3066, 'restaurants': 1172}
	BW:{'all': 1055, 'restaurants': 571}
	ON:{'all': 530, 'restaurants': 372}
	IL:{'all': 808, 'restaurants': 317}
	SC:{'all': 327, 'restaurants': 143}
	MLN:{'all': 161, 'restaurants': 75}
Number of reviews for all business and restaurant by state
	NV:{'all': 1154799, 'restaurants': 662428}
	AZ:{'all': 1029103, 'restaurants': 622446}
	NC:{'all': 165625, 'restaurants': 112794}
	PA:{'all': 111542, 'restaurants': 78754}
	QC:{'all': 88046, 'restaurants': 63987}
	WI:{'all': 69917, 'restaurants': 49243}
	EDH:{'all': 30003, 'restaurants': 15189}
	IL:{'all': 19045, 'restaurants': 13054}
	ON:{'all': 5854, 'restaurants': 4814}
	SC:{'all': 5004, 'restaurants': 3532}
	BW:{'all': 3721, 'restaurants': 2714}
	TX:{'all': 911, 'restaurants': 899}
Number of check-ins for all business and restaurant by state
	AZ:{'all': 24876, 'restaurants': 8570}
	NV:{'all': 17948, 'restaurants': 5505}
	QC:{'all': 4343, 'restaurants': 2741}
	NC:{'all': 5300, 'restaurants': 2212}
	PA:{'all': 3061, 'restaurants': 1457}
	WI:{'all': 2154, 'restaurants': 1018}
	EDH:{'all': 1899, 'restaurants': 822}
	ON:{'all': 398, 'restaurants': 297}
	IL:{'all': 491, 'restaurants': 259}
	BW:{'all': 214, 'restaurants': 160}
	SC:{'all': 229, 'restaurants': 125}
	MLN:{'all': 115, 'restaurants': 58}
Number of tips for all business and restaurant by state
	NV:{'all': 313993, 'restaurants': 183039}
	AZ:{'all': 245449, 'restaurants': 156933}
	NC:{'all': 35955, 'restaurants': 24222}
	PA:{'all': 19056, 'restaurants': 12392}
	QC:{'all': 13776, 'restaurants': 9526}
	WI:{'all': 10850, 'restaurants': 7660}
	EDH:{'all': 5159, 'restaurants': 2533}
	IL:{'all': 1983, 'restaurants': 1378}
	SC:{'all': 959, 'restaurants': 673}
	ON:{'all': 825, 'restaurants': 652}
	TX:{'all': 349, 'restaurants': 348}
	MLN:{'all': 302, 'restaurants': 176}

Time¶

To train and evaluate the recommender system, the dataset must be split by time. The earlier time period is used to train the recommender system and the later time period is used to evaluate it. To find an appropriate train-test split, the number of reviews over time is analyzed.

In [11]:

plot_reviews_over_time(counts, cummulative_counts, bin_edges, cutoff)
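The plotting helper and the way `cutoff` was chosen are not shown in this section. One possible way to pick a cut-off date, sketched here under the assumption that each review is a dict with a `date` field and that a fixed fraction of reviews should fall in the training period (the function name and `train_fraction` parameter are hypothetical):

```python
from datetime import date

def split_by_time(reviews, train_fraction=0.8):
    """Split reviews chronologically at a cut-off date.

    Reviews dated before the cut-off go to train, the rest to test.
    """
    ordered = sorted(reviews, key=lambda r: r['date'])
    # Index of the first review that belongs to the test period.
    n_train = min(int(len(ordered) * train_fraction), len(ordered) - 1)
    cutoff = ordered[n_train]['date']
    train = [r for r in ordered if r['date'] < cutoff]
    test = [r for r in ordered if r['date'] >= cutoff]
    return cutoff, train, test

reviews = [
    {'user': 'u1', 'date': date(2014, 3, 1)},
    {'user': 'u2', 'date': date(2015, 7, 15)},
    {'user': 'u1', 'date': date(2016, 1, 10)},
    {'user': 'u3', 'date': date(2013, 11, 30)},
]
cutoff, train, test = split_by_time(reviews, train_fraction=0.5)
print("cut-off date:", cutoff)
```

Splitting on a date rather than on a random sample keeps all "future" reviews out of the training data.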

Utility Matrix¶

Constructing the utility matrix¶

If $r_{ij}$ is the rating user $i$ gave item $j$, the utility matrix $R$ has entries

$$ R_{ij} = \left\{ \begin{array}{ll} r_{ij} & \text{if user $i$ reviewed item $j$} \\ 0 & \mathrm{otherwise} \\ \end{array} \right. $$
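As an illustration of this definition, the following sketch builds a small utility matrix from a list of `(user_id, business_id, stars)` triples. The triple layout and the function name are assumptions made for the example, and a dense NumPy array is used only because the example is tiny; for the real data a sparse matrix would be needed.

```python
import numpy as np

def build_utility_matrix(ratings):
    """Build R from a list of (user_id, business_id, stars) triples.

    Unrated (user, item) pairs are left at 0, matching the definition above.
    """
    users = sorted({u for u, _, _ in ratings})
    items = sorted({b for _, b, _ in ratings})
    user_index = {u: i for i, u in enumerate(users)}
    item_index = {b: j for j, b in enumerate(items)}
    R = np.zeros((len(users), len(items)))
    for u, b, stars in ratings:
        R[user_index[u], item_index[b]] = stars
    return R, user_index, item_index

ratings = [('u1', 'b1', 5), ('u1', 'b2', 3), ('u2', 'b2', 4), ('u3', 'b3', 1)]
R_small, user_index, item_index = build_utility_matrix(ratings)
print(R_small)
```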

Users don't rate many restaurants¶

In [20]:

# plot review counts per user
plot_review_counts(M)

average number of ratings per user: 2.982617507210227

The utility matrix is sparse¶

In [21]:

# plot utility matrix
plot_sparse_matrix(M)

Test set¶

The recommendation system is evaluated by comparing predicted and actual star ratings for users who have reviews both before and after the training/testing cut-off.

Users with ratings only after the cut-off are disregarded, and not included in the utility matrix, since training on "future" data is not realistic in a "real-world" setting.

The utility matrix is constructed with the test indices masked off and the users with ratings only after the cut-off removed.
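A minimal sketch of this masking step, assuming a dense utility matrix and a list of `(i, j)` test indices (both the function name and the dense representation are assumptions for illustration):

```python
import numpy as np

def mask_test_entries(R, test_indices):
    """Zero out test ratings and drop users left with no training ratings.

    Returns the masked matrix restricted to the kept users, plus the
    original row indices of those users.
    """
    R_train = R.copy()
    for i, j in test_indices:
        R_train[i, j] = 0  # hide the held-out rating from training
    # Users whose ratings all fall after the cut-off now have empty rows.
    keep = np.flatnonzero((R_train != 0).sum(axis=1) > 0)
    return R_train[keep], keep

R_full = np.array([[5.0, 3.0, 0.0],
                   [0.0, 4.0, 0.0],
                   [0.0, 0.0, 1.0]])
test_indices = [(0, 1), (2, 2)]
R_train, kept_users = mask_test_entries(R_full, test_indices)
print(kept_users)
```

In this toy case the third user has only a held-out rating, so that row is removed from the training matrix.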

Methods: Content-Based vs Collaborative Filtering Recommendation¶

Content-Based¶

In content-based recommendation systems, profiles are created for items and for users.

Creating good profiles relies on designing (or learning) good features

Predictions are based on similarities between items or similarities between users

Collaborative Filtering¶

Instead of constructing item and user profiles, users are represented by their corresponding row in the utility matrix and items by their corresponding column.

Matrix factorization based predictions¶

Users' ratings of items are not random. They are governed by ... something. We don't really know. Ratings could depend on many factors relating to users, items, and external circumstances.

We assume there are some (latent) factors that can be regarded as responsible for how users review restaurants. Without explicitly constructing these factors, we can construct matrices to represent them.

The utility matrix $R$ can be factorized in the following way

$$ R_{n \times m} \approx P_{n \times k}\, Q_{k \times m} $$

$n$: number of users
$m$: number of items (businesses)
$k$: number of factors we allow for deciding ratings

The goal is to factor the utility matrix to reduce the error

i.e. find $P$ and $Q$ so that the error is minimized

Root mean square error

$ R \approx PQ$

prediction for user $i$ on item $j$: $S_{ij} = (PQ)_{ij}$

$$ RMSE = \sqrt{\frac{1}{N} \sum_{(i,j)}\left(S_{ij} - r_{ij}\right)^2} $$

where the sum runs over the $N$ rated entries $(i,j)$ being evaluated

Instead of explicitly performing the $PQ$ multiplication, we can compute $S_{ij} = P_{i,:}Q_{:,j}$ only for non-zero values in the utility matrix
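A minimal sketch of this idea, assuming dense NumPy arrays with $P$ of shape $n \times k$ and $Q$ of shape $k \times m$ (the function name `rmse_observed` is hypothetical and is not the notebook's own helper):

```python
import numpy as np

def rmse_observed(R, P, Q):
    """RMSE over the non-zero entries of R, without forming the full P @ Q."""
    rows, cols = np.nonzero(R)
    # S_ij = P[i, :] . Q[:, j], computed only for the rated (i, j) pairs.
    preds = np.sum(P[rows] * Q[:, cols].T, axis=1)
    return np.sqrt(np.mean((preds - R[rows, cols]) ** 2))

P_toy = np.array([[1.0, 0.0], [0.0, 1.0]])
Q_toy = np.array([[5.0, 0.0], [0.0, 3.0]])
R_toy = np.array([[5.0, 0.0], [0.0, 4.0]])
print(rmse_observed(R_toy, P_toy, Q_toy))
```

For a sparse utility matrix this costs on the order of $Nk$ operations for $N$ ratings, instead of the $nmk$ needed for the full product.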

Finding optimal $P$ and $Q$¶

Linear regression and least squares¶

In linear regression, we had a data matrix $\mathbf{X}$ and a target $\mathbf{y}$ and we wanted to find a coefficient vector $\mathbf{w}$ so that

$$ \mathbf{y} \approx \mathbf{X}\mathbf{w} $$

To find the $\mathbf{w}$ that minimizes the regularized squared error $\left(\displaystyle\sum_{n=1}^N (\mathbf{w}^T\mathbf{x}_n - y_n)^2 + \lambda\|\mathbf{w}\|_2^2 \right)$ there is a closed-form solution:

$$ \mathbf{w} = \left(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I} \right)^{-1}\mathbf{X}^T\mathbf{y} $$
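The closed-form solution can be checked numerically. The sketch below uses synthetic data (all names and values are made up for illustration) and solves the linear system rather than forming the inverse explicitly:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam I) w = X^T y for w."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=50)

lam = 0.1
w = ridge_fit(X, y, lam)
# Half the gradient of sum((Xw - y)^2) + lam * ||w||^2; it vanishes at the minimizer.
grad = X.T @ (X @ w - y) + lam * w
print(w, np.abs(grad).max())
```

Using `np.linalg.solve` is a design choice: it is generally more stable and cheaper than computing the matrix inverse.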

Multivariable linear regression and least squares¶

If instead of predicting a single target value, we want to predict a vector for each instance $\mathbf{x}_i$, we need to find a matrix of coefficients $\mathbf{W}$ so that $\mathbf{Y} \approx \mathbf{X}\mathbf{W}$,

and the closed-form solution is $$ \mathbf{W} = \left(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I} \right)^{-1}\mathbf{X}^T\mathbf{Y} $$

The multivariable regression problem $$\min_{\mathbf{W}} \|\mathbf{Y}-\mathbf{X}\mathbf{W}\|^2 + \lambda \|\mathbf{W}\|^2$$ looks very similar to the matrix factorization problem $$ \min_{\mathbf{P},\mathbf{Q}} \|\mathbf{R} - \mathbf{P}\mathbf{Q}\|^2 + \lambda\left(\|\mathbf{P}\|^2 + \|\mathbf{Q}\|^2\right) $$

except in MF,

**We don't know either of the matrices $\mathbf{P}$ and $\mathbf{Q}$ in the product!**

Therefore the objective is non-convex

and the problem is NP-hard

Alternating least squares¶

Using alternating least squares, we can fix one matrix, solve for the other, then use those values to fix the second matrix and solve for the first. We continue to alternate until the matrices converge. With $\mathbf{P}$ of size $n \times k$ and $\mathbf{Q}$ of size $k \times m$, each step is a regularized least-squares solve:

Repeat:

Fix $\mathbf{Q}$. Solve for $\mathbf{P}$
- $\mathbf{P}^T =\left(\mathbf{Q}\mathbf{Q}^T + \lambda\mathbf{I}\right)^{-1}\mathbf{Q}\mathbf{R}^T$
Fix $\mathbf{P}$. Solve for $\mathbf{Q}$
- $\mathbf{Q} =\left(\mathbf{P}^T\mathbf{P} + \lambda\mathbf{I}\right)^{-1}\mathbf{P}^T\mathbf{R}$

Now, in each step the problem is just a multivariable linear regression where the feature matrix is $\mathbf{Q}$ (or $\mathbf{P}$) and the coefficient matrix is $\mathbf{P}$ (or $\mathbf{Q}$)
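The alternation can be sketched in a few lines of NumPy. This is a simplified, dense illustration that fits every entry of $\mathbf{R}$ (zeros included) with $\mathbf{P}$ of shape $n \times k$ and $\mathbf{Q}$ of shape $k \times m$; it is not the notebook's `alsq` function, whose implementation is not shown here, and the name `als` and its parameters are assumptions.

```python
import numpy as np

def als(R, k, lam, n_iters=20, seed=0):
    """Alternating least squares for R ~ P Q with an L2 penalty on P and Q."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    P = rng.normal(scale=0.1, size=(n, k))
    Q = rng.normal(scale=0.1, size=(k, m))
    I = np.eye(k)
    losses = []
    for _ in range(n_iters):
        # Fix Q, solve for P:  P^T = (Q Q^T + lam I)^{-1} Q R^T
        P = np.linalg.solve(Q @ Q.T + lam * I, Q @ R.T).T
        # Fix P, solve for Q:  Q = (P^T P + lam I)^{-1} P^T R
        Q = np.linalg.solve(P.T @ P + lam * I, P.T @ R)
        # Regularized objective after this full sweep.
        loss = np.sum((R - P @ Q) ** 2) + lam * (np.sum(P ** 2) + np.sum(Q ** 2))
        losses.append(loss)
    return P, Q, losses

rng = np.random.default_rng(1)
R_demo = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 5))  # exactly rank 2
P_hat, Q_hat, losses = als(R_demo, k=2, lam=1e-3)
print("final objective:", losses[-1])
```

Each half-step minimizes the objective exactly over one factor, so the recorded objective cannot increase from one sweep to the next.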

Note about regularization¶

Modify the objective to use weighted-$\lambda$-regularization¹ $$ \sum_{(i,j)} \left(r_{ij} - \mathbf{P}_{i,:}\mathbf{Q}_{:,j}\right)^2 + \lambda\left(\sum_i n_{u_i} \|\mathbf{P}_{i,:}\|^2 + \sum_j n_{b_j}\|\mathbf{Q}_{:,j}\|^2\right) $$

where the first sum runs over the rated entries, $n_{u_i}$ is the number of businesses user $i$ reviewed, and $n_{b_j}$ is the number of users that reviewed business $j$

Repeat:

Fix $\mathbf{Q}$. Solve for each row of $\mathbf{P}$ (user $i$)
- $\mathbf{P}_{i,:}^T =\left(\mathbf{Q}\mathbf{Q}^T + n_{u_i}\lambda\mathbf{I}\right)^{-1}\mathbf{Q}\mathbf{R}_{i,:}^T$
Fix $\mathbf{P}$. Solve for each column of $\mathbf{Q}$ (business $j$)
- $\mathbf{Q}_{:,j} =\left(\mathbf{P}^T\mathbf{P} + n_{b_j}\lambda\mathbf{I}\right)^{-1}\mathbf{P}^T\mathbf{R}_{:,j}$
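A sketch of the weighted-$\lambda$ updates follows. It solves one small system per user and per business and, following Zhou et al., restricts each solve to the ratings that user or business actually has. The function names, the dense representation, and the toy matrix are assumptions for illustration, and this may differ in detail from the notebook's own implementation.

```python
import numpy as np

def wr_objective(R, P, Q, lam):
    """Squared error on rated entries plus the weighted-lambda penalty."""
    observed = R != 0
    n_u = observed.sum(axis=1)  # ratings per user
    n_b = observed.sum(axis=0)  # ratings per business
    err = np.sum(((R - P @ Q) ** 2)[observed])
    penalty = lam * (np.sum(n_u * np.sum(P ** 2, axis=1))
                     + np.sum(n_b * np.sum(Q ** 2, axis=0)))
    return err + penalty

def als_wr(R, k, lam, n_iters=20, seed=0):
    """ALS with weighted-lambda regularization on the rated entries only."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    observed = R != 0
    P = rng.normal(scale=0.1, size=(n, k))
    Q = rng.normal(scale=0.1, size=(k, m))
    I = np.eye(k)
    for _ in range(n_iters):
        for i in range(n):  # fix Q, update each user's row of P
            cols = np.flatnonzero(observed[i])
            if cols.size == 0:
                continue  # user with no training ratings keeps their row
            Qi = Q[:, cols]
            P[i] = np.linalg.solve(Qi @ Qi.T + lam * cols.size * I, Qi @ R[i, cols])
        for j in range(m):  # fix P, update each business's column of Q
            rows = np.flatnonzero(observed[:, j])
            if rows.size == 0:
                continue
            Pj = P[rows]
            Q[:, j] = np.linalg.solve(Pj.T @ Pj + lam * rows.size * I, Pj.T @ R[rows, j])
    return P, Q

R_wr = np.array([[5, 3, 0, 1],
                 [4, 0, 0, 1],
                 [1, 1, 0, 5],
                 [0, 0, 0, 0],
                 [0, 1, 5, 4]], dtype=float)
P1, Q1 = als_wr(R_wr, k=2, lam=0.1, n_iters=1)
P20, Q20 = als_wr(R_wr, k=2, lam=0.1, n_iters=20)
print(wr_objective(R_wr, P1, Q1, 0.1), wr_objective(R_wr, P20, Q20, 0.1))
```

Scaling the penalty by the number of ratings means that users and businesses with many ratings are regularized more strongly in absolute terms, which keeps the penalty in proportion to their share of the error term.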

¹ Zhou, Y. et al., 2008. Large-scale parallel collaborative filtering for the Netflix Prize. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). pp. 337-348.

Results¶

In [326]:

for lmbda, k in param_pairs:
    # Skip parameter pairs that already have a fitted model
    if (lmbda, k) not in results or results[(lmbda, k)]['P'] is None:
        print('lambda {}, k {}'.format(lmbda, k))
        P, Q, train_errors, validation_errors, test_errors = alsq(
            R, M, k, lmbda, max_iters, train_indxs, validation_indxs, test_indxs)
        results[(lmbda, k)] = {'train': train_errors,
                               'validate': validation_errors,
                               'test': test_errors,
                               'P': P,
                               'Q': Q}

lambda 5, k 2
[Iteration 1/20: train RMSE 3.110945782088086, validation RMSE 1.4962039263643991, test RMSE 1.5381778214157509]
[Iteration 2/20: train RMSE 2.2319606259400837, validation RMSE 1.2514785329669016, test RMSE 1.3442927148632347]
[Iteration 3/20: train RMSE 2.0720930170066088, validation RMSE 1.1993932138703314, test RMSE 1.2997340387729424]
[Iteration 4/20: train RMSE 2.020959511903329, validation RMSE 1.182239546764226, test RMSE 1.2780946831626183]
[Iteration 5/20: train RMSE 2.003961204685282, validation RMSE 1.1762542179494468, test RMSE 1.2663021076618328]
[Iteration 6/20: train RMSE 1.9985774477750133, validation RMSE 1.1744357721269125, test RMSE 1.2594438569536164]
[Iteration 7/20: train RMSE 1.9972978359061795, validation RMSE 1.1742838684952777, test RMSE 1.2551062663128012]
[Iteration 8/20: train RMSE 1.9974686965314588, validation RMSE 1.1747077063049038, test RMSE 1.2521876988061313]
[Iteration 9/20: train RMSE 1.9980903628336695, validation RMSE 1.1752541493878865, test RMSE 1.2501606715260039]
[Iteration 10/20: train RMSE 1.9987550986871994, validation RMSE 1.1757812073601024, test RMSE 1.2487294827059452]
[Iteration 11/20: train RMSE 1.999330471853797, validation RMSE 1.1762495411464093, test RMSE 1.2477074865890467]
[Iteration 12/20: train RMSE 1.9997899916945343, validation RMSE 1.1766484810762678, test RMSE 1.2469713768483188]
[Iteration 13/20: train RMSE 2.000134240024294, validation RMSE 1.1769813016006405, test RMSE 1.2464390913251007]
[Iteration 14/20: train RMSE 2.0003696299829867, validation RMSE 1.1772590155511262, test RMSE 1.2460558747309587]
[Iteration 15/20: train RMSE 2.0005064462150624, validation RMSE 1.1774948484174128, test RMSE 1.2457846328446571]
[Iteration 16/20: train RMSE 2.000560009832354, validation RMSE 1.1777003524316605, test RMSE 1.245599340197056]
[Iteration 17/20: train RMSE 2.000552070196739, validation RMSE 1.1778831637317044, test RMSE 1.2454803495385969]
[Iteration 18/20: train RMSE 2.000511520346015, validation RMSE 1.1780461940583802, test RMSE 1.245410742536233]
[Iteration 19/20: train RMSE 2.000467682100518, validation RMSE 1.1781886776372013, test RMSE 1.2453742137930404]
[Iteration 20/20: train RMSE 2.0004363679451993, validation RMSE 1.1783097200620212, test RMSE 1.2453564107584338]
lambda 5, k 5
[Iteration 1/20: train RMSE 2.2582364900115612, validation RMSE 1.2833092534270094, test RMSE 1.3797955417690915]
[Iteration 2/20: train RMSE 1.9662826895473566, validation RMSE 1.2300908620547861, test RMSE 1.3312184673881102]
[Iteration 3/20: train RMSE 1.9363010227663797, validation RMSE 1.2153401521798683, test RMSE 1.3128894392687194]
[Iteration 4/20: train RMSE 1.9280473184812477, validation RMSE 1.2048787778880818, test RMSE 1.2976676657200243]
[Iteration 5/20: train RMSE 1.924853143197236, validation RMSE 1.1976290476372307, test RMSE 1.2860932552764253]
[Iteration 6/20: train RMSE 1.92333337538937, validation RMSE 1.1925514145047118, test RMSE 1.2776514366152298]
[Iteration 7/20: train RMSE 1.9224464360591105, validation RMSE 1.1887904628032877, test RMSE 1.271523020158568]
[Iteration 8/20: train RMSE 1.921785768017903, validation RMSE 1.1857846577545188, test RMSE 1.267013209440568]
[Iteration 9/20: train RMSE 1.9211696310999096, validation RMSE 1.1832171537519374, test RMSE 1.263635280438189]
[Iteration 10/20: train RMSE 1.9205262837755268, validation RMSE 1.1809504455839206, test RMSE 1.2610796384903213]
[Iteration 11/20: train RMSE 1.9198416957216713, validation RMSE 1.1789563683503117, test RMSE 1.2591490947450628]
[Iteration 12/20: train RMSE 1.919123462117578, validation RMSE 1.17725261491611, test RMSE 1.2577064268405733]
[Iteration 13/20: train RMSE 1.918385702912294, validation RMSE 1.1758586216885023, test RMSE 1.2566473036833041]
[Iteration 14/20: train RMSE 1.917645023389206, validation RMSE 1.1747726205004452, test RMSE 1.2558885024536155]
[Iteration 15/20: train RMSE 1.916918043066739, validation RMSE 1.1739672041546259, test RMSE 1.2553620294471326]
[Iteration 16/20: train RMSE 1.9162196100120645, validation RMSE 1.1733974871811717, test RMSE 1.2550118647741773]
[Iteration 17/20: train RMSE 1.915561680380291, validation RMSE 1.173013215258407, test RMSE 1.254792256927266]
[Iteration 18/20: train RMSE 1.9149523384799771, validation RMSE 1.1727679523997485, test RMSE 1.254666718684552]
[Iteration 19/20: train RMSE 1.9143952958249009, validation RMSE 1.172623278910957, test RMSE 1.2546070636639388]
[Iteration 20/20: train RMSE 1.9138903465067056, validation RMSE 1.172549339599572, test RMSE 1.2545921819047434]
lambda 5, k 10
[Iteration 1/20: train RMSE 2.0130694031397036, validation RMSE 1.2279553300848227, test RMSE 1.3165179622979126]
[Iteration 2/20: train RMSE 1.8948603745404622, validation RMSE 1.19572371242983, test RMSE 1.2843355227785749]
[Iteration 3/20: train RMSE 1.887796704414249, validation RMSE 1.193692812801721, test RMSE 1.2778404978850135]
[Iteration 4/20: train RMSE 1.8843489858967635, validation RMSE 1.192327397379505, test RMSE 1.2727161969943317]
[Iteration 5/20: train RMSE 1.881687077187238, validation RMSE 1.1910330943678054, test RMSE 1.2687071150743914]
[Iteration 6/20: train RMSE 1.879583808041864, validation RMSE 1.189800073897304, test RMSE 1.2657314025833553]
[Iteration 7/20: train RMSE 1.8778585196920097, validation RMSE 1.1887761784838924, test RMSE 1.2636124317762356]
[Iteration 8/20: train RMSE 1.8763669998517516, validation RMSE 1.1880310481176257, test RMSE 1.2621477256173501]
[Iteration 9/20: train RMSE 1.8750213206795363, validation RMSE 1.187551268023915, test RMSE 1.2611586209468284]
[Iteration 10/20: train RMSE 1.8737765300386118, validation RMSE 1.187291726103361, test RMSE 1.260507396278861]
[Iteration 11/20: train RMSE 1.8726138340603038, validation RMSE 1.187202215858874, test RMSE 1.2600941428194357]
[Iteration 12/20: train RMSE 1.8715278678891398, validation RMSE 1.187236673473767, test RMSE 1.2598481221369051]
[Iteration 13/20: train RMSE 1.8705187307884497, validation RMSE 1.187355977030994, test RMSE 1.2597196611476074]
[Iteration 14/20: train RMSE 1.8695875156859465, validation RMSE 1.1875286719007319, test RMSE 1.2596738694322136]
[Iteration 15/20: train RMSE 1.8687340583163323, validation RMSE 1.1877308656943886, test RMSE 1.259686079477981]
[Iteration 16/20: train RMSE 1.8679560898897953, validation RMSE 1.1879455818419582, test RMSE 1.2597386733591085]
[Iteration 17/20: train RMSE 1.8672493028287453, validation RMSE 1.1881616902871428, test RMSE 1.2598189455647497]
[Iteration 18/20: train RMSE 1.866607934954759, validation RMSE 1.188372588652509, test RMSE 1.2599176768412186]
[Iteration 19/20: train RMSE 1.8660255137606767, validation RMSE 1.1885748632554505, test RMSE 1.26002815963478]
[Iteration 20/20: train RMSE 1.8654955040972248, validation RMSE 1.1887671288619657, test RMSE 1.2601455051265287]
lambda 5, k 20
[Iteration 1/20: train RMSE 2.0463711193706526, validation RMSE 1.2772807319296917, test RMSE 1.3178362054101425]
[Iteration 2/20: train RMSE 1.896799734723713, validation RMSE 1.2003723277653142, test RMSE 1.267475149993136]
[Iteration 3/20: train RMSE 1.878390328698135, validation RMSE 1.1897554604302627, test RMSE 1.2616417330812635]
[Iteration 4/20: train RMSE 1.8690711315360404, validation RMSE 1.1851168753370913, test RMSE 1.2601665725734987]
[Iteration 5/20: train RMSE 1.8628709366359777, validation RMSE 1.182724031767976, test RMSE 1.2598779056752092]
[Iteration 6/20: train RMSE 1.858223179405607, validation RMSE 1.1815620040451655, test RMSE 1.2600055136355535]
[Iteration 7/20: train RMSE 1.8545674845617386, validation RMSE 1.1810764615313718, test RMSE 1.2602869911927086]
[Iteration 8/20: train RMSE 1.8516249256126627, validation RMSE 1.1809506460607095, test RMSE 1.2606168382853482]
[Iteration 9/20: train RMSE 1.849217974816988, validation RMSE 1.1810092662972773, test RMSE 1.2609504605286195]
[Iteration 10/20: train RMSE 1.8472218688666076, validation RMSE 1.1811565764320797, test RMSE 1.2612693732335383]
[Iteration 11/20: train RMSE 1.8455458482143956, validation RMSE 1.1813406219193683, test RMSE 1.2615667292251087]
[Iteration 12/20: train RMSE 1.8441228039458015, validation RMSE 1.1815336859345285, test RMSE 1.2618409446829162]
[Iteration 13/20: train RMSE 1.8429024349926133, validation RMSE 1.1817216180979084, test RMSE 1.2620928141841692]
[Iteration 14/20: train RMSE 1.8418464633209854, validation RMSE 1.1818978718658466, test RMSE 1.2623241656734427]
[Iteration 15/20: train RMSE 1.840925271633097, validation RMSE 1.1820600876708942, test RMSE 1.2625372006451576]
[Iteration 16/20: train RMSE 1.8401155690877569, validation RMSE 1.1822081178992743, test RMSE 1.2627341491481996]
[Iteration 17/20: train RMSE 1.8393987887544572, validation RMSE 1.1823429019532823, test RMSE 1.2629170834763503]
[Iteration 18/20: train RMSE 1.8387599841450062, validation RMSE 1.1824658472310547, test RMSE 1.2630878239976253]
[Iteration 19/20: train RMSE 1.8381870507103393, validation RMSE 1.182578503837484, test RMSE 1.2632479046674974]
[Iteration 20/20: train RMSE 1.8376701531777342, validation RMSE 1.182682401635325, test RMSE 1.2633985777785317]

In [329]:

plot_results_reg(results,lmbdas,ks)

Evaluate¶

Error measure¶

To evaluate the model, we compare the RMSE of this method with a simple baseline measure

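$$ RMSE = \sqrt{\frac{1}{N} \sum_{(i,j)}\left(S_{ij} - r_{ij}\right)^2} $$

where $S_{ij}$ is the predicted rating, $r_{ij}$ is the actual rating, and the sum runs over the $N$ ratings being evaluated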

Baseline¶

The baseline always predicts the average rating for a particular item

In [599]:

# Baseline: predict each item's mean rating
import numpy as np
import scipy.sparse

# Cast to uint32 so that the column sums do not overflow
R = scipy.sparse.csc_matrix(R, dtype=np.uint32)
count_nnz = R.getnnz(0)              # number of ratings per item (column)
sums = np.array(R.sum(0)).flatten()  # sum of ratings per item
means = np.zeros(len(count_nnz))
for i, c in enumerate(count_nnz):
    if c != 0:
        means[i] = sums[i] / c
# Items with no ratings fall back to the mean of the item means
overall_mean = np.mean(means[means != 0])
means[means == 0] = overall_mean
avg_errors = []
for i, j in test_indxs:
    actual = M[i, j]
    avg = means[j]
    avgerr = rmse_pred(actual, avg)  # helper defined elsewhere in the notebook
    avg_errors += [avgerr]
# Average Baseline error
bl_err = np.mean(avg_errors)
print('Average Baseline MSE: {:.4f}'.format(bl_err))

Average Baseline MSE: 1.1550

In [603]:

plt.figure()
plt.hist(avg_errors)
plt.xlabel('MSE')
plt.title('MSE for test set')

Out[603]:

<matplotlib.text.Text at 0x888f27f0>

In [521]:

#Calculate MSE for each parameter pair
MF_errors = {(l,k):[] for l,k in results.keys()}
MFVal_errors = {(l,k):[] for l,k in results.keys()}

#Matrix Factorization            
for l,k in param_pairs:
    if results[(l,k)]['P'] is not None:
        P = results[(l,k)]['P']
        Q = results[(l,k)]['Q']
        
        for i,j in validation_indxs:        
            actual = M[i,j]
            MF = predict(P,Q,i,j)
            MFerr = rmse_pred(actual, MF)
            if len(MFerr)>0:
                MFVal_errors[l,k] += [MFerr[0]]

        for i,j in test_indxs:        
            actual = M[i,j]
            MF = predict(P,Q,i,j)
            MFerr = rmse_pred(actual, MF)
            if len(MFerr)>0:
                MF_errors[l,k] += [MFerr[0]]
                
        valerr = np.sqrt(np.sum(MFVal_errors[l,k])/len(MFVal_errors[l,k]))
        testerr = np.sqrt(np.sum(MF_errors[l,k])/len(MF_errors[l,k]))
        print('lambda = {}, k = {} \t| val error {:.4f} \t| test error {:.4f}'.format(l,k,valerr, testerr))

lambda = 0.1, k = 2 	| val error 1.0147 	| test error 1.2590
lambda = 0.1, k = 5 	| val error 1.0973 	| test error 1.2806
lambda = 0.1, k = 10 	| val error 1.1478 	| test error 1.2269
lambda = 0.1, k = 20 	| val error 1.1303 	| test error 1.1761
lambda = 1, k = 2 	| val error 0.9807 	| test error 1.1531
lambda = 1, k = 5 	| val error 1.0564 	| test error 1.2341
lambda = 1, k = 10 	| val error 1.1174 	| test error 1.2155
lambda = 1, k = 20 	| val error 1.0953 	| test error 1.1896
lambda = 2, k = 2 	| val error 0.9940 	| test error 1.1375
lambda = 2, k = 5 	| val error 1.0471 	| test error 1.1889
lambda = 2, k = 10 	| val error 1.0767 	| test error 1.1943
lambda = 2, k = 20 	| val error 1.0682 	| test error 1.1877
lambda = 5, k = 2 	| val error 1.0715 	| test error 1.2309
lambda = 5, k = 5 	| val error 1.0663 	| test error 1.2385
lambda = 5, k = 10 	| val error 1.0795 	| test error 1.2454
lambda = 5, k = 20 	| val error 1.0749 	| test error 1.2492
lambda = 10, k = 2 	| val error 1.1944 	| test error 1.4074
lambda = 10, k = 5 	| val error 1.1900 	| test error 1.4070
lambda = 10, k = 10 	| val error 1.1902 	| test error 1.4079
lambda = 10, k = 20 	| val error 1.1904 	| test error 1.4079

In [339]:

# effect of lambda
fig = plt.figure()
for k in ks:
    errs = [np.sqrt(np.sum(MFVal_errors[l,k])/len(MFVal_errors[l,k])) for l in lmbdas]
    plt.plot(lmbdas,errs,label='k={}'.format(k))
plt.legend(loc=4)
plt.title('Validation MSE')
plt.xlabel(r'$\lambda$')  # raw string avoids the invalid '\l' escape

Out[339]:

<matplotlib.text.Text at 0x8efff2e8>

In [342]:

# effect of k
fig = plt.figure()
for l in lmbdas:
    errs = [np.sqrt(np.sum(MFVal_errors[l,k])/len(MFVal_errors[l,k])) for k in ks]
    plt.plot(ks,errs,label=r'$\lambda$={}'.format(l))  # raw string avoids the invalid '\l' escape
plt.legend(loc=4)
plt.title('Validation MSE')
plt.xlabel('$k$')

Out[342]:

<matplotlib.text.Text at 0x9c9166a0>
