Similarity-based Game Recommender System

Author: Zhanglin Liu

Date: 12/06/2020

Background

The dataset used in this project is Version 1: User and Item Data from the Steam Video Game and Bundle Data collection. It contains user-item data from the Steam video game platform.

Below are the citations for these datasets:

Self-attentive sequential recommendation. Wang-Cheng Kang, Julian McAuley. ICDM, 2018.

Item recommendation on monotonic behavior chains. Mengting Wan, Julian McAuley. RecSys, 2018.

Generating and personalizing bundle recommendations on Steam. Apurva Pathak, Kshitiz Gupta, Julian McAuley. SIGIR, 2017.

Data Exploration

Loading Dataset

In [1]:
import ast
from pandas import DataFrame
from collections import defaultdict

f = open("australian_users_items.json", encoding = "utf8")

df = []
In [2]:
for line in f:
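    # note: readline() inside the loop consumes the line after the one the
    # loop just read, so only every second record is parsed; the outputs
    # below reflect this (drop this call to load every record)
    line = f.readline()
    # parse the Python-literal dict on this line and append it to df
    df.append(ast.literal_eval(line))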
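
The loop above mixes `for line in f` with `f.readline()`, which advances the file twice per iteration. Below is a minimal sketch of a loader that parses every line instead, assuming each non-empty line holds one Python-literal dict. The helper name `load_records` and the two-line sample are hypothetical and stand in for the real file.

```python
import ast
import io

def load_records(lines):
    """Parse one Python-literal dict per non-empty line."""
    records = []
    for line in lines:  # iterate only; no extra readline() call
        line = line.strip()
        if line:
            records.append(ast.literal_eval(line))
    return records

# hypothetical two-line sample standing in for australian_users_items.json
sample = io.StringIO(
    "{'user_id': 'a', 'items_count': 1, 'items': [{'item_id': '10'}]}\n"
    "{'user_id': 'b', 'items_count': 0, 'items': []}\n"
)
records = load_records(sample)
print(len(records))
```

With the real file, the same helper could be called inside a `with open(...) as f:` block so that the file is closed afterwards. Loading every record would change the counts shown in the outputs of this notebook.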
In [3]:
all_users = []
for d in df:
    user = d['user_id']
    all_users.append(user)
In [4]:
user_df = DataFrame(all_users, columns = ['user_ids'])
user_df
Out[4]:
user_ids
0 js41637
1 Riot-Punch
2 MinxIsBetterThanPotatoes
3 themanwich
4 Wackky
... ...
44150 76561198319916652
44151 76561198320136420
44152 76561198323066619
44153 XxLaughingJackClown77xX
44154 edward_tremethick

44155 rows × 1 columns

Observation

  • There are 44155 user_id elements in this dataset
  • These elements are of string type
In [5]:
# total user elements vs. number of unique user elements in this dataset
len(df),len(user_df['user_ids'].unique())
Out[5]:
(44155, 44012)

Observation

  • 44155 - 44012 = 143 user_id elements repeat an ID that already appears elsewhere in the dataset
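
To see which IDs repeat rather than only how many rows are redundant, a counter over the ID list is one option. This is a small sketch on a hypothetical sample list (`sample_users`); in the notebook the input would be `all_users`.

```python
from collections import Counter

# hypothetical sample of user IDs; in the notebook this would be all_users
sample_users = ['a', 'b', 'a', 'c', 'b', 'a']

counts = Counter(sample_users)

# IDs that occur more than once, with their number of occurrences
repeated = {user: c for user, c in counts.items() if c > 1}

# rows beyond the first occurrence of each ID (total minus unique)
extra_rows = len(sample_users) - len(counts)

print(repeated)
print(extra_rows)
```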
In [6]:
all_items = []
for d in df:
    length = len(d['items'])
    for m in range(length):
        items = d['items'][m]['item_id']
        all_items.append(items)
In [7]:
items_df = DataFrame(all_items, columns = ['item_ids'])
items_df
Out[7]:
item_ids
0 10
1 80
2 100
3 300
4 30
... ...
2588408 497810
2588409 497811
2588410 497812
2588411 497813
2588412 417860

2588413 rows × 1 columns

In [8]:
# number of unique item elements in this dataset
len(items_df['item_ids'].unique())
Out[8]:
10397
In [9]:
# data example
dict(list(df[0].items())[0:4])
Out[9]:
{'user_id': 'js41637',
 'items_count': 888,
 'steam_id': '76561198035864385',
 'user_url': 'http://steamcommunity.com/id/js41637'}

df holds, for each user_id, the items_count (number of games), steam_id, user_url, and items (the details of each item). For example, the first user, with user_id 'js41637', is associated with 888 items in total.

Because of hardware memory constraints, I excluded the items element from the output of the code block above. The next code block shows a portion of the items element.

In [10]:
# the first 5 items, and their elements,
# that the user_id 'js41637' is associated with
df[0]['items'][0:5]
Out[10]:
[{'item_id': '10',
  'item_name': 'Counter-Strike',
  'playtime_forever': 0,
  'playtime_2weeks': 0},
 {'item_id': '80',
  'item_name': 'Counter-Strike: Condition Zero',
  'playtime_forever': 0,
  'playtime_2weeks': 0},
 {'item_id': '100',
  'item_name': 'Counter-Strike: Condition Zero Deleted Scenes',
  'playtime_forever': 0,
  'playtime_2weeks': 0},
 {'item_id': '300',
  'item_name': 'Day of Defeat: Source',
  'playtime_forever': 220,
  'playtime_2weeks': 0},
 {'item_id': '30',
  'item_name': 'Day of Defeat',
  'playtime_forever': 0,
  'playtime_2weeks': 0}]

Data Preparation

In [11]:
usersPerItem = defaultdict(set)
itemsPerUser = defaultdict(set)

itemNames = {}

for d in df:
    user = d['user_id']
    for entry in d['items']:
        item = entry['item_id']
        # users who own each item, and items owned by each user
        usersPerItem[item].add(user)
        itemsPerUser[user].add(item)
        # lookup from item_id to its display name
        itemNames[item] = entry['item_name']

Jaccard Similarity Measure

In [12]:
def Jaccard(s1, s2):
    numer = len(s1.intersection(s2))
    denom = len(s1.union(s2))
    return numer / denom
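
For two sets A and B, the Jaccard similarity is |A ∩ B| / |A ∪ B|. Here, each set holds the users associated with one item. The worked example below uses two hypothetical user sets. The lowercase `jaccard` is an illustrative variant that also guards against two empty sets, which is an assumption on my part and not something the dataset requires.

```python
def jaccard(s1, s2):
    union = s1 | s2
    if not union:
        # assumption: treat two empty sets as having zero similarity
        return 0.0
    return len(s1 & s2) / len(union)

# hypothetical user sets for two games
users_game_x = {'u1', 'u2', 'u3'}
users_game_y = {'u2', 'u3', 'u4'}

# intersection = {u2, u3} (2 users), union = {u1, u2, u3, u4} (4 users)
score = jaccard(users_game_x, users_game_y)
print(score)  # 0.5
```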
In [13]:
# find the items most similar to a given item
# ID: the "item_id" to compare every other item against
# n: the number of similar items to return
def mostSimilar(ID, n):
    similarities = []
    users = usersPerItem[ID]
    for i in usersPerItem:
        if i == ID: continue
        sim = Jaccard(users, usersPerItem[i])
        similarities.append((sim, i))
    similarities.sort(reverse = True)
    return similarities[:n]  
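
To make the ranking step concrete, here is a self-contained sketch of the same idea on a tiny, hypothetical item-to-users mapping (`toy`). The function `most_similar` mirrors `mostSimilar` but takes the mapping as a parameter, which is my own choice for illustration.

```python
def jaccard(s1, s2):
    return len(s1 & s2) / len(s1 | s2)

def most_similar(item_id, n, users_per_item):
    query_users = users_per_item[item_id]
    scored = [
        (jaccard(query_users, users), other)
        for other, users in users_per_item.items()
        if other != item_id  # skip the query item itself
    ]
    # sort by similarity, highest first
    scored.sort(reverse=True)
    return scored[:n]

# hypothetical data: item -> set of users who own it
toy = {
    'A': {'u1', 'u2', 'u3'},
    'B': {'u1', 'u2'},
    'C': {'u3', 'u4'},
    'D': {'u5'},
}

result = most_similar('A', 2, toy)
print([item for _, item in result])  # ['B', 'C']
```

Because the list holds `(similarity, item_id)` tuples, items with equal similarity are ordered by their item_id string, also in descending order.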

Getting Recommendation

Recommendation #1

In [14]:
# the first item_id from the very first user's item list
query = df[0]['items'][0]['item_id']
query
Out[14]:
'10'
In [15]:
# getting the item_name of the item_id of 10
itemNames[query]
Out[15]:
'Counter-Strike'
In [16]:
# gives the Jaccard similarity measure 
# and 10 items that are most similar to the input "item_id"
# outputs in most similar to least similar order
mostSimilar(query,10)
Out[16]:
[(0.9064674580433892, '80'),
 (0.9064674580433892, '100'),
 (0.33818210410441896, '240'),
 (0.3342730567861458, '30'),
 (0.3333333333333333, '40'),
 (0.33265430841311877, '60'),
 (0.28612670408981555, '20'),
 (0.2855763039278815, '50'),
 (0.2852502583788572, '70'),
 (0.28520556814503073, '130')]
In [17]:
# the code above gives the 10 item_ids most similar to "Counter-Strike"
# below are the item names for these 10 item_ids
[itemNames[x[1]] for x in mostSimilar(query,10)]
Out[17]:
['Counter-Strike: Condition Zero',
 'Counter-Strike: Condition Zero Deleted Scenes',
 'Counter-Strike: Source',
 'Day of Defeat',
 'Deathmatch Classic',
 'Ricochet',
 'Team Fortress Classic',
 'Half-Life: Opposing Force',
 'Half-Life',
 'Half-Life: Blue Shift']

Recommendation #2

In [18]:
# the item_id at index 100 (the 101st item) in the very first user's item list
query1 = df[0]['items'][100]['item_id']
query1
Out[18]:
'22380'
In [19]:
# the item_name of the item_id 
itemNames[query1]
Out[19]:
'Fallout: New Vegas'
In [20]:
mostSimilar(query1,5)
Out[20]:
[(0.37277462489310426, '72850'),
 (0.3415632246623926, '22370'),
 (0.3224101479915433, '8870'),
 (0.31585220500595945, '377160'),
 (0.31223010487353486, '49520')]
In [21]:
[itemNames[x[1]] for x in mostSimilar(query1,5)]
Out[21]:
['The Elder Scrolls V: Skyrim',
 'Fallout 3 - Game of the Year Edition',
 'BioShock Infinite',
 'Fallout 4',
 'Borderlands 2']

Recommendation #3

In [22]:
# the item_id at index 15 (the 16th item) in the item list of the user at index 10 (the 11th user)
query2 = df[10]['items'][15]['item_id']
query2
Out[22]:
'12900'
In [23]:
# the item_name of the item_id 
itemNames[query2]
Out[23]:
'Audiosurf'
In [24]:
mostSimilar(query2,8)
Out[24]:
[(0.19985264321237797, '40800'),
 (0.19606612261979495, '107100'),
 (0.189873417721519, '3830'),
 (0.18828828828828828, '57300'),
 (0.18545454545454546, '22000'),
 (0.17699115044247787, '50620'),
 (0.17512420156139105, '48000'),
 (0.1749508989273304, '17410')]
In [25]:
[itemNames[x[1]] for x in mostSimilar(query2,8)]
Out[25]:
['Super Meat Boy',
 'Bastion',
 'Psychonauts',
 'Amnesia: The Dark Descent',
 'World of Goo',
 'Darksiders',
 'LIMBO',
 "Mirror's Edge"]

Conclusion

A similarity-based recommender system recommends items that are similar to an item the user already has experience with. In this case, it recommends games based on the similarity between the input game and the other games in the dataset, where two games are similar if largely the same users are associated with both. This project uses the Jaccard similarity measure, but alternatives such as cosine similarity can also be used.
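
As a rough sketch of the cosine alternative: when each game is represented by the set of users associated with it (a binary vector), cosine similarity reduces to |A ∩ B| / sqrt(|A| · |B|). The function name `cosine_sets` and the sample sets are hypothetical, and this was not evaluated on the Steam data.

```python
import math

def cosine_sets(s1, s2):
    # assumption: an empty set has zero similarity to anything
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / math.sqrt(len(s1) * len(s2))

# hypothetical user sets for two games
users_game_x = {'u1', 'u2', 'u3'}
users_game_y = {'u2', 'u3', 'u4'}

# 2 shared users / sqrt(3 * 3) = 2/3
cos_score = cosine_sets(users_game_x, users_game_y)
print(cos_score)
```

On the same two sets the Jaccard similarity is 0.5, so the two measures can produce different scores and possibly different rankings.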