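Case study:
Beer and diapers: While analyzing its sales records, Walmart found that beer and diapers were often bought together. It rearranged its shelves to place the two side by side, and beer sales did go up. A possible explanation: fathers buying diapers for the baby pick up some beer for themselves along the way.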
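Overview:
Apriori is one of the most influential algorithms for mining the frequent itemsets behind Boolean association rules. The name Apriori comes from the fact that the algorithm uses prior knowledge about the properties of frequent itemsets.
Next, we use a supermarket-order example to understand the key concepts of association analysis: Support, Confidence, and Lift.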
Example: Support('Bread') = 4/5 = 0.8, Support('Milk') = 4/5 = 0.8
Support('Bread+Milk') = 3/5 = 0.6
Example: Confidence('Bread' -> 'Milk') = Support('Bread+Milk') / Support('Bread') = 0.6/0.8 = 0.75
Example: Lift('Bread' -> 'Milk') = Confidence('Bread' -> 'Milk') / Support('Milk') = 0.75/0.8 = 0.9375
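Lift falls into three cases:
Lift > 1: the antecedent raises the likelihood of the consequent (positive association).
Lift = 1: the antecedent and the consequent are independent.
Lift < 1: the antecedent lowers the likelihood of the consequent (negative association).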
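The three metrics can be computed directly from a list of transactions. The following is a minimal sketch, not a library implementation: the `orders` list is a hypothetical dataset, chosen only so that its counts match the figures above, and the helper names `support`, `confidence`, and `lift` are illustrative.

```python
# Hypothetical orders: Bread appears in 4 of 5, Milk in 4 of 5, both in 3 of 5.
orders = [
    {"Bread", "Milk"},
    {"Bread", "Milk", "Diaper"},
    {"Bread", "Milk", "Beer"},
    {"Bread", "Diaper"},
    {"Milk", "Beer"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of the combined itemset divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence of the rule divided by support of the consequent."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


s_bread = support({"Bread"}, orders)
s_both = support({"Bread", "Milk"}, orders)
conf_bread_milk = confidence({"Bread"}, {"Milk"}, orders)
lift_bread_milk = lift({"Bread"}, {"Milk"}, orders)
print(s_bread, s_both, conf_bread_milk, lift_bread_milk)
```

Here the lift is below 1, so in this toy data buying Bread makes Milk slightly less likely than its baseline rate.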
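Principle:
For this algorithm, mining association rules comes down to finding frequent itemsets:
Procedure:
Set K = 1 and compute the support of every K-itemset;
Filter out the itemsets whose support is below the minimum support;
If no itemsets remain, the frequent itemsets found for K-1 are the final result. Otherwise, set K = K+1 and repeat the support-counting and filtering steps.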
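The procedure above can be sketched in plain Python. This is an illustrative, unoptimized version under stated assumptions: the function name `apriori_frequent_itemsets` and the `orders` data are hypothetical, and it returns the frequent itemsets of every size rather than only the largest ones.

```python
from itertools import combinations


def apriori_frequent_itemsets(transactions, min_support):
    """Return {frozenset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support_of(candidate):
        return sum(candidate <= t for t in transactions) / n

    # K = 1: every distinct item is a candidate 1-itemset.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Compute support for each K-itemset and drop those below the threshold.
        level = {c: support_of(c) for c in candidates}
        level = {c: s for c, s in level.items() if s >= min_support}
        if not level:
            break  # nothing survives at size K, so the search stops here
        frequent.update(level)
        # Join surviving K-itemsets into (K+1)-candidates and keep only those
        # whose K-subsets are all frequent (the prior knowledge Apriori relies on).
        k += 1
        candidates = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in level for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
    return frequent


# Hypothetical orders, used only to exercise the function.
orders = [
    {"Bread", "Milk"},
    {"Bread", "Milk", "Diaper"},
    {"Bread", "Milk", "Beer"},
    {"Bread", "Diaper"},
    {"Milk", "Beer"},
]
frequent = apriori_frequent_itemsets(orders, min_support=0.5)
for itemset, s in frequent.items():
    print(sorted(itemset), s)
```

The pruning step is what keeps the search tractable: any itemset with an infrequent subset cannot itself be frequent, so it is never counted.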
import pandas as pd
import matplotlib.pyplot as plt
import mlxtend
import numpy as np
movie_data_file = './movie_dataset/movies_metadata.csv'
ratings_file = './movie_dataset/ratings_small.csv'
movie_data_df = pd.read_csv(movie_data_file)
ratings_df = pd.read_csv(ratings_file)
c:\users\ysilhouette\documents\pyenv\py3.6.5\lib\site-packages\IPython\core\interactiveshell.py:3072: DtypeWarning: Columns (10) have mixed types.Specify dtype option on import or set low_memory=False.
interactivity=interactivity, compiler=compiler, result=result)
movie_data_df.head(5)
| | adult | belongs_to_collection | budget | genres | homepage | id | imdb_id | original_language | original_title | overview | ... | release_date | revenue | runtime | spoken_languages | status | tagline | title | video | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | {'id': 10194, 'name': 'Toy Story Collection', ... | 30000000 | [{'id': 16, 'name': 'Animation'}, {'id': 35, '... | http://toystory.disney.com/toy-story | 862 | tt0114709 | en | Toy Story | Led by Woody, Andy's toys live happily in his ... | ... | 1995-10-30 | 373554033.0 | 81.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | NaN | Toy Story | False | 7.7 | 5415.0 |
| 1 | False | NaN | 65000000 | [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... | NaN | 8844 | tt0113497 | en | Jumanji | When siblings Judy and Peter discover an encha... | ... | 1995-12-15 | 262797249.0 | 104.0 | [{'iso_639_1': 'en', 'name': 'English'}, {'iso... | Released | Roll the dice and unleash the excitement! | Jumanji | False | 6.9 | 2413.0 |
| 2 | False | {'id': 119050, 'name': 'Grumpy Old Men Collect... | 0 | [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... | NaN | 15602 | tt0113228 | en | Grumpier Old Men | A family wedding reignites the ancient feud be... | ... | 1995-12-22 | 0.0 | 101.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Still Yelling. Still Fighting. Still Ready for... | Grumpier Old Men | False | 6.5 | 92.0 |
| 3 | False | NaN | 16000000 | [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... | NaN | 31357 | tt0114885 | en | Waiting to Exhale | Cheated on, mistreated and stepped on, the wom... | ... | 1995-12-22 | 81452156.0 | 127.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Friends are the people who let you be yourself... | Waiting to Exhale | False | 6.1 | 34.0 |
| 4 | False | {'id': 96871, 'name': 'Father of the Bride Col... | 0 | [{'id': 35, 'name': 'Comedy'}] | NaN | 11862 | tt0113041 | en | Father of the Bride Part II | Just when George Banks has recovered from his ... | ... | 1995-02-10 | 76578911.0 | 106.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Just When His World Is Back To Normal... He's ... | Father of the Bride Part II | False | 5.7 | 173.0 |
5 rows × 24 columns
movie_data_df.describe()
| | revenue | runtime | vote_average | vote_count |
|---|---|---|---|---|
| count | 4.546000e+04 | 45203.000000 | 45460.000000 | 45460.000000 |
| mean | 1.120935e+07 | 94.128199 | 5.618207 | 109.897338 |
| std | 6.433225e+07 | 38.407810 | 1.924216 | 491.310374 |
| min | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000e+00 | 85.000000 | 5.000000 | 3.000000 |
| 50% | 0.000000e+00 | 95.000000 | 6.000000 | 10.000000 |
| 75% | 0.000000e+00 | 107.000000 | 6.800000 | 34.000000 |
| max | 2.787965e+09 | 1256.000000 | 10.000000 | 14075.000000 |
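movie_data_df.info  # parentheses are missing, so this shows the bound method's repr (the frame itself) rather than the info() summary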
<bound method DataFrame.info of adult belongs_to_collection budget \
0 False {'id': 10194, 'name': 'Toy Story Collection', ... 30000000
1 False NaN 65000000
2 False {'id': 119050, 'name': 'Grumpy Old Men Collect... 0
3 False NaN 16000000
4 False {'id': 96871, 'name': 'Father of the Bride Col... 0
... ... ... ...
45461 False NaN 0
45462 False NaN 0
45463 False NaN 0
45464 False NaN 0
45465 False NaN 0
genres \
0 [{'id': 16, 'name': 'Animation'}, {'id': 35, '...
1 [{'id': 12, 'name': 'Adventure'}, {'id': 14, '...
2 [{'id': 10749, 'name': 'Romance'}, {'id': 35, ...
3 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...
4 [{'id': 35, 'name': 'Comedy'}]
... ...
45461 [{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...
45462 [{'id': 18, 'name': 'Drama'}]
45463 [{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...
45464 []
45465 []
homepage id imdb_id \
0 http://toystory.disney.com/toy-story 862 tt0114709
1 NaN 8844 tt0113497
2 NaN 15602 tt0113228
3 NaN 31357 tt0114885
4 NaN 11862 tt0113041
... ... ... ...
45461 http://www.imdb.com/title/tt6209470/ 439050 tt6209470
45462 NaN 111109 tt2028550
45463 NaN 67758 tt0303758
45464 NaN 227506 tt0008536
45465 NaN 461257 tt6980792
original_language original_title \
0 en Toy Story
1 en Jumanji
2 en Grumpier Old Men
3 en Waiting to Exhale
4 en Father of the Bride Part II
... ... ...
45461 fa رگ خواب
45462 tl Siglo ng Pagluluwal
45463 en Betrayal
45464 en Satana likuyushchiy
45465 en Queerama
overview ... release_date \
0 Led by Woody, Andy's toys live happily in his ... ... 1995-10-30
1 When siblings Judy and Peter discover an encha... ... 1995-12-15
2 A family wedding reignites the ancient feud be... ... 1995-12-22
3 Cheated on, mistreated and stepped on, the wom... ... 1995-12-22
4 Just when George Banks has recovered from his ... ... 1995-02-10
... ... ... ...
45461 Rising and falling between a man and woman. ... NaN
45462 An artist struggles to finish his work while a... ... 2011-11-17
45463 When one of her hits goes wrong, a professiona... ... 2003-08-01
45464 In a small town live two brothers, one a minis... ... 1917-10-21
45465 50 years after decriminalisation of homosexual... ... 2017-06-09
revenue runtime spoken_languages \
0 373554033.0 81.0 [{'iso_639_1': 'en', 'name': 'English'}]
1 262797249.0 104.0 [{'iso_639_1': 'en', 'name': 'English'}, {'iso...
2 0.0 101.0 [{'iso_639_1': 'en', 'name': 'English'}]
3 81452156.0 127.0 [{'iso_639_1': 'en', 'name': 'English'}]
4 76578911.0 106.0 [{'iso_639_1': 'en', 'name': 'English'}]
... ... ... ...
45461 0.0 90.0 [{'iso_639_1': 'fa', 'name': 'فارسی'}]
45462 0.0 360.0 [{'iso_639_1': 'tl', 'name': ''}]
45463 0.0 90.0 [{'iso_639_1': 'en', 'name': 'English'}]
45464 0.0 87.0 []
45465 0.0 75.0 [{'iso_639_1': 'en', 'name': 'English'}]
status tagline \
0 Released NaN
1 Released Roll the dice and unleash the excitement!
2 Released Still Yelling. Still Fighting. Still Ready for...
3 Released Friends are the people who let you be yourself...
4 Released Just When His World Is Back To Normal... He's ...
... ... ...
45461 Released Rising and falling between a man and woman
45462 Released NaN
45463 Released A deadly game of wits.
45464 Released NaN
45465 Released NaN
title video vote_average vote_count
0 Toy Story False 7.7 5415.0
1 Jumanji False 6.9 2413.0
2 Grumpier Old Men False 6.5 92.0
3 Waiting to Exhale False 6.1 34.0
4 Father of the Bride Part II False 5.7 173.0
... ... ... ... ...
45461 Subdue False 4.0 1.0
45462 Century of Birthing False 9.0 3.0
45463 Betrayal False 3.8 6.0
45464 Satan Triumphant False 0.0 0.0
45465 Queerama False 0.0 0.0
[45466 rows x 24 columns]>
movie_data_df.count()
adult 45466
belongs_to_collection 4494
budget 45466
genres 45466
homepage 7782
id 45466
imdb_id 45449
original_language 45455
original_title 45466
overview 44512
popularity 45461
poster_path 45080
production_companies 45463
production_countries 45463
release_date 45379
revenue 45460
runtime 45203
spoken_languages 45460
status 45379
tagline 20412
title 45460
video 45460
vote_average 45460
vote_count 45460
dtype: int64
movie_data_df.columns
Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
'imdb_id', 'original_language', 'original_title', 'overview',
'popularity', 'poster_path', 'production_companies',
'production_countries', 'release_date', 'revenue', 'runtime',
'spoken_languages', 'status', 'tagline', 'title', 'video',
'vote_average', 'vote_count'],
dtype='object')
ratings_df.head(5)
| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 31 | 2.5 | 1260759144 |
| 1 | 1 | 1029 | 3.0 | 1260759179 |
| 2 | 1 | 1061 | 3.0 | 1260759182 |
| 3 | 1 | 1129 | 2.0 | 1260759185 |
| 4 | 1 | 1172 | 4.0 | 1260759205 |
ratings_df.columns
Index(['userId', 'movieId', 'rating', 'timestamp'], dtype='object')
ratings_df.count()
userId 100004
movieId 100004
rating 100004
timestamp 100004
dtype: int64
ratings_df.shape
(100004, 4)
movie_data_df.shape
(45466, 24)
movie_data_df_t=movie_data_df[['title','id']]
movie_data_df_t.dtypes
title object
id object
dtype: object
ratings_df_s = ratings_df.drop(['timestamp'], axis=1)  # axis=0 drops rows; axis=1 drops columns
ratings_df_s.dtypes
userId int64
movieId int64
rating float64
dtype: object
# pd.to_numeric converts the id column from strings to numbers; values that cannot be converted become NaN
pd.to_numeric(movie_data_df_t['id'],errors='coerce')
0 862.0
1 8844.0
2 15602.0
3 31357.0
4 11862.0
...
45461 439050.0
45462 111109.0
45463 67758.0
45464 227506.0
45465 461257.0
Name: id, Length: 45466, dtype: float64
# np.where returns the positions where the condition in parentheses holds
np.where(pd.to_numeric(movie_data_df_t['id'], errors='coerce').isna())  # positions of the missing values; isna() returns True for NaN and False otherwise
(array([19730, 29503, 35587], dtype=int64),)
movie_data_df_t.iloc[19730]
title NaN
id 1997-08-20
Name: 19730, dtype: object
movie_data_df_t.iloc[[19730,29503,35587]]
| | title | id |
|---|---|---|
| 19730 | NaN | 1997-08-20 |
| 29503 | NaN | 2012-09-29 |
| 35587 | NaN | 2014-01-01 |
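The same check can be written with a boolean mask instead of `np.where`. The sketch below uses a small hypothetical Series in place of the real `id` column; its last value imitates a date that ended up in the wrong column.

```python
import pandas as pd

# Toy stand-in for the id column (hypothetical values).
ids = pd.Series(["862", "8844", "1997-08-20"])

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
numeric_ids = pd.to_numeric(ids, errors="coerce")

# True where parsing failed; indexing the raw column with it shows the offending values.
bad_mask = numeric_ids.isna()
print(ids[bad_mask])
```

Looking at the raw values before overwriting the column makes it easier to tell whether the rows are malformed or merely formatted differently.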
# Assign the converted values back to the id column
movie_data_df_t['id'] = pd.to_numeric(movie_data_df_t['id'], errors='coerce')
c:\users\ysilhouette\documents\pyenv\py3.6.5\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
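This warning is typically raised when a column is assigned on a DataFrame that was itself obtained by selecting from another DataFrame. One common way to avoid it is to take an explicit copy of the selection first. The sketch below assumes that approach; the `movies` frame and its values are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Hypothetical miniature of the movie metadata.
movies = pd.DataFrame(
    {
        "title": ["Toy Story", None],
        "id": ["862", "1997-08-20"],
        "budget": ["30000000", "0"],
    }
)

# .copy() makes the two-column subset an independent DataFrame, so assigning
# to it is unambiguous and leaves `movies` untouched.
subset = movies[["title", "id"]].copy()
subset["id"] = pd.to_numeric(subset["id"], errors="coerce")

# Drop rows whose id could not be parsed, then store the ids as integers.
subset = subset.dropna(subset=["id"]).astype({"id": "int64"})
print(subset)
```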
movie_data_df_t.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title 45460 non-null object
1 id 45463 non-null float64
dtypes: float64(1), object(1)
memory usage: 710.5+ KB
movie_data_df_t.iloc[[19730,29503,35587]]
| | title | id |
|---|---|---|
| 19730 | NaN | NaN |
| 29503 | NaN | NaN |
| 35587 | NaN | NaN |
movie_data_df_t.shape
(45466, 2)
movie_data_df_t.drop(np.where(movie_data_df_t['id'].isna())[0], inplace=True)
c:\users\ysilhouette\documents\pyenv\py3.6.5\lib\site-packages\pandas\core\frame.py:4174: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
errors=errors,
movie_data_df_t.shape
(45463, 2)
movie_data_df_t.duplicated(['id','title']).sum()
30
movie_data_df_t.drop_duplicates(['id'],inplace=True)
c:\users\ysilhouette\documents\pyenv\py3.6.5\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
"""Entry point for launching an IPython kernel.
movie_data_df_t.shape
(45433, 2)
ratings_df_s.duplicated(['userId','movieId']).sum()
0
movie_data_df_t['id'] = movie_data_df_t['id'].astype(np.int64)
c:\users\ysilhouette\documents\pyenv\py3.6.5\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
"""Entry point for launching an IPython kernel.
movie_data_df_t.dtypes
title object
id int64
dtype: object
ratings_df_s.dtypes
userId int64
movieId int64
rating float64
dtype: object
# Merge the left and right DataFrames, matching movieId (left) against id (right)
ratings_df_s = pd.merge(ratings_df_s,movie_data_df_t, left_on='movieId',right_on='id')
ratings_df_s.head()
| | userId | movieId | rating | title | id |
|---|---|---|---|---|---|
| 0 | 1 | 1371 | 2.5 | Rocky III | 1371 |
| 1 | 4 | 1371 | 4.0 | Rocky III | 1371 |
| 2 | 7 | 1371 | 3.0 | Rocky III | 1371 |
| 3 | 19 | 1371 | 4.0 | Rocky III | 1371 |
| 4 | 21 | 1371 | 3.0 | Rocky III | 1371 |
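`pd.merge` performs an inner join by default, so ratings whose `movieId` has no matching `id` are dropped from the result. The toy frames below are hypothetical and only illustrate that behavior, along with the fact that both key columns are kept.

```python
import pandas as pd

# Hypothetical ratings and titles; movieId 20 has no title, id 30 has no ratings.
ratings = pd.DataFrame(
    {"userId": [1, 1, 2], "movieId": [10, 20, 10], "rating": [4.0, 3.0, 5.0]}
)
titles = pd.DataFrame({"id": [10, 30], "title": ["Movie A", "Movie C"]})

# how="inner" is the default: only keys present on both sides survive.
merged = pd.merge(ratings, titles, left_on="movieId", right_on="id")
print(merged)
```

Because the key columns have different names, both `movieId` and `id` appear in the output, and one of them can be dropped afterwards.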
ratings_df_s.drop(['id'],axis=1,inplace=True)
ratings_df_s
| | userId | movieId | rating | title |
|---|---|---|---|---|
| 0 | 1 | 1371 | 2.5 | Rocky III |
| 1 | 4 | 1371 | 4.0 | Rocky III |
| 2 | 7 | 1371 | 3.0 | Rocky III |
| 3 | 19 | 1371 | 4.0 | Rocky III |
| 4 | 21 | 1371 | 3.0 | Rocky III |
| ... | ... | ... | ... | ... |
| 44984 | 652 | 129009 | 4.0 | Love Is a Ball |
| 44985 | 653 | 2103 | 3.0 | Solaris |
| 44986 | 659 | 167 | 4.0 | K-PAX |
| 44987 | 659 | 563 | 3.0 | Starship Troopers |
| 44988 | 665 | 129 | 3.0 | Spirited Away |
44989 rows × 4 columns
ratings_df_s.shape
(44989, 4)
# Number of movies that have rating records
len(ratings_df_s['title'].unique())
2794
ratings_df_s['title'].unique()
array(['Rocky III', 'Greed', 'American Pie', ..., 'K-PAX',
'Starship Troopers', 'Spirited Away'], dtype=object)
ratings_df_s.groupby([ratings_df_s['title'],ratings_df_s['rating']]).count().reset_index()
| | title | rating | userId | movieId |
|---|---|---|---|---|
| 0 | !Women Art Revolution | 3.0 | 1 | 1 |
| 1 | !Women Art Revolution | 3.5 | 1 | 1 |
| 2 | 'Gator Bait | 0.5 | 1 | 1 |
| 3 | 'Twas the Night Before Christmas | 3.5 | 1 | 1 |
| 4 | 'Twas the Night Before Christmas | 4.5 | 1 | 1 |
| ... | ... | ... | ... | ... |
| 10263 | À nos amours | 4.0 | 5 | 5 |
| 10264 | À nos amours | 4.5 | 1 | 1 |
| 10265 | À nos amours | 5.0 | 1 | 1 |
| 10266 | Ödipussi | 4.5 | 1 | 1 |
| 10267 | Şaban Oğlu Şaban | 4.5 | 1 | 1 |
10268 rows × 4 columns
ratings_df_s.groupby(ratings_df_s['title']).count().reset_index()
| | title | userId | movieId | rating |
|---|---|---|---|---|
| 0 | !Women Art Revolution | 2 | 2 | 2 |
| 1 | 'Gator Bait | 1 | 1 | 1 |
| 2 | 'Twas the Night Before Christmas | 2 | 2 | 2 |
| 3 | ...And God Created Woman | 1 | 1 | 1 |
| 4 | 00 Schneider - Jagd auf Nihil Baxter | 2 | 2 | 2 |
| ... | ... | ... | ... | ... |
| 2789 | xXx | 28 | 28 | 28 |
| 2790 | ¡Three Amigos! | 1 | 1 | 1 |
| 2791 | À nos amours | 14 | 14 | 14 |
| 2792 | Ödipussi | 1 | 1 | 1 |
| 2793 | Şaban Oğlu Şaban | 1 | 1 | 1 |
2794 rows × 4 columns
ratings_df_s_allcounts = ratings_df_s.groupby(ratings_df_s['title'])['userId'].count().reset_index()
ratings_df_s_allcounts = ratings_df_s_allcounts.rename(columns = {'userId':'totalRatings'})
ratings_df_s_allcounts
| | title | totalRatings |
|---|---|---|
| 0 | !Women Art Revolution | 2 |
| 1 | 'Gator Bait | 1 |
| 2 | 'Twas the Night Before Christmas | 2 |
| 3 | ...And God Created Woman | 1 |
| 4 | 00 Schneider - Jagd auf Nihil Baxter | 2 |
| ... | ... | ... |
| 2789 | xXx | 28 |
| 2790 | ¡Three Amigos! | 1 |
| 2791 | À nos amours | 14 |
| 2792 | Ödipussi | 1 |
| 2793 | Şaban Oğlu Şaban | 1 |
2794 rows × 2 columns
ratings_df_s_allcounts.shape
(2794, 2)
ratings_df_s_allcounts['totalRatings'].describe()
count 2794.000000
mean 16.102004
std 31.481795
min 1.000000
25% 1.000000
50% 4.000000
75% 15.750000
max 324.000000
Name: totalRatings, dtype: float64
ratings_df_s_allcounts.hist()
array([[<AxesSubplot:title={'center':'totalRatings'}>]], dtype=object)

ratings_df_s_allcounts['totalRatings'].quantile(np.arange(0.6,1, 0.01))  # quantiles from 0.60 to 0.99
0.60 7.00
0.61 7.00
0.62 7.00
0.63 8.00
0.64 8.00
0.65 9.00
0.66 9.00
0.67 10.00
0.68 10.00
0.69 11.00
0.70 12.00
0.71 12.00
0.72 13.00
0.73 14.00
0.74 14.00
0.75 15.75
0.76 17.00
0.77 18.00
0.78 19.00
0.79 20.00
0.80 21.00
0.81 22.33
0.82 24.00
0.83 26.00
0.84 27.00
0.85 29.00
0.86 31.00
0.87 34.00
0.88 37.00
0.89 41.77
0.90 45.00
0.91 49.00
0.92 52.56
0.93 59.00
0.94 64.42
0.95 71.00
0.96 83.28
0.97 98.21
0.98 119.14
0.99 168.49
Name: totalRatings, dtype: float64
votes_count_threshold = 20
ratings_df_s_top=ratings_df_s_allcounts.query('totalRatings > @votes_count_threshold').reset_index()
ratings_df_s_top
| | index | title | totalRatings |
|---|---|---|---|
| 0 | 18 | 20,000 Leagues Under the Sea | 89 |
| 1 | 19 | 2001: A Space Odyssey | 87 |
| 2 | 24 | 24 Hour Party People | 22 |
| 3 | 26 | 28 Days Later | 26 |
| 4 | 27 | 28 Weeks Later | 47 |
| ... | ... | ... | ... |
| 575 | 2770 | Young Adam | 34 |
| 576 | 2772 | Young Frankenstein | 29 |
| 577 | 2774 | Young and Innocent | 193 |
| 578 | 2781 | Zatoichi | 61 |
| 579 | 2789 | xXx | 28 |
580 rows × 3 columns
ratings_df_s_top.drop(['index'],axis=1,inplace=True)
ratings_df_s_top.head()
| | title | totalRatings |
|---|---|---|
| 0 | 20,000 Leagues Under the Sea | 89 |
| 1 | 2001: A Space Odyssey | 87 |
| 2 | 24 Hour Party People | 22 |
| 3 | 28 Days Later | 26 |
| 4 | 28 Weeks Later | 47 |
ratings_df_s['title']
0 Rocky III
1 Rocky III
2 Rocky III
3 Rocky III
4 Rocky III
...
44984 Love Is a Ball
44985 Solaris
44986 K-PAX
44987 Starship Troopers
44988 Spirited Away
Name: title, Length: 44989, dtype: object
ratings_df_s_top['title']
0 20,000 Leagues Under the Sea
1 2001: A Space Odyssey
2 24 Hour Party People
3 28 Days Later
4 28 Weeks Later
...
575 Young Adam
576 Young Frankenstein
577 Young and Innocent
578 Zatoichi
579 xXx
Name: title, Length: 580, dtype: object
ratings_df_s[ratings_df_s['title'].isin(ratings_df_s_top['title'])]  # ratings of movies with more than 20 ratings
| | userId | movieId | rating | title |
|---|---|---|---|---|
| 0 | 1 | 1371 | 2.5 | Rocky III |
| 1 | 4 | 1371 | 4.0 | Rocky III |
| 2 | 7 | 1371 | 3.0 | Rocky III |
| 3 | 19 | 1371 | 4.0 | Rocky III |
| 4 | 21 | 1371 | 3.0 | Rocky III |
| ... | ... | ... | ... | ... |
| 44507 | 624 | 3057 | 4.0 | Frankenstein |
| 44781 | 547 | 97936 | 3.0 | Sweet November |
| 44782 | 624 | 97936 | 3.0 | Sweet November |
| 44909 | 609 | 1450 | 5.0 | Blood: The Last Vampire |
| 44985 | 653 | 2103 | 3.0 | Solaris |
34552 rows × 4 columns
ratings_df_s[~ratings_df_s['title'].isin(ratings_df_s_top['title'])]  # ratings of movies with 20 or fewer ratings
| | userId | movieId | rating | title |
|---|---|---|---|---|
| 1714 | 2 | 248 | 3.0 | Pocketful of Miracles |
| 1715 | 36 | 248 | 2.0 | Pocketful of Miracles |
| 1716 | 110 | 248 | 4.0 | Pocketful of Miracles |
| 1717 | 239 | 248 | 4.0 | Pocketful of Miracles |
| 1718 | 242 | 248 | 3.0 | Pocketful of Miracles |
| ... | ... | ... | ... | ... |
| 44983 | 652 | 127728 | 5.0 | 8:46 |
| 44984 | 652 | 129009 | 4.0 | Love Is a Ball |
| 44986 | 659 | 167 | 4.0 | K-PAX |
| 44987 | 659 | 563 | 3.0 | Starship Troopers |
| 44988 | 665 | 129 | 3.0 | Spirited Away |
10437 rows × 4 columns
ratings_df_s_cntD20 = ratings_df_s[ratings_df_s['title'].isin(ratings_df_s_top['title'])]
ratings_df_s_cntX20 = ratings_df_s[~ratings_df_s['title'].isin(ratings_df_s_top['title'])]
ratings_df_s_cntD20.shape
(34552, 4)
ratings_df_s_cntX20.shape
(10437, 4)
ratings_df_s_cntD20.isna().sum()  # check for missing values
userId 0
movieId 0
rating 0
title 0
dtype: int64
ratings_df_s_cntD20.duplicated(['userId','title']).sum()
140
ratings_df_s_cntD20=ratings_df_s_cntD20.drop_duplicates(['userId','title'])  # keep only one rating record per user per movie
ratings_df_s_cntD20
| | userId | movieId | rating | title |
|---|---|---|---|---|
| 0 | 1 | 1371 | 2.5 | Rocky III |
| 1 | 4 | 1371 | 4.0 | Rocky III |
| 2 | 7 | 1371 | 3.0 | Rocky III |
| 3 | 19 | 1371 | 4.0 | Rocky III |
| 4 | 21 | 1371 | 3.0 | Rocky III |
| ... | ... | ... | ... | ... |
| 44506 | 472 | 3057 | 3.0 | Frankenstein |
| 44507 | 624 | 3057 | 4.0 | Frankenstein |
| 44782 | 624 | 97936 | 3.0 | Sweet November |
| 44909 | 609 | 1450 | 5.0 | Blood: The Last Vampire |
| 44985 | 653 | 2103 | 3.0 | Solaris |
34412 rows × 4 columns
ratings_df_s_cntD20.duplicated(['userId','title']).sum()
0
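# pivot reshapes the DataFrame's records into a table, using pivot(index=..., columns=..., values=...). Older pandas versions also accepted the positional form pivot(index_column, columns_column, values_column); pandas 2.0 and later require keyword arguments.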
ratings_df_s_cntD20_for_apriori = ratings_df_s_cntD20.pivot(index='userId',columns='title',values='rating')
ratings_df_s_cntD20_for_apriori
| title | 20,000 Leagues Under the Sea | 2001: A Space Odyssey | 24 Hour Party People | 28 Days Later | 28 Weeks Later | 300 | 48 Hrs. | 5 Card Stud | 7 Virgins | 8 Women | ... | Within the Woods | X-Men Origins: Wolverine | Y Tu Mamá También | Yankee Doodle Dandy | Yesterday | Young Adam | Young Frankenstein | Young and Innocent | Zatoichi | xXx |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| userId | |||||||||||||||||||||
| 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | 3.0 | NaN | NaN | NaN | NaN | 5.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | 3.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3.5 | NaN | NaN |
| 4 | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | 5.0 | NaN | NaN | NaN | NaN | 5.0 | NaN | NaN | NaN |
| 5 | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3.5 | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 667 | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 668 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 669 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 670 | NaN | NaN | NaN | NaN | NaN | NaN | 3.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 671 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | NaN | NaN |
671 rows × 580 columns
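A small sketch of what `pivot` does here, using hypothetical data: each `(userId, title)` pair becomes one cell, and pairs with no rating become NaN. It also shows why the duplicate `(userId, title)` rows had to be dropped first, since `pivot` raises a `ValueError` when a pair occurs more than once.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) pair.
long_ratings = pd.DataFrame(
    {
        "userId": [1, 1, 2],
        "title": ["Movie A", "Movie B", "Movie A"],
        "rating": [4.0, 3.0, 5.0],
    }
)

# Wide format: one row per user, one column per title, NaN where no rating exists.
wide = long_ratings.pivot(index="userId", columns="title", values="rating")
print(wide)

# Repeating a (userId, title) pair makes the reshape ambiguous.
with_duplicate = pd.concat([long_ratings, long_ratings.iloc[[0]]], ignore_index=True)
try:
    with_duplicate.pivot(index="userId", columns="title", values="rating")
    pivot_failed = False
except ValueError:
    pivot_failed = True
print(pivot_failed)
```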
ratings_df_s_cntD20_for_apriori= ratings_df_s_cntD20_for_apriori.fillna(0)  # fill missing values with 0
def encode_units(x):  # valid-rating rule: 1 means a valid rating, 0 means none
    if x <= 0:
        return 0
    if x > 0:
        return 1
ratings_df_s_cntD20_for_apriori = ratings_df_s_cntD20_for_apriori.applymap(encode_units)
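The same 0/1 encoding can also be produced with a single vectorized comparison instead of calling a Python function per cell. This is an alternative sketch, not the approach used above; `ratings_matrix` is a hypothetical stand-in for the pivoted table.

```python
import numpy as np
import pandas as pd

# Hypothetical user x movie rating matrix with gaps.
ratings_matrix = pd.DataFrame(
    {"Movie A": [4.0, np.nan], "Movie B": [np.nan, 2.5]},
    index=pd.Index([1, 2], name="userId"),
)

# Fill gaps with 0, mark every positive rating as True, then cast to 0/1.
basket = (ratings_matrix.fillna(0) > 0).astype(int)
print(basket)
```

On a 671 x 580 table the difference is small, but the vectorized form avoids one function call per cell.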
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
ratings_df_s_cntD20_for_apriori.head()
| title | 20,000 Leagues Under the Sea | 2001: A Space Odyssey | 24 Hour Party People | 28 Days Later | 28 Weeks Later | 300 | 48 Hrs. | 5 Card Stud | 7 Virgins | 8 Women | ... | Within the Woods | X-Men Origins: Wolverine | Y Tu Mamá También | Yankee Doodle Dandy | Yesterday | Young Adam | Young Frankenstein | Young and Innocent | Zatoichi | xXx |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| userId | |||||||||||||||||||||
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
5 rows × 580 columns
ratings_df_s_cntD20_for_apriori.isna().sum()  # check for NaN values
title
20,000 Leagues Under the Sea 0
2001: A Space Odyssey 0
24 Hour Party People 0
28 Days Later 0
28 Weeks Later 0
..
Young Adam 0
Young Frankenstein 0
Young and Innocent 0
Zatoichi 0
xXx 0
Length: 580, dtype: int64
frequent_itemsets = apriori(ratings_df_s_cntD20_for_apriori, min_support=0.10, use_colnames=True)  # generate the frequent itemsets that meet the minimum support
frequent_itemsets.sort_values('support',ascending=False)  # frequent itemsets sorted by support, descending
| | support | itemsets |
|---|---|---|
| 111 | 0.482861 | (Terminator 3: Rise of the Machines) |
| 130 | 0.463487 | (The Million Dollar Hotel) |
| 105 | 0.454545 | (Solaris) |
| 113 | 0.433681 | (The 39 Steps) |
| 69 | 0.408346 | (Monsoon Wedding) |
| ... | ... | ... |
| 1613 | 0.101341 | (Sleepless in Seattle, 5 Card Stud, The Tunnel) |
| 5455 | 0.101341 | (Beauty and the Beast, Rain Man, Terminator 3:... |
| 5454 | 0.101341 | (The Passion of Joan of Arc, Beauty and the Be... |
| 6769 | 0.101341 | (The Million Dollar Hotel, The Hours, Three Co... |
| 3108 | 0.101341 | (The Conversation, Men in Black II, The Millio... |
7327 rows × 2 columns
rules= association_rules(frequent_itemsets, metric="lift", min_threshold=1)  # generate association rules, keeping only those with lift >= 1
rules
| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | (5 Card Stud) | (48 Hrs.) | 0.298063 | 0.298063 | 0.108793 | 0.365000 | 1.224575 | 0.019952 | 1.105413 |
| 1 | (48 Hrs.) | (5 Card Stud) | 0.298063 | 0.298063 | 0.108793 | 0.365000 | 1.224575 | 0.019952 | 1.105413 |
| 2 | (A Clockwork Orange) | (48 Hrs.) | 0.152012 | 0.298063 | 0.102832 | 0.676471 | 2.269559 | 0.057523 | 2.169625 |
| 3 | (48 Hrs.) | (A Clockwork Orange) | 0.298063 | 0.152012 | 0.102832 | 0.345000 | 2.269559 | 0.057523 | 1.294638 |
| 4 | (48 Hrs.) | (A Nightmare on Elm Street) | 0.298063 | 0.268256 | 0.156483 | 0.525000 | 1.957083 | 0.076526 | 1.540513 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 75531 | (The Hours) | (The Million Dollar Hotel, Terminator 3: Rise ... | 0.301043 | 0.126677 | 0.104322 | 0.346535 | 2.735585 | 0.066187 | 1.336449 |
| 75532 | (Terminator 3: Rise of the Machines) | (The Million Dollar Hotel, The Hours, Rain Man... | 0.482861 | 0.114754 | 0.104322 | 0.216049 | 1.882716 | 0.048912 | 1.129211 |
| 75533 | (Rain Man) | (The Million Dollar Hotel, The Hours, Terminat... | 0.295082 | 0.120715 | 0.104322 | 0.353535 | 2.928669 | 0.068701 | 1.360143 |
| 75534 | (Sissi) | (The Million Dollar Hotel, The Hours, Terminat... | 0.317437 | 0.117735 | 0.104322 | 0.328638 | 2.791347 | 0.066949 | 1.314143 |
| 75535 | (Solaris) | (The Million Dollar Hotel, The Hours, Terminat... | 0.454545 | 0.113264 | 0.104322 | 0.229508 | 2.026316 | 0.052838 | 1.150870 |
75536 rows × 9 columns
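The metric columns can be reproduced by hand from the three support values. The sketch below uses the standard definitions of confidence, lift, leverage, and conviction and plugs in the rounded supports of the first rule in the table above, (5 Card Stud) -> (48 Hrs.). The function name `rule_metrics` is illustrative and is not part of mlxtend.

```python
def rule_metrics(antecedent_support, consequent_support, support):
    """Derive rule metrics from the supports of A, C, and A together with C."""
    confidence = support / antecedent_support
    lift = confidence / consequent_support
    # Leverage: observed joint support minus the support expected under independence.
    leverage = support - antecedent_support * consequent_support
    # Conviction is undefined when confidence == 1 (division by zero); not handled here.
    conviction = (1 - consequent_support) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Rounded values taken from the first row of the rules table.
metrics = rule_metrics(0.298063, 0.298063, 0.108793)
print(metrics)
```

The results agree with the table to about four decimal places; the small differences come from the inputs being rounded.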
rules.sort_values('lift',ascending=False)
| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 1473 | (Muxmäuschenstill) | (Waiter) | 0.156483 | 0.120715 | 0.105812 | 0.676190 | 5.601529 | 0.086922 | 2.715438 |
| 1472 | (Waiter) | (Muxmäuschenstill) | 0.120715 | 0.156483 | 0.105812 | 0.876543 | 5.601529 | 0.086922 | 6.832489 |
| 38208 | (Titanic, Big Fish) | (Psycho, Rain Man) | 0.150522 | 0.131148 | 0.101341 | 0.673267 | 5.133663 | 0.081601 | 2.659215 |
| 38209 | (Psycho, Rain Man) | (Titanic, Big Fish) | 0.131148 | 0.150522 | 0.101341 | 0.772727 | 5.133663 | 0.081601 | 3.737705 |
| 38238 | (Titanic, Big Fish) | (Psycho, Solaris) | 0.150522 | 0.134128 | 0.102832 | 0.683168 | 5.093399 | 0.082642 | 2.732908 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 108 | (5 Card Stud) | (Men in Black II) | 0.298063 | 0.333830 | 0.110283 | 0.370000 | 1.108348 | 0.010781 | 1.057413 |
| 571 | (Bang, Boom, Bang) | (The 39 Steps) | 0.260805 | 0.433681 | 0.125186 | 0.480000 | 1.106804 | 0.012080 | 1.089075 |
| 570 | (The 39 Steps) | (Bang, Boom, Bang) | 0.433681 | 0.260805 | 0.125186 | 0.288660 | 1.106804 | 0.012080 | 1.039159 |
| 1137 | (Sissi) | (License to Wed) | 0.317437 | 0.301043 | 0.102832 | 0.323944 | 1.076070 | 0.007269 | 1.033874 |
| 1136 | (License to Wed) | (Sissi) | 0.301043 | 0.317437 | 0.102832 | 0.341584 | 1.076070 | 0.007269 | 1.036675 |
75536 rows × 9 columns
all_antecedents = [list(x) for x in rules['antecedents'].values]
desired_indices = [i for i in range(len(all_antecedents)) if len(all_antecedents[i]) == 1 and all_antecedents[i][0] == 'Batman Returns']
apriori_recommendations =rules.iloc[desired_indices,].sort_values(by=['lift'],ascending=False)
apriori_recommendations.head()
| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 63981 | (Batman Returns) | (The Hours, Monsoon Wedding, Silent Hill, Rese... | 0.298063 | 0.107303 | 0.102832 | 0.345 | 3.215208 | 0.070849 | 1.362897 |
| 36084 | (Batman Returns) | (Reservoir Dogs, Wag the Dog, Silent Hill) | 0.298063 | 0.105812 | 0.101341 | 0.340 | 3.213239 | 0.069803 | 1.354830 |
| 63891 | (Batman Returns) | (Monsoon Wedding, Silent Hill, Reservoir Dogs,... | 0.298063 | 0.107303 | 0.101341 | 0.340 | 3.168611 | 0.069358 | 1.352572 |
| 63351 | (Batman Returns) | (Monsoon Wedding, Silent Hill, Reservoir Dogs,... | 0.298063 | 0.107303 | 0.101341 | 0.340 | 3.168611 | 0.069358 | 1.352572 |
| 36014 | (Batman Returns) | (The Hours, Reservoir Dogs, Silent Hill) | 0.298063 | 0.116244 | 0.108793 | 0.365 | 3.139936 | 0.074145 | 1.391741 |
apriori_recommendations_list = [list(x) for x in apriori_recommendations['consequents'].values]
print("Apriori Recommendations for movie: Batman Returns\n")
for i in range(5):
    print("{0}:{1} with lift of {2}".format(i+1, apriori_recommendations_list[i], apriori_recommendations.iloc[i,6]))
Apriori Recommendations for movie: Batman Returns
1:['The Hours', 'Monsoon Wedding', 'Silent Hill', 'Reservoir Dogs'] with lift of 3.215208333333333
2:['Reservoir Dogs', 'Wag the Dog', 'Silent Hill'] with lift of 3.2132394366197183
3:['Monsoon Wedding', 'Silent Hill', 'Reservoir Dogs', 'Sissi'] with lift of 3.168611111111111
4:['Monsoon Wedding', 'Silent Hill', 'Reservoir Dogs', 'Rain Man'] with lift of 3.168611111111111
5:['The Hours', 'Reservoir Dogs', 'Silent Hill'] with lift of 3.139935897435898
apriori_single_recommendations = apriori_recommendations.iloc[[x for x in range(len(apriori_recommendations_list)) if len(apriori_recommendations_list[x]) ==1],]
apriori_single_recommendations_list = [list(x) for x in apriori_single_recommendations['consequents'].values]
print("Apriori single-movie Recommendations for movie: Batman Returns\n")
for i in range(5):
    print("{0}: {1}, with lift of {2}".format(i+1,apriori_single_recommendations_list[i][0],apriori_single_recommendations.iloc[i,6]))
Apriori single-movie Recommendations for movie: Batman Returns
1: Reservoir Dogs, with lift of 2.6094444444444447
2: Ariel, with lift of 2.5397663551401872
3: Wag the Dog, with lift of 2.496744186046512
4: To Kill a Mockingbird, with lift of 2.478125
5: Romeo + Juliet, with lift of 2.4705000000000004
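The lookup steps above can be wrapped in a reusable helper. This is a sketch under stated assumptions: `recommend` is a hypothetical function name, and `toy_rules` is a hand-built frame that only mimics the `antecedents`, `consequents`, and `lift` columns of the real rules table. Using `head(top_n)` also avoids an IndexError when fewer than five rules match.

```python
import pandas as pd


def recommend(rules, title, top_n=5):
    """Single-movie consequents of rules whose antecedent is exactly {title}, best lift first."""
    target = frozenset([title])
    is_match = rules["antecedents"].apply(lambda s: s == target)
    is_single = rules["consequents"].apply(len) == 1
    picked = rules[is_match & is_single].sort_values("lift", ascending=False).head(top_n)
    return [(next(iter(c)), l) for c, l in zip(picked["consequents"], picked["lift"])]


# Hypothetical rules, for illustration only.
toy_rules = pd.DataFrame(
    {
        "antecedents": [frozenset(["A"]), frozenset(["A"]), frozenset(["A"]), frozenset(["B"])],
        "consequents": [frozenset(["B"]), frozenset(["C"]), frozenset(["B", "C"]), frozenset(["A"])],
        "lift": [1.5, 2.0, 3.0, 1.5],
    }
)
result = recommend(toy_rules, "A")
for rank, (movie, lift_value) in enumerate(result, start=1):
    print("{0}: {1}, with lift of {2}".format(rank, movie, lift_value))
```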
# Read ratings_small.csv for modeling
ratings_small_path = "./movie_dataset/ratings_small.csv"
ratings_small_df = pd.read_csv(ratings_small_path)
ratings_small_df.shape
(100004, 4)
ratings_small_df.head()
| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 31 | 2.5 | 1260759144 |
| 1 | 1 | 1029 | 3.0 | 1260759179 |
| 2 | 1 | 1061 | 3.0 | 1260759182 |
| 3 | 1 | 1129 | 2.0 | 1260759185 |
| 4 | 1 | 1172 | 4.0 | 1260759205 |
# The original movieId values are not consecutive integers starting at 0; re-index them so the user-item matrix is easier to build
movie_id = ratings_small_df['movieId'].drop_duplicates()
movie_id = pd.DataFrame(movie_id)
movie_id['movieid'] = range(len(movie_id))
movie_id
| | movieId | movieid |
|---|---|---|
| 0 | 31 | 0 |
| 1 | 1029 | 1 |
| 2 | 1061 | 2 |
| 3 | 1129 | 3 |
| 4 | 1172 | 4 |
| ... | ... | ... |
| 99131 | 64997 | 9061 |
| 99159 | 72380 | 9062 |
| 99274 | 129 | 9063 |
| 99678 | 4736 | 9064 |
| 99820 | 6425 | 9065 |
9066 rows × 2 columns
ratings_small_df = pd.merge(ratings_small_df, movie_id, on =['movieId'], how='left')
ratings_small_df
| | userId | movieId | rating | timestamp | movieid |
|---|---|---|---|---|---|
| 0 | 1 | 31 | 2.5 | 1260759144 | 0 |
| 1 | 1 | 1029 | 3.0 | 1260759179 | 1 |
| 2 | 1 | 1061 | 3.0 | 1260759182 | 2 |
| 3 | 1 | 1129 | 2.0 | 1260759185 | 3 |
| 4 | 1 | 1172 | 4.0 | 1260759205 | 4 |
| ... | ... | ... | ... | ... | ... |
| 99999 | 671 | 6268 | 2.5 | 1065579370 | 7005 |
| 100000 | 671 | 6269 | 4.0 | 1065149201 | 4771 |
| 100001 | 671 | 6365 | 4.0 | 1070940363 | 1329 |
| 100002 | 671 | 6385 | 2.5 | 1070979663 | 1331 |
| 100003 | 671 | 6565 | 3.5 | 1074784724 | 2946 |
100004 rows × 5 columns
ratings_small_df = ratings_small_df[['userId','movieid','rating','timestamp']] # keep the re-indexed movieid in place of the original movieId
ratings_small_df
| | userId | movieid | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 0 | 2.5 | 1260759144 |
| 1 | 1 | 1 | 3.0 | 1260759179 |
| 2 | 1 | 2 | 3.0 | 1260759182 |
| 3 | 1 | 3 | 2.0 | 1260759185 |
| 4 | 1 | 4 | 4.0 | 1260759205 |
| ... | ... | ... | ... | ... |
| 99999 | 671 | 7005 | 2.5 | 1065579370 |
| 100000 | 671 | 4771 | 4.0 | 1065149201 |
| 100001 | 671 | 1329 | 4.0 | 1070940363 |
| 100002 | 671 | 1331 | 2.5 | 1070979663 |
| 100003 | 671 | 2946 | 3.5 | 1074784724 |
100004 rows × 4 columns
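The re-indexing steps above (drop duplicates, number them, merge back) can be sketched in one call with `pd.factorize`, which assigns integer codes in order of first appearance. The toy frame below is an assumption for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Toy ratings; the movieId values are sparse, as in ratings_small.csv.
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2],
    "movieId": [31, 1029, 31, 6425],
})

# codes: consecutive 0-based ids in order of first appearance
# uniques: the original movieId for each code, for mapping back later
codes, uniques = pd.factorize(ratings["movieId"])
ratings["movieid"] = codes

print(ratings)
print(list(uniques))
```

Keeping `uniques` around makes it possible to translate a matrix column index back to the original `movieId` when presenting recommendations.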
# Count users and items
# unique() returns the distinct values of a column as a numpy.ndarray
# nunique() returns the number of distinct values
n_users = ratings_small_df.userId.nunique()
n_users
671
n_items = ratings_small_df.movieid.nunique()
n_items
9066
# Split the dataset
from sklearn.model_selection import train_test_split
# Split into 70% training data and 30% test data
train_data,test_data = train_test_split(ratings_small_df,test_size= 0.3)
train_data
| | userId | movieid | rating | timestamp |
|---|---|---|---|---|
| 69526 | 481 | 329 | 4.0 | 1437001087 |
| 41670 | 299 | 917 | 3.5 | 1344188856 |
| 49260 | 358 | 288 | 2.0 | 957480147 |
| 39317 | 287 | 3582 | 4.0 | 1470168974 |
| 35991 | 262 | 2094 | 3.0 | 1433899624 |
| ... | ... | ... | ... | ... |
| 6262 | 33 | 1095 | 2.0 | 1032769543 |
| 8504 | 56 | 367 | 2.0 | 1467005360 |
| 8540 | 56 | 1435 | 4.0 | 1467006577 |
| 77937 | 542 | 1496 | 1.0 | 1424966216 |
| 94226 | 624 | 476 | 3.0 | 1053249671 |
70002 rows × 4 columns
# User-item matrix for the training set
user_item_matrix = np.zeros((n_users,n_items))
user_item_matrix.shape
(671, 9066)
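# iterrows(): iterate over the DataFrame as (index, Series) pairs
# iteritems(): iterate over the DataFrame as (column name, Series) pairs (renamed to items() in newer pandas)
# itertuples(): iterate over rows as tuples; line[0] is the index, so line[1], line[2], line[3] are userId, movieid, rating
for line in train_data.itertuples():
    user_item_matrix[line[1]-1,line[2]]=line[3]  # userId is 1-based, hence the -1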
user_item_matrix
array([[0., 3., 3., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]])
user_item_matrix.shape
(671, 9066)
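The row-by-row fill above can also be sketched as a single vectorized assignment using NumPy integer-array indexing. The arrays below are toy stand-ins for the `userId`, `movieid`, and `rating` columns of `train_data` (an assumption for illustration); each (user, item) pair is assumed to appear at most once.

```python
import numpy as np

# Toy (userId, movieid, rating) triples; userId is 1-based, movieid is 0-based.
user_ids = np.array([1, 1, 2, 3])
movie_idx = np.array([0, 2, 1, 2])
ratings = np.array([3.0, 4.0, 2.5, 5.0])

n_users, n_items = 3, 3
matrix = np.zeros((n_users, n_items))

# Pairs of (row, column) indices are assigned in one step, no Python loop.
matrix[user_ids - 1, movie_idx] = ratings

print(matrix)
```

With the real data the same pattern would read the three columns via `.to_numpy()`, which avoids iterating over roughly 70,000 rows in Python.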
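# Build the user-user matrix using cosine distance
from sklearn.metrics.pairwise import pairwise_distances
# Note: pairwise_distances(metric='cosine') returns cosine distance (1 - cosine similarity), so 0 means the two rows point in the same direction
user_similarity_m = pairwise_distances(user_item_matrix,metric='cosine') # each user is one row, so no transpose is needed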
a=[[1,3],[2,2]]
a
[[1, 3], [2, 2]]
pairwise_distances(a,metric='euclidean')
array([[0. , 1.41421356],
[1.41421356, 0. ]])
b = np.array([[1,2],[1,3],[2,1]])
b
array([[1, 2],
[1, 3],
[2, 1]])
pairwise_distances(b,metric='euclidean') # row 0, column 1 of the result is the distance between b[0] and b[1]
array([[0. , 1. , 1.41421356],
[1. , 0. , 2.23606798],
[1.41421356, 2.23606798, 0. ]])
pairwise_distances(b,metric='cosine')
array([[0. , 0.01005051, 0.2 ],
[0.01005051, 0. , 0.29289322],
[0.2 , 0.29289322, 0. ]])
b.shape
(3, 2)
b[1]
array([1, 3])
b[0]
array([1, 2])
user_similarity_m.shape
(671, 671)
user_similarity_m[0:5,0:5].round(2)
array([[0. , 1. , 1. , 0.94, 0.97],
[1. , 0. , 0.89, 0.93, 0.92],
[1. , 0.89, 0. , 0.93, 0.93],
[0.94, 0.93, 0.93, 0. , 0.94],
[0.97, 0.92, 0.93, 0.94, 0. ]])
user_similarity_m_triu = np.triu(user_similarity_m,k=1) # keep only the upper triangle (above the diagonal)
np.round(user_similarity_m_triu[user_similarity_m_triu.nonzero()],3)
array([1. , 1. , 0.938, ..., 0.934, 0.919, 0.814])
user_sim_nonzero = np.round(user_similarity_m_triu[user_similarity_m_triu.nonzero()],3)
np.percentile(user_sim_nonzero,np.arange(0,101,10))
array([0.316, 0.844, 0.885, 0.911, 0.93 , 0.947, 0.961, 0.976, 1. ,
1. , 1. ])
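The zero diagonal and the many values near 1 above are what cosine distance looks like: identical rows have distance 0, unrelated rows have distance 1. A minimal sketch of the relationship, using only NumPy and the small array `b` from the cells above (the helper `cosine_similarity_matrix` is a hypothetical name, not a library function):

```python
import numpy as np

def cosine_similarity_matrix(x):
    """Pairwise cosine similarity between the rows of x."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows as zeros instead of dividing by 0
    unit = x / norms
    return unit @ unit.T

b = np.array([[1, 2], [1, 3], [2, 1]])
sim = cosine_similarity_matrix(b)
dist = 1.0 - sim  # matches what metric='cosine' reports for these rows

print(np.round(sim, 4))
print(np.round(dist, 4))
```

When the matrix is used as neighbor weights in a prediction, the similarity form (`1 - distance`) is the one in which more alike users receive larger weights.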
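# Per-user mean over all 9066 columns; unrated entries count as 0, so these means fall far below the 1-5 rating scale
mean_user_rating = user_item_matrix.mean(axis=1)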
mean_user_rating
array([0.00297816, 0.0198544 , 0.01301566, 0.06265167, 0.03027796,
0.01196779, 0.02404589, 0.03805427, 0.01114053, 0.0147805 ,
0.01047871, 0.01301566, 0.01615928, 0.004743 , 0.33984116,
0.01069932, 0.0991617 , 0.0147805 , 0.11780278, 0.02294286,
0.0443415 , 0.04936025, 0.21111846, 0.00683874, 0.00617692,
0.04031546, 0.00816236, 0.01555261, 0.00363997, 0.29643724,
0.02625193, 0.01080962, 0.03684094, 0.05702625, 0.0025921 ,
0.03000221, 0.01147143, 0.03838518, 0.0196338 , 0.01301566,
0.0592323 , 0.0196338 , 0.02172954, 0.00694904, 0.00512905,
0.01312597, 0.01069932, 0.14212442, 0.02371498, 0.01169204,
0.01069932, 0.01941319, 0.00893448, 0.01384293, 0.00838297,
0.14615045, 0.06254136, 0.01753805, 0.02090227, 0.02018531,
0.04197 , 0.0172623 , 0.02454225, 0.00739025, 0.00694904,
0.01544231, 0.02856828, 0.03331127, 0.02327377, 0.02856828,
0.00794176, 0.05035297, 0.42096845, 0.01544231, 0.0394882 ,
0.00573572, 0.07346128, 0.08234061, 0.00921024, 0.0100375 ,
0.05283477, 0.01235385, 0.04555482, 0.03424884, 0.0247077 ,
0.05084933, 0.00650783, 0.06281712, 0.02448709, 0.01577322,
0.04671299, 0.02779616, 0.0444518 , 0.0497463 , 0.08769027,
0.02288771, 0.03160159, 0.02332892, 0.0497463 , 0.00661813,
0.0173726 , 0.1978822 , 0.02437679, 0.02360468, 0.13942202,
0.01411869, 0.00656298, 0.00783146, 0.00628723, 0.04015001,
0.09541143, 0.00650783, 0.00959629, 0.00827267, 0.01351202,
0.00667328, 0.01389808, 0.05592323, 0.17185087, 0.03833002,
0.02503861, 0.00882418, 0.00937569, 0.02415619, 0.0666777 ,
0.01808957, 0.00573572, 0.09695566, 0.00595632, 0.09254357,
0.01433929, 0.0297816 , 0.03518641, 0.10065078, 0.00529451,
0.01637988, 0.02018531, 0.02024046, 0.02090227, 0.01213325,
0.00816236, 0.01103022, 0.02365983, 0.01158173, 0.01455989,
0.02134348, 0.01312597, 0.04081182, 0.06121774, 0.09557688,
0.01654533, 0.05664019, 0.01698654, 0.00926539, 0.01279506,
0.01604897, 0.08697331, 0.00562541, 0.04070152, 0.03132583,
0.02702405, 0.00915508, 0.02007501, 0.02421134, 0.10870285,
0.01422899, 0.00761085, 0.02790646, 0.03623428, 0.00590117,
0.01588352, 0.0051842 , 0.01169204, 0.00639753, 0.03551732,
0.05834988, 0.07059343, 0.03259431, 0.01080962, 0.00452239,
0.01108537, 0.04301787, 0.01433929, 0.01125083, 0.05625414,
0.0099272 , 0.1014229 , 0.02867858, 0.03320097, 0.02150893,
0.00739025, 0.01687624, 0.01886168, 0.01367748, 0.10533863,
0.02702405, 0.02051621, 0.01979925, 0.12221487, 0.05846018,
0.03910214, 0.01831017, 0.01086477, 0.00871388, 0.05542687,
0.00948599, 0.00672844, 0.01604897, 0.00452239, 0.00683874,
0.01610413, 0.2176263 , 0.18409442, 0.06254136, 0.01753805,
0.02465255, 0.03430399, 0.01433929, 0.03855063, 0.07842488,
0.00501875, 0.02658284, 0.01158173, 0.02625193, 0.00827267,
0.00921024, 0.00750055, 0.02349437, 0.00987205, 0.03347673,
0.00937569, 0.20416942, 0.00871388, 0.03160159, 0.04858813,
0.06507831, 0.01075447, 0.02432164, 0.0843812 , 0.07246856,
0.0147805 , 0.14328259, 0.08151335, 0.02195014, 0.04268696,
0.00739025, 0.07677035, 0.03595853, 0.0051842 , 0.05581293,
0.03628943, 0.01147143, 0.05840503, 0.03739246, 0.04252151,
0.00650783, 0.02768586, 0.01025811, 0.00926539, 0.01235385,
0.01047871, 0.13710567, 0.03000221, 0.01091992, 0.054379 ,
0.00959629, 0.01158173, 0.10324289, 0.00628723, 0.06176925,
0.02029561, 0.01158173, 0.0296713 , 0.01114053, 0.06585043,
0.00479815, 0.01808957, 0.0147805 , 0.01136113, 0.00838297,
0.02509376, 0.03524156, 0.05333113, 0.01433929, 0.10302228,
0.00915508, 0.0893448 , 0.02029561, 0.0049636 , 0.02090227,
0.02553497, 0.08018972, 0.02217075, 0.25672844, 0.06849768,
0.00634238, 0.03662034, 0.02647253, 0.11763733, 0.01389808,
0.00551511, 0.00750055, 0.06243106, 0.03309067, 0.00595632,
0.16997573, 0.02029561, 0.0148908 , 0.04594088, 0.00468784,
0.23830796, 0.07290977, 0.08112729, 0.01169204, 0.01246415,
0.03524156, 0.00573572, 0.01588352, 0.00595632, 0.01571807,
0.02283256, 0.01323627, 0.00700419, 0.04180454, 0.00446724,
0.00783146, 0.02073682, 0.04649239, 0.00584602, 0.02680344,
0.00689389, 0.00816236, 0.02503861, 0.01086477, 0.007666 ,
0.00816236, 0.00330907, 0.01323627, 0.02950585, 0.01384293,
0.00595632, 0.05868079, 0.01114053, 0.04500331, 0.0619347 ,
0.09055813, 0.00650783, 0.01621443, 0.00639753, 0.0495257 ,
0.01378778, 0.02443194, 0.1039047 , 0.01544231, 0.09039268,
0.00419148, 0.00948599, 0.15243768, 0.01483565, 0.0098169 ,
0.01533201, 0.03071917, 0.05404809, 0.00909993, 0.0224465 ,
0.0097066 , 0.05217295, 0.00628723, 0.01345687, 0.03055372,
0.0446724 , 0.00849327, 0.06165895, 0.00838297, 0.00705934,
0.01808957, 0.00645268, 0.03750276, 0.01990955, 0.28375248,
0.02945069, 0.07654975, 0.01544231, 0.11973307, 0.03132583,
0.02691374, 0.09276417, 0.22865652, 0.01246415, 0.03430399,
0.02923009, 0.00617692, 0.0125193 , 0.04511361, 0.00683874,
0.03540702, 0.01632473, 0.01544231, 0.00595632, 0.01676594,
0.024818 , 0.09303993, 0.00783146, 0.0098169 , 0.11675491,
0.0270792 , 0.10699316, 0.05978381, 0.01566292, 0.00799691,
0.00882418, 0.05129054, 0.00650783, 0.01698654, 0.00893448,
0.02724465, 0.04114273, 0.0494154 , 0.01643503, 0.02823737,
0.0101478 , 0.0296713 , 0.09458416, 0.00799691, 0.01588352,
0.06507831, 0.09458416, 0.04560997, 0.00457754, 0.09618354,
0.09303993, 0.02013016, 0.06221046, 0.05382749, 0.00606662,
0.02161924, 0.00683874, 0.00612177, 0.05779837, 0.01367748,
0.03568277, 0.07572248, 0.01775866, 0.00441209, 0.00540481,
0.00904478, 0.01808957, 0.00639753, 0.00871388, 0.03943305,
0.01599382, 0.33085153, 0.02294286, 0.0101478 , 0.00821752,
0.01660049, 0.14179351, 0.02272226, 0.00705934, 0.08283697,
0.15784249, 0.0121884 , 0.13335539, 0.01058901, 0.01119568,
0.0593426 , 0.02095742, 0.30228326, 0.0048533 , 0.01869623,
0.0569711 , 0.24652548, 0.02614163, 0.01301566, 0.14284139,
0.01114053, 0.00490845, 0.02774101, 0.03132583, 0.1185749 ,
0.1435032 , 0.01819987, 0.03259431, 0.00573572, 0.004743 ,
0.0398191 , 0.04037062, 0.01781381, 0.00672844, 0.0051842 ,
0.01875138, 0.01941319, 0.02923009, 0.02415619, 0.00617692,
0.03309067, 0.03419369, 0.0048533 , 0.01235385, 0.05741231,
0.05658504, 0.03353188, 0.01334657, 0.004743 , 0.09927201,
0.0051842 , 0.01125083, 0.01334657, 0.2351092 , 0.04367968,
0.00948599, 0.00921024, 0.00584602, 0.1037944 , 0.00876903,
0.03805427, 0.01411869, 0.19170527, 0.05619899, 0.03987426,
0.01384293, 0.06083168, 0.04003971, 0.01968895, 0.03992941,
0.00777631, 0.03171189, 0.03325612, 0.16804544, 0.02062652,
0.03298037, 0.01384293, 0.0394882 , 0.08030002, 0.01378778,
0.03011251, 0.10070593, 0.00739025, 0.01058901, 0.00551511,
0.00683874, 0.01704169, 0.01544231, 0.09265387, 0.02713435,
0.02178469, 0.63484447, 0.03562762, 0.00623208, 0.03353188,
0.02360468, 0.00783146, 0.06358923, 0.01511141, 0.01831017,
0.00959629, 0.01329142, 0.07224796, 0.04378998, 0.03253916,
0.07798368, 0.07026252, 0.04616148, 0.52404589, 0.00871388,
0.00777631, 0.01147143, 0.01180234, 0.02283256, 0.03634458,
0.01577322, 0.02950585, 0.0101478 , 0.09022722, 0.14284139,
0.01125083, 0.0917163 , 0.00805206, 0.00209574, 0.22887712,
0.00595632, 0.03502096, 0.00821752, 0.06072138, 0.09728657,
0.0150011 , 0.15938672, 0.01400838, 0.01047871, 0.02228105,
0.00849327, 0.03904699, 0.02128833, 0.02514891, 0.05118023,
0.14399956, 0.06243106, 0.07842488, 0.05757776, 0.01119568,
0.01268476, 0.03926759, 0.03617913, 0.00330907, 0.11096404,
0.0196338 , 0.12618575, 0.08879329, 0.02283256, 0.01913744,
0.01080962, 0.01742775, 0.01560777, 0.02889918, 0.10225017,
0.01069932, 0.01764836, 0.0100375 , 0.01257445, 0.04086698,
0.02614163, 0.01185749, 0.03105008, 0.39383411, 0.02079197,
0.04290757, 0.04500331, 0.0223362 , 0.00959629, 0.0075557 ,
0.00937569, 0.01185749, 0.00772116, 0.00534966, 0.00750055,
0.00739025, 0.00976175, 0.004743 , 0.01455989, 0.01191264,
0.04059122, 0.01169204, 0.00490845, 0.01125083, 0.007666 ,
0.05834988, 0.05162144, 0.07715641, 0.0245974 , 0.00827267,
0.00595632, 0.08509817, 0.01753805, 0.20257004, 0.03353188,
0.0445621 , 0.00419148, 0.01952349, 0.03827487, 0.02950585,
0.00843812, 0.01742775, 0.00871388, 0.15927642, 0.1088683 ,
0.00816236, 0.01687624, 0.00739025, 0.0098169 , 0.00716964,
0.0347452 ])
rating_diff = (user_item_matrix - mean_user_rating[:,np.newaxis]) # np.newaxis adds an axis to mean_user_rating so it broadcasts across columns in the subtraction
rating_diff
array([[-2.97816016e-03, 2.99702184e+00, 2.99702184e+00, ...,
-2.97816016e-03, -2.97816016e-03, -2.97816016e-03],
[-1.98544011e-02, -1.98544011e-02, -1.98544011e-02, ...,
-1.98544011e-02, -1.98544011e-02, -1.98544011e-02],
[-1.30156629e-02, -1.30156629e-02, -1.30156629e-02, ...,
-1.30156629e-02, -1.30156629e-02, -1.30156629e-02],
...,
[-9.81689830e-03, -9.81689830e-03, -9.81689830e-03, ...,
-9.81689830e-03, -9.81689830e-03, -9.81689830e-03],
[-7.16964483e-03, -7.16964483e-03, -7.16964483e-03, ...,
-7.16964483e-03, -7.16964483e-03, -7.16964483e-03],
[-3.47452019e-02, -3.47452019e-02, -3.47452019e-02, ...,
-3.47452019e-02, -3.47452019e-02, -3.47452019e-02]])
user_prediction = mean_user_rating[:,np.newaxis] + user_similarity_m.dot(rating_diff) / np.array([np.abs(user_similarity_m).sum(axis=1)]).T
# Dividing by np.array([np.abs(user_similarity_m).sum(axis=1)]).T normalizes the weighted sum by the total weight; the intent is to keep predictions on the 1-5 rating scale
user_prediction
array([[ 8.48587738e-02, 1.11549860e-01, 7.78496257e-02, ...,
-3.30873704e-02, -3.59785123e-02, -3.59132569e-02],
[ 9.36489784e-02, 1.35396758e-01, 1.04357090e-01, ...,
-1.62815182e-02, -1.93136443e-02, -1.93247190e-02],
[ 9.44428457e-02, 1.33314515e-01, 9.83052575e-02, ...,
-2.28228892e-02, -2.58037344e-02, -2.59258365e-02],
...,
[ 9.29750987e-02, 1.27902780e-01, 9.32275326e-02, ...,
-2.60694824e-02, -2.89101875e-02, -2.87905826e-02],
[ 8.62056229e-02, 1.26697599e-01, 9.17810994e-02, ...,
-2.88942031e-02, -3.19119828e-02, -3.20590645e-02],
[ 1.17342284e-01, 1.50739909e-01, 1.17908253e-01, ...,
-7.69495365e-05, -2.99819315e-03, -3.02101562e-03]])
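The prediction step above follows the usual mean-centered weighting: each user's mean, plus the weighted average of the other users' deviations from their own means. Below is a minimal sketch of that formula on a toy matrix. The function name `predict_user_based` is hypothetical, and the sketch assumes the weights are similarities (larger means more alike) and that every user has rated every item, which the notebook's data does not satisfy.

```python
import numpy as np

def predict_user_based(ratings, similarity):
    """mean_u + sum_v sim(u, v) * (r_v - mean_v) / sum_v |sim(u, v)|"""
    mean = ratings.mean(axis=1, keepdims=True)           # per-user mean, shape (n_users, 1)
    diff = ratings - mean                                 # deviation from each user's mean
    total_weight = np.abs(similarity).sum(axis=1, keepdims=True)
    return mean + (similarity @ diff) / total_weight

# Two users with identical tastes and full similarity to each other.
ratings = np.array([[4.0, 2.0],
                    [4.0, 2.0]])
similarity = np.ones((2, 2))

pred = predict_user_based(ratings, similarity)
print(pred)
```

In this toy case the predictions reproduce the ratings exactly, which is a quick sanity check that the weights and the normalization are aligned along the right axes.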
# Evaluate only on the entries of the (training) user-item matrix that have a rating
from sklearn.metrics import mean_squared_error
from math import sqrt
prediction_flatten = user_prediction[user_item_matrix.nonzero()]
prediction_flatten
array([0.11154986, 0.07784963, 0.14877094, ..., 0.04236321, 0.01114962,
0.02448394])
user_item_matrix_flatten = user_item_matrix[user_item_matrix.nonzero()]
user_item_matrix_flatten
array([3., 3., 2., ..., 4., 4., 4.])
error_test = sqrt(mean_squared_error(prediction_flatten,user_item_matrix_flatten)) # root-mean-square error (RMSE)
error_test
3.390138302832629
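The RMSE above is computed against the same training matrix that produced the predictions. A held-out estimate would compare the predictions with a matrix built from `test_data` in the same way. The following is a minimal sketch of the masked RMSE itself; `rmse_on_rated` is a hypothetical helper and the two small arrays are invented for illustration.

```python
import numpy as np

def rmse_on_rated(prediction, actual):
    """RMSE over the entries of `actual` that carry a rating (non-zero)."""
    mask = actual != 0
    return float(np.sqrt(np.mean((prediction[mask] - actual[mask]) ** 2)))

# Toy held-out matrix: 0 marks "no rating in the test split".
actual = np.array([[4.0, 0.0],
                   [0.0, 2.0]])
prediction = np.array([[3.0, 1.0],
                       [5.0, 4.0]])

score = rmse_on_rated(prediction, actual)
print(score)
```

Only the two rated cells contribute here (errors of 1 and 2), so predictions for unrated cells do not affect the score.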