1 了解数据

数据来自kaggle,共包括三个文件:

  1. movies.dat
  2. ratings.dat
  3. users.dat

movies.dat包括三个字段:[‘Movie ID’, ‘Movie Title’, ‘Genre’]

使用pandas导入此文件:

import pandas as pd

movies = pd.read_csv('./data/movietweetings/movies.dat', delimiter='::', engine='python', header=None, names = ['Movie ID', 'Movie Title', 'Genre'])

导入后,显示前5行:

   Movie ID                                        Movie Title  \
0         8      Edison Kinetoscopic Record of a Sneeze (1894)
1        10                La sortie des usines Lumi猫re (1895)
2        12                      The Arrival of a Train (1896)
3        25  The Oxford and Cambridge University Boat Race ...
4        91                         Le manoir du diable (1896)
5       131                           Une nuit terrible (1896)
6       417                      Le voyage dans la lune (1902)
7       439                     The Great Train Robbery (1903)
8       443        Hiawatha, the Messiah of the Ojibway (1903)
9       628                    The Adventures of Dollie (1908)
                                          Genre
0                             Documentary|Short
1                             Documentary|Short
2                             Documentary|Short
3                                           NaN
4                                  Short|Horror
5                           Short|Comedy|Horror
6  Short|Action|Adventure|Comedy|Fantasy|Sci-Fi
7                    Short|Action|Crime|Western
8                                           NaN
9                                  Action|Short

次导入其他两个数据文件

users.dat:

users = pd.read_csv('./data/movietweetings/users.dat', delimiter='::', engine='python', header=None, names = ['User ID', 'Twitter ID'])
print(users.head())

结果:

   User ID  Twitter ID
0        1   397291295
1        2    40501255
2        3   417333257
3        4   138805259
4        5  2452094989
5        6   391774225
6        7    47317010
7        8    84541461
8        9  2445803544
9       10   995885060

rating.data:

ratings = pd.read_csv('./data/movietweetings/ratings.dat', delimiter='::', engine='python', header=None, names = ['User ID', 'Movie ID', 'Rating', 'Rating Timestamp'])
print(ratings.head())

结果:

   User ID  Movie ID  Rating  Rating Timestamp
0        1    111161      10        1373234211
1        1    117060       7        1373415231
2        1    120755       6        1373424360
3        1    317919       6        1373495763
4        1    454876      10        1373621125
5        1    790724       8        1374641320
6        1    882977       8        1372898763
7        1   1229238       9        1373506523
8        1   1288558       5        1373154354
9        1   1300854       8        1377165712