1 Understanding the data

The data comes from Kaggle and consists of three files:

movies.dat
ratings.dat
users.dat

movies.dat has three fields: ['Movie ID', 'Movie Title', 'Genre']

Import this file with pandas:

import pandas as pd

movies = pd.read_csv('./data/movietweetings/movies.dat', delimiter='::', engine='python', header=None, names = ['Movie ID', 'Movie Title', 'Genre'])

After the import, movies.head(10) shows the first 10 rows:

Movie ID  Movie Title                                        Genre
       8  Edison Kinetoscopic Record of a Sneeze (1894)      Documentary|Short
      10  La sortie des usines Lumière (1895)                Documentary|Short
      12  The Arrival of a Train (1896)                      Documentary|Short
      25  The Oxford and Cambridge University Boat Race ...  NaN
      91  Le manoir du diable (1896)                         Short|Horror
     131  Une nuit terrible (1896)                           Short|Comedy|Horror
     417  Le voyage dans la lune (1902)                      Short|Action|Adventure|Comedy|Fantasy|Sci-Fi
     439  The Great Train Robbery (1903)                     Short|Action|Crime|Western
     443  Hiawatha, the Messiah of the Ojibway (1903)        NaN
     628  The Adventures of Dollie (1908)                    Action|Short
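The '::' separator is longer than one character, so pandas treats it as a regular expression and needs engine='python' to parse it. The sketch below reproduces the import on three lines typed in from the rows above (the in-memory sample and the names movies_demo and genre_counts are illustrative assumptions, not part of the dataset), then counts missing genres and splits the '|'-joined Genre strings.

```python
import io

import pandas as pd

# Three lines in the movies.dat layout, typed in from the rows shown above.
# The last line has an empty genre field, which pandas reads as NaN.
sample = io.StringIO(
    "8::Edison Kinetoscopic Record of a Sneeze (1894)::Documentary|Short\n"
    "131::Une nuit terrible (1896)::Short|Comedy|Horror\n"
    "443::Hiawatha, the Messiah of the Ojibway (1903)::\n"
)

# A multi-character separator is handled as a regex, hence engine='python'.
movies_demo = pd.read_csv(
    sample,
    sep='::',
    engine='python',
    header=None,
    names=['Movie ID', 'Movie Title', 'Genre'],
)

# How many movies have no genre at all?
missing_genres = movies_demo['Genre'].isna().sum()

# Split 'A|B|C' into one row per genre and count each label (NaN is skipped).
genre_counts = movies_demo['Genre'].str.split('|').explode().value_counts()

print(missing_genres)
print(genre_counts)
```

If accented titles come out garbled when reading the real file, passing encoding='utf-8' to read_csv may be worth trying; that is a guess about the file's encoding, not something the dataset description confirms.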

Next, import the other two data files in turn.

users.dat:

users = pd.read_csv('./data/movietweetings/users.dat', delimiter='::', engine='python', header=None, names = ['User ID', 'Twitter ID'])
print(users.head(10))

Output:

   User ID  Twitter ID
      1   397291295
      2    40501255
      3   417333257
      4   138805259
      5  2452094989
      6   391774225
      7    47317010
      8    84541461
      9  2445803544
     10   995885060
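All files in this dataset are read with the same separator and without a header row, so the repeated read_csv arguments could be wrapped in a small helper. The function name read_dat and the temporary demo file below are hypothetical, introduced only for illustration.

```python
import os
import tempfile

import pandas as pd


def read_dat(path, names):
    """Read a '::'-separated file without a header row into a DataFrame."""
    return pd.read_csv(path, sep='::', engine='python', header=None, names=names)


# Demonstration on a temporary file holding the first two users.dat rows shown above.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'users.dat')
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write('1::397291295\n2::40501255\n')
    demo_users = read_dat(demo_path, ['User ID', 'Twitter ID'])

print(demo_users)
```

With the real files, the same helper would be called once per file, passing the matching list of column names.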

ratings.dat:

ratings = pd.read_csv('./data/movietweetings/ratings.dat', delimiter='::', engine='python', header=None, names = ['User ID', 'Movie ID', 'Rating', 'Rating Timestamp'])
print(ratings.head(10))

Output:

   User ID  Movie ID  Rating  Rating Timestamp
      1    111161      10        1373234211
      1    117060       7        1373415231
      1    120755       6        1373424360
      1    317919       6        1373495763
      1    454876      10        1373621125
      1    790724       8        1374641320
      1    882977       8        1372898763
      1   1229238       9        1373506523
      1   1288558       5        1373154354
      1   1300854       8        1377165712
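The Rating Timestamp values look like Unix epoch seconds; that is an assumption based on their magnitude, not something stated with the data. Under that assumption, a minimal sketch for turning them into readable dates and checking the rating range could look like this (the column name 'Rated At' and the hand-typed ratings_demo frame are illustrative):

```python
import pandas as pd

# The first three ratings.dat rows shown above, typed in by hand.
ratings_demo = pd.DataFrame(
    {
        'User ID': [1, 1, 1],
        'Movie ID': [111161, 117060, 120755],
        'Rating': [10, 7, 6],
        'Rating Timestamp': [1373234211, 1373415231, 1373424360],
    }
)

# Assumption: the timestamps are seconds since 1970-01-01 UTC.
ratings_demo['Rated At'] = pd.to_datetime(ratings_demo['Rating Timestamp'], unit='s')

print(ratings_demo[['Movie ID', 'Rating', 'Rated At']])

# Quick sanity check of the rating scale in this sample.
print(ratings_demo['Rating'].agg(['min', 'max', 'mean']))
```

Applied to the full ratings DataFrame, the same two steps give a first impression of when the ratings were made and which scale they use.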