1 Understanding the Data
The data comes from Kaggle and consists of three files:
- movies.dat
- ratings.dat
- users.dat
movies.dat
It contains three fields: ['Movie ID', 'Movie Title', 'Genre']
Import this file with pandas:
import pandas as pd
movies = pd.read_csv('./data/movietweetings/movies.dat', delimiter='::', engine='python', header=None, names=['Movie ID', 'Movie Title', 'Genre'])
After importing, display the first 10 rows:
Movie ID Movie Title \
0 8 Edison Kinetoscopic Record of a Sneeze (1894)
1 10 La sortie des usines Lumière (1895)
2 12 The Arrival of a Train (1896)
3 25 The Oxford and Cambridge University Boat Race ...
4 91 Le manoir du diable (1896)
5 131 Une nuit terrible (1896)
6 417 Le voyage dans la lune (1902)
7 439 The Great Train Robbery (1903)
8 443 Hiawatha, the Messiah of the Ojibway (1903)
9 628 The Adventures of Dollie (1908)
Genre
0 Documentary|Short
1 Documentary|Short
2 Documentary|Short
3 NaN
4 Short|Horror
5 Short|Comedy|Horror
6 Short|Action|Adventure|Comedy|Fantasy|Sci-Fi
7 Short|Action|Crime|Western
8 NaN
9 Action|Short
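The output above shows two things about the Genre column: several genres are packed into one string separated by `|`, and some movies have no genre at all (NaN). The following is a minimal sketch of how one might inspect both, assuming the column layout shown above. The small `movies` frame is built from three of the displayed rows so the example runs on its own; the names `missing_genres`, `genre_rows`, and `genre_counts` are illustrative only.

```python
import pandas as pd

# Three rows copied from the output above; in practice `movies`
# would be the DataFrame returned by pd.read_csv.
movies = pd.DataFrame({
    'Movie ID': [8, 25, 91],
    'Movie Title': [
        'Edison Kinetoscopic Record of a Sneeze (1894)',
        'The Oxford and Cambridge University Boat Race ...',
        'Le manoir du diable (1896)',
    ],
    'Genre': ['Documentary|Short', None, 'Short|Horror'],
})

# Number of movies with no genre information.
missing_genres = int(movies['Genre'].isna().sum())

# Drop rows without a genre, split the '|'-separated string into a list,
# then expand to one row per (movie, genre) pair.
genre_rows = (
    movies.dropna(subset=['Genre'])
          .assign(Genre=lambda df: df['Genre'].str.split('|'))
          .explode('Genre')
)

# How many movies carry each genre label.
genre_counts = genre_rows['Genre'].value_counts()

print(missing_genres)
print(genre_counts)
```

Keeping one row per movie and genre pair makes later grouping by genre straightforward, at the cost of repeating the other columns.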
Next, import the other two data files in turn.
users.dat:
users = pd.read_csv('./data/movietweetings/users.dat', delimiter='::', engine='python', header=None, names=['User ID', 'Twitter ID'])
print(users.head(10))
Result:
User ID Twitter ID
0 1 397291295
1 2 40501255
2 3 417333257
3 4 138805259
4 5 2452094989
5 6 391774225
6 7 47317010
7 8 84541461
8 9 2445803544
9 10 995885060
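Since User ID is the key that links users to ratings, it can be worth checking that it is unique, and that the Twitter IDs are stored in a wide enough integer type (some of the values shown above exceed the 32-bit signed range). This is a rough sketch under the assumption that both columns are parsed as integers; the five sample rows are copied from the output above and the variable names are illustrative.

```python
import pandas as pd

# Five rows copied from the output above; in practice `users`
# would be the DataFrame returned by pd.read_csv.
users = pd.DataFrame({
    'User ID': [1, 2, 3, 4, 5],
    'Twitter ID': [397291295, 40501255, 417333257, 138805259, 2452094989],
})

# Each user should appear exactly once.
user_ids_unique = users['User ID'].is_unique
twitter_ids_unique = users['Twitter ID'].is_unique

# 2**31 - 1 is the largest 32-bit signed integer; larger IDs need int64.
fits_in_int32 = bool(users['Twitter ID'].max() <= 2**31 - 1)

print(user_ids_unique, twitter_ids_unique, fits_in_int32)
print(users.dtypes)
```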
ratings.dat:
ratings = pd.read_csv('./data/movietweetings/ratings.dat', delimiter='::', engine='python', header=None, names=['User ID', 'Movie ID', 'Rating', 'Rating Timestamp'])
print(ratings.head(10))
Result:
User ID Movie ID Rating Rating Timestamp
0 1 111161 10 1373234211
1 1 117060 7 1373415231
2 1 120755 6 1373424360
3 1 317919 6 1373495763
4 1 454876 10 1373621125
5 1 790724 8 1374641320
6 1 882977 8 1372898763
7 1 1229238 9 1373506523
8 1 1288558 5 1373154354
9 1 1300854 8 1377165712
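The Rating Timestamp values above look like Unix timestamps in seconds; that is an assumption, not something the data files state. Under that assumption, a short sketch of converting them to readable datetimes and summarising ratings per user could look like this. The ten rows are copied from the output above, and the column name `Rated At` and the variable `per_user` are introduced here purely for illustration.

```python
import pandas as pd

# Ten rows copied from the output above; in practice `ratings`
# would be the DataFrame returned by pd.read_csv.
ratings = pd.DataFrame({
    'User ID': [1] * 10,
    'Movie ID': [111161, 117060, 120755, 317919, 454876,
                 790724, 882977, 1229238, 1288558, 1300854],
    'Rating': [10, 7, 6, 6, 10, 8, 8, 9, 5, 8],
    'Rating Timestamp': [1373234211, 1373415231, 1373424360, 1373495763,
                         1373621125, 1374641320, 1372898763, 1373506523,
                         1373154354, 1377165712],
})

# Assumption: timestamps are seconds since the Unix epoch.
ratings['Rated At'] = pd.to_datetime(ratings['Rating Timestamp'], unit='s')

# Number of ratings and mean rating for each user.
per_user = ratings.groupby('User ID')['Rating'].agg(['count', 'mean'])

print(ratings[['Movie ID', 'Rating', 'Rated At']].head())
print(per_user)
```

If the converted dates land in an implausible year, the unit assumption is probably wrong and should be revisited.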