Toptal authors are vetted experts in their fields and write on topics in which they have demonstrated experience. All of our content is peer reviewed and validated by Toptal experts in the same field.

Juan Manuel Ortiz de Zarate

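Juan (MSc, computer science) is a PhD student in data science/artificial intelligence. A senior web developer, his main areas of expertise include R, Python, and PHP.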

Previously At

University of Buenos Aires

Movie watchers sometimes use rankings to choose what to watch. Once, doing this myself, I noticed that many of the best-ranked movies belonged to the same genre: drama. This made me think that the ranking could have some kind of genre bias.

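I was on one of the most popular sites among movie lovers, IMDb, which covers movies from all over the world and from any year. Its famous ranking is based on a huge collection of reviews. For this IMDb data analysis, I decided to download all the information available there to analyze it and try to create a new, refined ranking that would consider a wider range of criteria.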

The IMDb Rating System: Filtering IMDb's Data

I was able to download information on 242,528 movies released between 1970 and 2019 inclusive. The information IMDb gave me for each one was: Rank, Title, ID, Year, Certificate, Rating, Votes, Metascore, Synopsis, Runtime, Genre, Gross, and SearchYear.

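To have enough information to analyze, I needed a minimum number of reviews per movie, so the first thing I did was filter out movies with fewer than 500 reviews. This resulted in a set of 33,296 movies, and in the next table, we can see a summary analysis of its fields: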

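| Field | Type | Null Count | Mean | Median |
|---|---|---|---|---|
| Rank | Factor | 0 | | |
| Title | Factor | 0 | | |
| ID | Factor | 0 | | |
| Year | Int | 0 | 2003 | 2006 |
| Certificate | Factor | 17587 | | |
| Rating | Int | 0 | 6.1 | 6.3 |
| Votes | Int | 0 | 21040 | 2017 |
| Metascore | Int | 22350 | 55.3 | 56 |
| Synopsis | Factor | 0 | | |
| Runtime | Int | 132 | 104.9 | 100 |
| Genre | Factor | 0 | | |
| Gross | Factor | 21415 | | |
| SearchYear | Int | 0 | 2003 | 2006 |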

Note: In R, Factor refers to strings. Rank and Gross are stored that way in the original IMDb dataset because they contain, for example, thousands separators.

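Before starting to refine the score, I had to analyze this dataset further. For starters, the fields Certificate, Metascore, and Gross had more than 50% null values, so they weren't useful. Rank depends intrinsically on Rating (the variable to refine); therefore, it doesn't carry any useful information. The same is true of ID, since it is a unique identifier for each movie.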

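Finally, Title and Synopsis are short text fields. It could be possible to use them through some NLP technique, but because the amount of text is limited, I decided not to take them into account for this task.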
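The "more than 50% null values" claim can be checked directly from the summary table. The following is a minimal Python sketch (not the author's code, which appears to have been written in R); the counts are copied from the table and the 0.5 cutoff mirrors the threshold stated in the text.

```python
TOTAL_MOVIES = 33296  # movies left after the 500-review filter

# Null counts copied from the summary table above.
null_counts = {
    "Certificate": 17587,
    "Metascore": 22350,
    "Gross": 21415,
    "Runtime": 132,
}

# Share of rows that are null for each field.
null_share = {field: count / TOTAL_MOVIES for field, count in null_counts.items()}

# Fields where more than half of the values are missing.
mostly_null = sorted(field for field, share in null_share.items() if share > 0.5)

for field, share in null_share.items():
    print(f"{field}: {share:.1%} null")
print("Mostly null:", mostly_null)
```

Certificate, Metascore, and Gross come out at roughly 53%, 67%, and 64% null, while Runtime is missing in well under 1% of rows.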

After this first filter, I was left with Genre, Rating, Year, Votes, SearchYear, and Runtime. In the Genre field, each movie can have more than one genre, separated by commas. So to capture the additive effect of having many genres, I transformed it using one-hot encoding. This resulted in 22 new boolean fields (one for each genre) with a value of 1 if the movie had this genre or 0 otherwise.
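As an illustration of this transformation, here is a small Python sketch of one-hot encoding a comma-separated genre field. The function name `one_hot_genres` and the sample rows are assumptions made for the example; the original analysis does not show its code.

```python
def one_hot_genres(genre_fields):
    """Return (sorted genre names, one 0/1 dict per movie)."""
    genres = sorted(
        {g.strip() for field in genre_fields for g in field.split(",") if g.strip()}
    )
    encoded = []
    for field in genre_fields:
        present = {g.strip() for g in field.split(",")}
        # 1 if the movie has the genre, 0 otherwise.
        encoded.append({genre: int(genre in present) for genre in genres})
    return genres, encoded


# Sample Genre values in the comma-separated form used by the dataset.
sample = ["Adventure,Drama,Fantasy", "Comedy", "Adventure,Comedy,Sci-Fi"]
genres, encoded = one_hot_genres(sample)

print(genres)
for row in encoded:
    print(row)
```

With the full dataset, the same approach would yield one column per distinct genre (22 in the article).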

IMDb Data Analysis

To see the correlations between variables, I calculated the correlation matrix.

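A correlation matrix among all the remaining original columns and the new genre columns. Numbers close to zero result in blank spaces in the grid. Negative correlations result in red dots and positive correlations in blue dots. The bigger and darker the dots, the stronger the correlation. (Visual highlights are described in the main article text.)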

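Here, values close to 1 represent a strong positive correlation, and values close to -1 a strong negative correlation. From this graph, I made several observations: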

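  • Year and SearchYear are perfectly correlated. This means that they probably have the same values and that having both is the same as having only one, so I kept only Year.
  • Some fields showed expected positive correlations, such as:
    • Music with Musical
    • Action with Adventure
    • Animation with Adventure
  • The same goes for negative correlations:
    • Drama vs. Horror
    • Comedy vs. Horror
    • Horror vs. Romance
  • Regarding the key variable (Rating), I noticed:
    • It has a significant positive correlation with Runtime and Drama.
    • It has a lower correlation with Votes, Biography, and History.
    • It has a considerably negative correlation with Horror and a lower negative one with Thriller, Action, Sci-Fi, and Year.
    • It doesn't have any other significant correlations.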
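The article does not say which correlation coefficient was used; assuming the common default, Pearson's r, each cell of such a matrix could be computed as in the Python sketch below. The sample columns are made up to show the two extreme cases discussed above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical columns: identical Year/SearchYear, mutually exclusive genres.
year = [1970, 1985, 2000, 2019]
search_year = [1970, 1985, 2000, 2019]
drama = [1, 1, 0, 0, 1]
horror = [0, 0, 1, 1, 0]

print(pearson(year, search_year))  # identical columns: perfectly correlated
print(pearson(drama, horror))      # never co-occur here: perfectly anti-correlated
```

Two columns that always hold the same value give a coefficient of 1, which is why keeping both Year and SearchYear adds nothing.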

It seemed that long dramas were well-rated, while short horror movies weren't. In my opinion (I didn't have the data to check it), this didn't correlate with the kind of movies that generate more profits, such as Marvel or Pixar movies.

It could be that the people who vote on this site are not the best representatives of the general public's criteria. It makes sense, because those who take the time to submit reviews on the site are probably some sort of movie critics with more specific criteria. Anyway, my objective was to remove the effect of common movie features, so I tried to remove this bias in the process.

Genre Distribution in the IMDb Rating System

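The next step was to analyze the distribution of each genre over the rating. To do that, I created a new field called Principal_Genre based on the first genre that appeared in the original Genre field. To visualize this, I made a violin graph.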

A violin plot showing the rating distribution for each genre.

Once more, I could see that Drama correlates with high ratings and Horror with lower ones. However, this graph also revealed other genres with good scores: Biography and Animation. That their correlations didn't appear in the previous matrix was probably because there were too few movies with these genres. So next I created a frequency bar plot by genre.

A bar chart showing how many movies of each genre were in the database. Comedy, Drama, and Action had frequencies around 6,000 or above; Crime and Horror were above 2,000; the rest were under 1,000.

Effectively, Biography and Animation have very few movies, as do Sport and Adult. For this reason, they are not very well correlated with Rating.
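Deriving Principal_Genre and counting movies per genre, as described above, can be sketched in a few lines of Python. The helper name `principal_genre` is hypothetical, and the sample values are Genre strings of the kind that appear in the dataset.

```python
from collections import Counter


def principal_genre(genre_field):
    """Return the first genre listed in a comma-separated Genre value."""
    return genre_field.split(",")[0].strip()


genre_fields = [
    "Adventure,Drama,Fantasy",
    "Drama",
    "Comedy,Horror,Mystery",
    "Drama,Romance",
    "Animation,Biography",
]

principal = [principal_genre(field) for field in genre_fields]

# Frequency of each principal genre, the quantity shown in the bar chart.
frequency = Counter(principal)
print(frequency.most_common())
```

Note that this counts only the first listed genre, so a genre that usually appears second or third (Biography in the last row, for instance) will look rarer than it is.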

Other Variables in the IMDb Rating System

After that, I started to analyze the continuous covariates: Year, Votes, and Runtime. In the scatter plot, you can see the relation between Rating and Year.

A scatter plot of rating and year.

As we saw previously, Year seemed to have a negative correlation with Rating: As the year increases, the rating variance also increases, reaching more negative values on newer movies.

Next, I made the same plot for Votes.

A scatter plot of rating and votes.

Here, the correlation was clearer: the higher the number of votes, the higher the rating. However, most movies didn't have that many votes, and in this case, Rating had a bigger variance.

Lastly, I looked at the relationship with Runtime.

A scatter plot of rating and runtime.

Again, we have a similar pattern but even stronger: Higher runtimes mean higher ratings, but there are very few cases with high runtimes.

Refinements to the IMDb Rating System

After all this analysis, I had a better idea of the data I was dealing with, so I decided to test some models to predict the ratings based on these fields. My idea was that the difference between my best model's predictions and the real Rating would remove the influence of the common features and reflect the particular characteristics that make a movie better than others.

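I started with the simplest model, the linear one. To evaluate which model performed better, I observed the root-mean-square (RMSE) and mean absolute (MAE) errors. They are standard measures for this kind of task. Also, they are on the same scale as the predicted variable, so they are easy to interpret.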
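For reference, both error measures follow directly from their definitions. The Python sketch below uses made-up ratings and predictions purely to show the arithmetic.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error: square root of the mean squared difference."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error: mean of the absolute differences."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Hypothetical ratings and model predictions (errors of 1, 0, and -2 points).
actual = [7.0, 5.0, 8.0]
predicted = [6.0, 5.0, 10.0]

print(f"RMSE: {rmse(actual, predicted):.3f}")
print(f"MAE:  {mae(actual, predicted):.3f}")
```

Because RMSE squares each error before averaging, the single 2-point miss weighs more heavily in it than in MAE, which is why RMSE is never smaller than MAE.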

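In this first model, the RMSE was 1.03 and the MAE 0.78. But linear models assume that the errors are independent, have a mean of zero, and have constant variance. If this is correct, the "residuals vs. predicted values" graph should look like a cloud without structure. So I decided to graph it to corroborate that.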

A scatter plot of residuals vs. predicted values.

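I could see that up to 7 in the predicted values, the plot has an unstructured shape, but after this value, it has a clear linear descending shape. Consequently, the model's assumptions don't hold, and also, I had an "overflow" in the predicted values, because in reality, Rating can't be more than 10.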
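To make the residual check concrete, the following Python sketch fits a one-variable least-squares line, computes residuals against predicted values, and flags any prediction above the maximum possible rating. It is a simplification under stated assumptions: the article's model used many predictors, and the runtimes and ratings here are invented.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


MAX_RATING = 10.0  # IMDb ratings cannot exceed 10

# Hypothetical data: runtime in minutes and rating.
runtimes = [90, 100, 110, 120, 150]
ratings = [5.5, 6.0, 6.2, 6.8, 7.5]

slope, intercept = fit_line(runtimes, ratings)
predicted = [intercept + slope * x for x in runtimes]
residuals = [y - p for y, p in zip(ratings, predicted)]

# Predictions above the valid range are the "overflow" discussed above.
overflow = [p for p in predicted if p > MAX_RATING]

for p, r in zip(predicted, residuals):
    print(f"predicted={p:.2f} residual={r:+.2f}")
print("overflowing predictions:", overflow)
```

Plotting the `residuals` against `predicted` gives the diagnostic graph; with an intercept in the model, the residuals average out to zero by construction, so the thing to look for is structure, not the mean.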

In the previous IMDb data analysis, with a higher number of Votes, the Rating improved; however, this happened in only a few cases and for a huge number of votes. This could cause distortions in the model and produce this Rating overflow. To check this, I evaluated what would happen with this same model, removing the Votes field.

A scatter plot of residuals vs. predicted values when the Votes field is removed.

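This was much better! It had a clearer unstructured shape, with no overflow in the predicted values. The Votes field also depends on reviewers' activity and is not a feature of the movie, so I decided to drop this field as well. The errors after removing it were 1.06 on RMSE and 0.81 on MAE: a little worse, but not by much, and I preferred to have better suppositions and feature selection than slightly better performance on my training set.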

IMDb Data Analysis: How Well Do Other Models Work?

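The next thing I did was to try different models to analyze which performed better. For each model, I used the random search technique to optimize hyperparameter values and 5-fold cross-validation to prevent model bias. The estimated errors are shown in the following table: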

| Model | RMSE | MAE |
|---|---|---|
| Neural Network | 1.044596 | 0.795699 |
| Boosting | 1.046639 | 0.7971921 |
| Inference Tree | 1.05704 | 0.8054783 |
| GAM | 1.0615108 | 0.8119555 |
| Linear Model | 1.066539 | 0.8152524 |
| Penalized Linear Reg | 1.066607 | 0.8153331 |
| KNN | 1.066714 | 0.8123369 |
| Bayesian Ridge | 1.068995 | 0.8148692 |
| SVM | 1.073491 | 0.8092725 |
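The combination of random search and 5-fold cross-validation used to produce these estimates can be illustrated with a toy example. The Python sketch below is an assumption-heavy stand-in: it tunes only the number of neighbors of a one-feature KNN regressor on synthetic data, whereas the article does not specify its tooling, search ranges, or number of search iterations.

```python
import math
import random


def k_fold_indices(n, k, rng):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def knn_predict(train, x, neighbors):
    """Average the targets of the `neighbors` closest training points."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:neighbors]
    return sum(y for _, y in nearest) / len(nearest)


def cv_rmse(data, neighbors, folds):
    """RMSE pooled over all held-out points across the folds."""
    squared = []
    for fold in folds:
        held_out = set(fold)
        train = [data[i] for i in range(len(data)) if i not in held_out]
        for i in fold:
            x, y = data[i]
            squared.append((y - knn_predict(train, x, neighbors)) ** 2)
    return math.sqrt(sum(squared) / len(squared))


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic (runtime, rating) pairs: a linear trend plus uniform noise.
data = [(r, 4.0 + 0.02 * r + rng.uniform(-0.5, 0.5)) for r in range(80, 180, 4)]

folds = k_fold_indices(len(data), 5, rng)

# Random search: sample a handful of candidate hyperparameter values.
candidates = [rng.randint(1, 10) for _ in range(6)]
scores = {k: cv_rmse(data, k, folds) for k in candidates}
best_k = min(scores, key=scores.get)

print("cross-validated RMSE per candidate:", scores)
print("best number of neighbors:", best_k)
```

Every point is predicted exactly once by a model that never saw it, so the reported error estimates performance on unseen movies instead of rewarding memorization of the training set.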

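As you can see, all models perform similarly, so I used some of them to analyze the data a little more. I wanted to know the effect of each field on the rating. The simplest way to do that is by observing the parameters of the linear model. But to avoid distortions in them, I first scaled the data and then retrained the linear model. The weights are pictured here.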

A bar chart of the linear model weights, ranging from nearly -0.25 for Horror to nearly 0.25 for Drama.

In this graph, it's clear that two of the most important variables are Horror and Drama, where the first has a negative impact on the rating and the second a positive one. There are also other fields that impact positively, like Animation and Biography, while Action, Sci-Fi, and Year impact negatively. Moreover, Principal_Genre doesn't have a considerable impact, so it's more important which genres a movie has than which one is the principal.
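The article does not state how the data was scaled before retraining. Assuming z-score standardization (subtract the mean, divide by the standard deviation), a common choice for making linear weights comparable, the step could look like this Python sketch; the sample columns are invented.

```python
import statistics


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    return [(value - mean) / spread for value in values]


# Hypothetical columns on very different scales.
runtime = [90, 100, 110, 150]     # minutes
year = [1975, 1990, 2005, 2019]   # calendar years

runtime_scaled = standardize(runtime)
year_scaled = standardize(year)

print([round(v, 2) for v in runtime_scaled])
print([round(v, 2) for v in year_scaled])
```

After scaling, a weight describes the change in rating per standard deviation of the field, so a weight on Year can be compared fairly with a weight on a 0/1 genre flag.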

With the generalized additive model (GAM), I could also see a more detailed impact for the continuous variables, which in this case was Year.

A graph of Year vs. s(Year) using the generalized additive model. The s(Year) value follows a curve starting near 0.6 for 1970, bottoming out below 0 at 2010, and increasing to near 0 again by 2019.

Here, we have something more interesting. While it was true that for recent movies, the rating tended to be lower, the effect was not constant. It has its lowest value in 2010 and then appears to "recover." It would be intriguing to find out what happened after that year in movie production that could have produced this change.

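The best model was the neural network, which had the lowest RMSE and MAE, but as you can see, no model reached perfect performance. But that was not bad news in terms of my objective. The information available let me estimate the rating somewhat well, but it is not enough. There is some other information that I couldn't get from IMDb that makes Rating differ from the expected score based on Genre, Runtime, and Year. It may be actor performance, movie scripts, photography, or many other things.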

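From my perspective, these other characteristics are what really matter in selecting what to watch. I don't care if a given movie is a drama, action, or science fiction. I want it to have something special, something that makes me have a good time, makes me learn something, makes me reflect on reality, or just entertains me.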

So I created a new, refined rating by taking the IMDb rating and subtracting the predicted rating of the best model. By doing this, I was removing the effect of Genre, Runtime, and Year and keeping this other unknown information that is much more important to me.
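In code, the refined rating is a single subtraction followed by a sort. The Python sketch below uses invented titles, ratings, and predictions; in the article, the predictions come from the best-performing model.

```python
def refined_rating(imdb_rating, predicted_rating):
    """Part of the rating not explained by Genre, Runtime, and Year."""
    return imdb_rating - predicted_rating


# Hypothetical movies: (title, IMDb rating, model-predicted rating).
movies = [
    ("Long drama", 8.0, 7.2),
    ("Short horror", 6.5, 4.9),
    ("Average comedy", 6.0, 6.1),
]

scored = [
    (title, rating, refined_rating(rating, predicted))
    for title, rating, predicted in movies
]

# Rank by refined rating, best first.
ranked = sorted(scored, key=lambda row: row[2], reverse=True)

for title, rating, refined in ranked:
    print(f"{title}: IMDb {rating:.1f}, refined {refined:+.2f}")
```

In this toy example, the horror movie outranks the drama despite its lower IMDb score, because it beats what the model expects from its genre by a wider margin.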

IMDb Rating System Alternative: The Final Results

Let's now see which are the 10 best movies by my new rating vs. by the real IMDb rating:

IMDb

| Title | Genre | IMDb Rating | Refined Rating |
|---|---|---|---|
| Ko to tamo peva | Adventure,Comedy,Drama | 8.9 | 1.90 |
| Dipu Number 2 | Adventure,Family | 8.9 | 3.14 |
| El señor de los anillos: El reverno del rey | Adventure,Drama,Fantasy | 8.9 | 2.67 |
| El señor de los anillos: La comunidad del anillo | Adventure,Drama,Fantasy | 8.8 | 2.55 |
| Anbe Sivam | Adventure,Comedy,Drama | 8.8 | 2.38 |
| Hababam Sinifi Tatilde | Adventure,Comedy,Drama | 8.7 | 1.66 |
| El señor de los anillos: Las dos torres | Adventure,Drama,Fantasy | 8.7 | 2.46 |
| Mudras Calling | Adventure,Drama,Romance | 8.7 | 2.34 |
| Interestelar | Adventure,Drama,Sci-Fi | 8.6 | 2.83 |
| Volver al futuro | Adventure,Comedy,Sci-Fi | 8.5 | 2.32 |

Mine

| Title | Genre | IMDb Rating | Refined Rating |
|---|---|---|---|
| Dipu Number 2 | Adventure,Family | 8.9 | 3.14 |
| Interestelar | Adventure,Drama,Sci-Fi | 8.6 | 2.83 |
| El señor de los anillos: El reverno del rey | Adventure,Drama,Fantasy | 8.9 | 2.67 |
| El señor de los anillos: La comunidad del anillo | Adventure,Drama,Fantasy | 8.8 | 2.55 |
| Kolah ghermezi va pesar khale | Adventure,Comedy,Family | 8.1 | 2.49 |
| El señor de los anillos: Las dos torres | Adventure,Drama,Fantasy | 8.7 | 2.46 |
| Anbe Sivam | Adventure,Comedy,Drama | 8.8 | 2.38 |
| Los caballeros de la mesa cuadrada | Adventure,Comedy,Fantasy | 8.2 | 2.35 |
| Mudras Calling | Adventure,Drama,Romance | 8.7 | 2.34 |
| Volver al futuro | Adventure,Comedy,Sci-Fi | 8.5 | 2.32 |

As you can see, the podium didn't change radically. This was expected because the RMSE was not so high, and here we are watching the top. Let's see what happened with the bottom 10:

IMDb

| Title | Genre | IMDb Rating | Refined Rating |
|---|---|---|---|
| Holnap történt - A nagy bulvárfilm | Comedy,Mystery | 1 | -4.86 |
| Cumali Ceber: Allah Seni Alsin | Comedy | 1 | -4.57 |
| Badang | Comedy,Fantasy | 1 | -4.74 |
| Yyyreek!!! Kosmiczna nominacja | Comedy | 1.1 | -4.52 |
| Proud American | Drama | 1.1 | -5.49 |
| Browncoats: Independence War | Action,Sci-Fi,War | 1.1 | -3.71 |
| The Weekend It Lives | Comedy,Horror,Mystery | 1.2 | -4.53 |
| Bolívar: el héroe | Animation,Biography | 1.2 | -5.34 |
| Rise of the Black Bat | Action,Sci-Fi | 1.2 | -3.65 |
| Hatsukoi | Drama | 1.2 | -5.38 |

Mine

| Title | Genre | IMDb Rating | Refined Rating |
|---|---|---|---|
| Proud American | Drama | 1.1 | -5.49 |
| Santa and the Ice Cream Bunny | Family,Fantasy | 1.3 | -5.42 |
| Hatsukoi | Drama | 1.2 | -5.38 |
| Reis | Biography,Drama | 1.5 | -5.35 |
| Bolívar: el héroe | Animation,Biography | 1.2 | -5.34 |
| Hanum & Rangga: Faith & The City | Drama,Romance | 1.2 | -5.28 |
| After Last Season | Animation,Drama,Sci-Fi | 1.7 | -5.27 |
| Barschel - Mord in Genf | Drama | 1.6 | -5.23 |
| Rasshu raifu | Drama | 1.5 | -5.08 |
| Kamifûsen | Drama | 1.5 | -5.08 |

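The same happened here, but now we can see that more dramas appear in the refined list than in IMDb's, which suggests that some dramas may be overranked just for being dramas.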

Maybe the most interesting podium to see is the 10 movies with the greatest difference between the IMDb rating system’s score and my refined one. These movies are the ones that have more weight on their unknown characteristics and make the movie much better (or worse) than expected for its known features.

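| Title | IMDb Rating | Refined Rating | Difference |
|---|---|---|---|
| Kanashimi no beradonna | 7.4 | -0.71 | 8.11 |
| Jesucristo Superstar | 7.4 | -0.69 | 8.09 |
| Pink Floyd The Wall | 8.1 | 0.03 | 8.06 |
| Tenshi no tamago | 7.6 | -0.42 | 8.02 |
| Jibon Theke Neya | 9.4 | 1.52 | 7.87 |
| El baile | 7.8 | 0.00 | 7.80 |
| Santa and the Three Bears | 7.1 | -0.70 | 7.80 |
| The Story of Scrooge | 7.5 | -0.24 | 7.74 |
| Piel de asno | 7 | -0.74 | 7.74 |
| 1776 | 7.6 | -0.11 | 7.71 |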
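The Difference column can be reproduced from the other two: it is the IMDb rating minus the refined rating. Since the refined rating was defined as the IMDb rating minus the model's prediction, this difference should also equal the predicted rating itself (an inference from the definitions, not something the article states). The Python sketch below checks the arithmetic on a few rows copied from the table, allowing for rounding in the displayed values.

```python
# (title, IMDb rating, refined rating, listed difference), copied from the table.
rows = [
    ("Kanashimi no beradonna", 7.4, -0.71, 8.11),
    ("Jesucristo Superstar", 7.4, -0.69, 8.09),
    ("Pink Floyd The Wall", 8.1, 0.03, 8.06),
    ("Tenshi no tamago", 7.6, -0.42, 8.02),
    ("Jibon Theke Neya", 9.4, 1.52, 7.87),
]

TOLERANCE = 0.015  # the table shows rounded values, so allow ~0.01 of slack

gaps = []
for title, imdb, refined, listed in rows:
    computed = imdb - refined  # equals the model's predicted rating
    gaps.append(abs(computed - listed))
    print(f"{title}: computed {computed:.2f}, listed {listed:.2f}")

consistent = all(gap <= TOLERANCE for gap in gaps)
print("all rows consistent within rounding:", consistent)
```

Read this way, the table lists the movies whose known features (genre, runtime, and year) led the model to expect the highest scores.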

If I were a movie director and had to produce a new movie, after doing all this IMDb data analysis, I could have a better idea of what kind of movie to make to have a better IMDb ranking. It would be a long animated biography drama that would be a remake of an old movie, for example, Amadeus. Probably this would assure a good IMDb ranking, but I'm not sure about profits…

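What do you think about the movies that rank at the top under this new measure? Do you like them? Or do you prefer the original ones? Let me know in the comments below!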

Understanding the basics

  • What does IMDb stand for?

    IMDb (the Internet Movie Database) is an online database of information related to audiovisual content.

  • What is IMDb's rating system?

    The IMDb rating system is a way of ordering audiovisual content by a score generated through the votes of its web users.

  • What type of database is IMDb?

    IMDb's main data is about movies: it stores the title, year, gross, duration, genre, and other common features.

  • What is IMDb's purpose?

    IMDb’s purpose is to be the biggest, principal encyclopedia of audiovisual content.

Located in Buenos Aires City, Buenos Aires, Argentina

Member since November 6, 2019
