Sunday, September 11, 2011

opensubtitles.com pretending to be big

For one of my data-mining projects, I decided to spend a weekend finding out how many subtitles you can actually get online for free. It took a bit of parallel Python to scrape an entire website, but I found a surprising discrepancy between the claimed database size and the actual number of subtitles they provide.


Opensubtitles.com claims to be, and is described in various blogs as, the biggest subtitle database online.

Claimed statistics:
68912 - movie count
1394569 - subtitle count

Actual statistics:
5486 - movie count
20003 - subtitle count

The top 5 languages are:
4029 English
2680 Polish
1876 Portuguese-BR
1643 Romanian
1061 Spanish

They are hiding the actual deficit in plain sight. Check, for example, two consecutive pages in the list of subtitles they provide: starting with pages 27 and 28, the rest of the 1438 pages show identical content.
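
A check like this can be automated by fingerprinting each listing page and looking for the first page whose content repeats an earlier one. The snippet below is only a minimal sketch under stated assumptions, not the scraper actually used for the numbers above: `fetch_page` is a hypothetical callable that returns the HTML of listing page `n`, and `fake_fetch` is a made-up stand-in so the example runs without touching the network. Real pages may contain changing elements (timestamps, ads), so in practice you would probably hash only the extracted subtitle entries rather than the raw HTML.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def page_fingerprint(html: str) -> str:
    """Return a hash of the page content so pages can be compared cheaply."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def first_repeated_page(fetch_page, last_page, workers=8):
    """Return the first page number whose content matches an earlier page.

    fetch_page is a hypothetical callable: page number -> page content (str).
    Returns None if every page in 1..last_page is distinct.
    """
    page_numbers = range(1, last_page + 1)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map fetches pages in parallel but yields results in input order.
        fingerprints = list(
            pool.map(lambda n: page_fingerprint(fetch_page(n)), page_numbers)
        )

    seen = set()
    for number, fingerprint in zip(page_numbers, fingerprints):
        if fingerprint in seen:
            return number
        seen.add(fingerprint)
    return None


def fake_fetch(n):
    """Made-up stand-in for an HTTP request: every page past 27 repeats page 27."""
    return f"listing {min(n, 27)}"


print(first_repeated_page(fake_fetch, 40))  # prints 28
```

With a real `fetch_page`, the returned page number marks where the listing stops providing new content, which gives an upper bound on how many distinct subtitles the listing actually exposes.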

On the other hand, this could be some kind of very strange bug on their part, but that seems very unlikely given the nature of the issue.

3 comments:

  1. Hi,

    admin of opensubtitles.org speaking (creator:). Thanks for the nice review. The limit of 1000 results is not a bug but a feature - check Google. I did it because of subtitle leeching; there are other people who want to download all of the subtitles, for whatever reason.
    Also, there are some hidden protections, so if you get some errors, they are usually not a coincidence but happen because the site detects abnormal behavior.

    If you are doing such research on my site, or on other sites, you should contact the authors first. There are currently 1429493 active subtitles on the website.

  2. Hi,

    Hmm, my code reloaded if it saw an error, so I can't comment on that, but I understand what you mean and what you wanted to achieve. It might be better, though, to do it in a slightly more transparent way: say, show the first 20 pages of XYZ and then don't let the user page beyond them. The way it is now is just too confusing.

    I tried Google; it gave me the first 100 pages and didn't allow paging further - exactly what I am suggesting ;-)

    By the way, would you be open to some kind of cooperation? I had a couple of data-mining ideas I wanted to try, it might be fun to surface a result of that on your website.

  3. You can contact me by e-mail, and you can also write in Slovak.
