Subtitle Downloader Technical Documentation
This page describes how the service modules were coded and how to add support for a new service.
Main Module
This module detects which service module should be used and calls it. The type of service is found with a simple string search for the service name in the input URL. If no supported service is found, or the service module fails, the error is handled here.
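A minimal sketch of the string search described above. The `SERVICES` tuple and the `detect_service` name are hypothetical and only illustrate the idea; the actual module may use different identifiers and error handling.

```python
# Hypothetical list of service names; the real module may use other identifiers.
SERVICES = ("hulu", "youtube", "amazon", "bbc", "crunchyroll", "netflix", "fox")


def detect_service(url):
    """Return the first known service name found in the URL, or None."""
    lowered = url.lower()
    for name in SERVICES:
        if name in lowered:
            return name
    return None


if __name__ == "__main__":
    print(detect_service("http://www.hulu.com/watch/123456"))
```

A plain substring search is simple but loose: a short name such as "fox" would also match unrelated URLs that happen to contain it, so the caller should still be prepared to handle a failing service module.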
Hulu
We first require the page source of the video. The function createSoupObject() fetches it with the requests module and parses the HTML with the BeautifulSoup library. The getTitle function returns the title of the video, which is present in the soup object and is also used for naming the file.
We then require the contentID of the video, which is also available in the HTML source. This is one of the methods for getting the contentID; if it fails, the alternative method is called. In the BeautifulSoup text, every video has this parameter: "content_id": "60535322"
So we first split the text using '"' (the double quote) as the delimiter, then access the contentID from the returned list. The function getSmiSubtitlesLink returns the SMI subtitle link based on the contentID. The XML link for any subtitled video is:
http://www.hulu.com/captions.xml?content_id=CONTENTID
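The quote-splitting step can be sketched as follows. The function names `find_content_id` and `captions_xml_url` are hypothetical; only the delimiter and the captions URL come from the description above.

```python
CAPTIONS_XML = "http://www.hulu.com/captions.xml?content_id="


def find_content_id(page_text):
    """Split on double quotes and return the value that follows "content_id"."""
    parts = page_text.split('"')
    for index, token in enumerate(parts):
        # After splitting, the value sits two tokens after the key:
        # ['...', 'content_id', ': ', '60535322', '...']
        if token == "content_id" and index + 2 < len(parts):
            return parts[index + 2]
    return None


def captions_xml_url(content_id):
    """Build the captions XML link for a contentID."""
    return CAPTIONS_XML + content_id


if __name__ == "__main__":
    sample = 'var config = {"content_id": "60535322", "other": "x"};'
    print(captions_xml_url(find_content_id(sample)))
```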
If multiple languages are present, the user is asked to choose one. We then convert the SMI URL to a VTT URL as follows -
http://assets.huluim.com/captions/380/60601380_US_en_en.smi ---> http://assets.huluim.com/captions_webvtt/380/60601380_US_en_en.vtt
The subtitles are then converted from VTT to SRT format in the standard way.
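The URL rewrite and the VTT to SRT step can be sketched as below. This is an illustration under assumptions, not the project's converter: the function names are hypothetical, and the conversion only handles plain cues (it drops the WEBVTT header and cue settings, numbers the cues, and switches the decimal separator to a comma).

```python
import re


def smi_to_vtt_url(smi_url):
    """Rewrite an SMI caption URL into the matching WebVTT URL."""
    vtt_url = smi_url.replace("/captions/", "/captions_webvtt/", 1)
    if vtt_url.endswith(".smi"):
        vtt_url = vtt_url[:-4] + ".vtt"
    return vtt_url


def to_srt_time(stamp):
    """Turn a VTT timestamp (MM:SS.mmm or HH:MM:SS.mmm) into HH:MM:SS,mmm."""
    if stamp.count(":") == 1:
        stamp = "00:" + stamp
    return stamp.replace(".", ",")


def vtt_to_srt(vtt_text):
    """Convert simple WebVTT cues to SRT text."""
    blocks = re.split(r"\n\s*\n", vtt_text.replace("\r\n", "\n").strip())
    srt_blocks = []
    for block in blocks:
        lines = block.split("\n")
        # Blocks without a timing line (header, notes) are skipped.
        timing_index = next((i for i, line in enumerate(lines) if "-->" in line), None)
        if timing_index is None:
            continue
        start, _, rest = lines[timing_index].partition("-->")
        end = rest.split()[0]  # ignore cue settings after the end time
        timing = "{} --> {}".format(to_srt_time(start.strip()), to_srt_time(end))
        text = "\n".join(lines[timing_index + 1:])
        srt_blocks.append("{}\n{}\n{}".format(len(srt_blocks) + 1, timing, text))
    return "\n\n".join(srt_blocks) + "\n"
```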
YouTube
We first require the page source of the video. The function createSoupObject() fetches it with the requests module and parses the HTML with the BeautifulSoup library. The getTitle function returns the title of the video, which is also used for naming the file.
Amazon
The subtitle URL for Amazon is obtained from a request built from the following base URL and parameters -
"PreURL":"https://atv-ps.amazon.com/cdp/catalog/GetPlaybackResources?",
"asin" : "" ,
"consumptionType" : "Streaming" ,
"desiredResources" : "SubtitleUrls" ,
"deviceID" : "b63345bc3fccf7275dcad0cf7f683a8f" ,
"deviceTypeID" : "AOAGZA014O5RE" ,
"firmware" : "1" ,
"marketplaceID" : "ATVPDKIKX0DER" ,
"resourceUsage" : "ImmediateConsumption" ,
"videoMaterialType" : "Feature" ,
"operatingSystemName" : "Linux" ,
"customerID" : "" ,
"token" : "" ,
"deviceDrmOverride" : "CENC" ,
"deviceStreamingTechnologyOverride" : "DASH" ,
"deviceProtocolOverride" : "Https" ,
"deviceBitrateAdaptationsOverride" : "CVBR,CBR" ,
"titleDecorationScheme" : "primary-content"
The primary parameters we need are the ASIN, customerID and token. The customerID and token are obtained from the config file, which setup.py generates from the user's login and password. The ASIN is taken directly from the video URL, for example:
https://www.amazon.com/dp/B019DSWVYC/?autoplay=1
Now add these parameters to the dictionary and generate the final URL by appending the encoded parameters to PreURL.
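A rough sketch of these two steps, assuming the ASIN always follows `/dp/` as in the sample URL (other Amazon URL shapes are not covered). The helper names `get_asin` and `build_playback_url` are hypothetical, and the customerID and token values in the usage lines are placeholders.

```python
import re
from urllib.parse import urlencode

PRE_URL = "https://atv-ps.amazon.com/cdp/catalog/GetPlaybackResources?"


def get_asin(video_url):
    """Return the ASIN that follows /dp/ in the video URL, or None."""
    match = re.search(r"/dp/([A-Z0-9]+)", video_url)
    return match.group(1) if match else None


def build_playback_url(params, asin, customer_id, token):
    """Fill in the three missing parameters and append the query to PRE_URL."""
    query = dict(params)  # copy so the caller's dictionary is left untouched
    query.update({"asin": asin, "customerID": customer_id, "token": token})
    return PRE_URL + urlencode(query)


if __name__ == "__main__":
    base_params = {"consumptionType": "Streaming", "desiredResources": "SubtitleUrls"}
    asin = get_asin("https://www.amazon.com/dp/B019DSWVYC/?autoplay=1")
    # "CUSTOMER" and "TOKEN" stand in for the values read from the config file.
    print(build_playback_url(base_params, asin, "CUSTOMER", "TOKEN"))
```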
The response from this final URL is where the subtitle URL is present. We get a JSON response that contains a subtitle URL in .dfxp format. We request that subtitle URL and download the subtitles. With BeautifulSoup and Python regular expressions we convert the .dfxp file to .srt format (file: Amazon_XmlToSrt.py).
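To illustrate the kind of conversion involved, here is a small sketch that uses the standard library's xml.etree.ElementTree rather than the BeautifulSoup and regex approach of Amazon_XmlToSrt.py. It assumes that cues are `<p>` elements with `begin` and `end` attributes in HH:MM:SS.mmm clock format; other time formats are not handled. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace from a tag name."""
    return tag.split("}")[-1]


def cue_text(element):
    """Collect the text of a cue, turning <br/> elements into line breaks."""
    parts = [element.text or ""]
    for child in element:
        parts.append("\n" if local_name(child.tag) == "br" else cue_text(child))
        parts.append(child.tail or "")
    return "".join(parts)


def dfxp_to_srt(dfxp_text):
    """Convert DFXP cues with clock-format timestamps to SRT text."""
    root = ET.fromstring(dfxp_text)
    cues = []
    for element in root.iter():
        if local_name(element.tag) != "p":
            continue
        begin, end = element.get("begin"), element.get("end")
        if begin is None or end is None:
            continue
        lines = [line.strip() for line in cue_text(element).split("\n")]
        text = "\n".join(line for line in lines if line)
        timing = "{} --> {}".format(begin.replace(".", ","), end.replace(".", ","))
        cues.append("{}\n{}\n{}".format(len(cues) + 1, timing, text))
    return "\n\n".join(cues) + "\n"
```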
BBC
We first need to extract the episode ID from the URL. Sample URL -
http://www.bbc.co.uk/iplayer/episode/p03rkqcv/shakespeare-lives-the-works
The episode ID is p03rkqcv. The episode PID and the episode title (used for naming the file) are present at the URL -
http://www.bbc.co.uk/programmes/<episode_id>.xml
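A short sketch of the episode ID extraction, assuming the ID is always the path segment right after `/episode/` as in the sample URL. Both function names are hypothetical.

```python
import re


def get_episode_id(url):
    """Return the path segment that follows /episode/, or None."""
    match = re.search(r"/episode/([^/?#]+)", url)
    return match.group(1) if match else None


def programme_xml_url(episode_id):
    """Build the programmes XML URL for an episode ID."""
    return "http://www.bbc.co.uk/programmes/{}.xml".format(episode_id)


if __name__ == "__main__":
    sample = "http://www.bbc.co.uk/iplayer/episode/p03rkqcv/shakespeare-lives-the-works"
    print(programme_xml_url(get_episode_id(sample)))
```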
The subtitle URL is present in the following link -
http://open.live.bbc.co.uk/mediaselector/5/select/version/2.0/mediaset/pc/vpid/
The PID here is the episode PID obtained above. Multiple PIDs are present, so we try each URL until a page request succeeds. When a request succeeds, we get the subtitle link by parsing the XML page with BeautifulSoup. The subtitles obtained are in XML format. They are converted to .srt using BeautifulSoup function calls and regular expressions. The conversion takes place in the file Bbc_XmlToSrt.py.
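The "try every PID until one works" loop might look like the sketch below. It assumes the PID is appended to the mediaselector link, and it takes a hypothetical `fetch` callable (URL in, status code and body out) so the logic can be shown without a network call; in practice `fetch` could wrap requests.get.

```python
MEDIASELECTOR = (
    "http://open.live.bbc.co.uk/mediaselector/5/select/"
    "version/2.0/mediaset/pc/vpid/"
)


def first_working_pid(pids, fetch):
    """Return (pid, body) for the first PID whose request succeeds, else None.

    `fetch` is a hypothetical callable: fetch(url) -> (status_code, body).
    """
    for pid in pids:
        # Assumption: the PID is simply appended to the mediaselector link.
        status, body = fetch(MEDIASELECTOR + pid)
        if status == 200:
            return pid, body
    return None


if __name__ == "__main__":
    def fake_fetch(url):
        # Stand-in for a real HTTP request; only one made-up PID "exists".
        return (200, "<mediaSelection/>") if url.endswith("pid-two") else (404, "")

    print(first_working_pid(["pid-one", "pid-two"], fake_fetch))
```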
CrunchyRoll
This is one of the methods for getting the subtitle ID. In the BeautifulSoup text, every video has this parameter.
https://www.crunchyroll.com/xml/?req=RpcApiSubtitle_GetXml&subtitle_script_id=206027
The encrypted subtitles are extracted from the above URL. The decryption of these subtitles is taken from another open source project, youtube-dl.
Netflix
The user needs to enter their Netflix username and password in the userconfig.ini file, because Netflix requires a login to download the subtitles.
We use Selenium for Python to automate the process. The first step is to log in to Netflix with the information from the config file. Chrome WebDriver is used as the Selenium driver. After a successful login, we request the video URL. The Chrome Network tab gives a list of resources fetched from the server. To obtain this list from Selenium, we use the command:
return window.performance.getEntries();
This command returns all the fetched URLs. It was observed that the subtitle URLs of all Netflix videos had the sub-string /?o in common, and that it was unique to them. So we query for /?o and let the browser keep fetching resources until such a URL is found. If no such URL is found before the timeout, we exit the application. If one is found, we save the URL and follow the standard procedure: we request the URL using the requests module and save the file. The module Netflix_XmlToSrt.py is used to convert XML to .srt format.
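The polling loop can be sketched as follows. `get_entries` is a hypothetical callable that returns the current performance entries; with Selenium it could be something like `lambda: driver.execute_script("return window.performance.getEntries();")`. The sketch assumes each entry arrives as a dict with a `name` key that holds the resource URL, and the timeout values are arbitrary.

```python
import time

MARKER = "/?o"


def find_subtitle_url(get_entries, timeout=30.0, interval=1.0):
    """Poll the fetched resources until one URL contains MARKER.

    Returns the matching URL, or None if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        for entry in get_entries():
            name = entry.get("name", "")
            if MARKER in name:
                return name
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


if __name__ == "__main__":
    fake_entries = [
        {"name": "https://example.com/player.js"},
        {"name": "https://example.com/?o=abc123"},
    ]
    print(find_subtitle_url(lambda: fake_entries, timeout=1.0, interval=0.1))
```

Returning None on timeout leaves the decision to exit the application with the caller.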
FOX
We first require the page source of the video. The function createSoupObject() fetches it with the requests module and parses the HTML with the BeautifulSoup library.
The video URL follows a specific standard throughout.
http://www.fox.com/watch/684171331973/7684520448
We need to split the URL and return "684171331973". This is the required contentID.
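A small sketch of the split, assuming the contentID is always the path segment right after `watch`. The name `fox_content_id` is hypothetical.

```python
from urllib.parse import urlparse


def fox_content_id(url):
    """Return the path segment that follows 'watch', or None."""
    segments = [part for part in urlparse(url).path.split("/") if part]
    if "watch" in segments:
        index = segments.index("watch")
        if index + 1 < len(segments):
            return segments[index + 1]
    return None


if __name__ == "__main__":
    print(fox_content_id("http://www.fox.com/watch/684171331973/7684520448"))
```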
This is the alternative method to obtain the contentID. In the soup text there is a meta tag which also contains the video URL. This is helpful in case the user inputs a shortened URL.
As stated above, we split the URL and return the required contentID, 684171331973. The other parameters required for obtaining the subtitle URL are also present in the HTML page source. The required script content looks like this -
jQuery.extend(Drupal.settings, {"":...............});
* We add everything to a new string after encountering the first "{".
* We remove the trailing ");" (the last parenthesis and the semicolon) to create valid JSON.
The JSON has a standard format and the required parameters follow this naming. The JSON content:
`{"foxProfileContinueWatching":{"showid":"empire","showname":"Empire"},..............`
`"foxAdobePassProvider": {......,"videoGUID":"2AYB18"}}`
We use the json module to parse the JSON and extract the parameters showid, showname and videoGUID.
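The extraction and parsing can be sketched as below. Instead of stripping the trailing ");" explicitly, this version slices from the first "{" to the last "}", which yields the same JSON text. The function names are hypothetical, and the key paths follow the JSON excerpt shown above.

```python
import json


def extract_settings(script_text):
    """Cut the JSON object out of a jQuery.extend(Drupal.settings, {...}); call."""
    start = script_text.index("{")
    end = script_text.rindex("}")
    return json.loads(script_text[start:end + 1])


def get_fox_parameters(settings):
    """Return (showid, showname, videoGUID) from the parsed settings."""
    show = settings["foxProfileContinueWatching"]
    guid = settings["foxAdobePassProvider"]["videoGUID"]
    return show["showid"], show["showname"], guid


if __name__ == "__main__":
    script = (
        'jQuery.extend(Drupal.settings, {"foxProfileContinueWatching":'
        '{"showid":"empire","showname":"Empire"},'
        '"foxAdobePassProvider":{"videoGUID":"2AYB18"}});'
    )
    print(get_fox_parameters(extract_settings(script)))
```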
Sample subtitle links -
[`http://static-media.fox.com/cc/sleepy-hollow/SleepyHollow_3AWL18_660599363942.srt`](http://static-media.fox.com/cc/sleepy-hollow/SleepyHollow_3AWL18_660599363942.srt)\
[`http://static-media.fox.com/cc/sleepy-hollow/SleepyHollow_3AWL18_660599363942.dfxp`](http://static-media.fox.com/cc/sleepy-hollow/SleepyHollow_3AWL18_660599363942.dfxp)
The standard followed is -
`http://static-media.fox.com/cc/[showid]/showname_videoGUID_contentID.srt`\
`http://static-media.fox.com/cc/[showid]/showname_videoGUID_contentID.dfxp`
Some subtitle URLs follow this standard instead -
`http://static-media.fox.com/cc/[showid]/showname_videoGUID.dfxp`\
`http://static-media.fox.com/cc/[showid]/showname_videoGUID.srt`
So we store both URLs and check both varieties. We request each variety of URL and save the subtitle file when a request succeeds.
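Building both varieties of URL could look like this sketch. The function name is hypothetical, and the showname is assumed to be passed in the form in which it appears in the URL (for example "SleepyHollow").

```python
BASE = "http://static-media.fox.com/cc/"


def candidate_subtitle_urls(showid, showname, video_guid, content_id):
    """Return every URL variety to try, with the contentID form first."""
    stems = [
        "{}_{}_{}".format(showname, video_guid, content_id),
        "{}_{}".format(showname, video_guid),
    ]
    return [
        "{}{}/{}.{}".format(BASE, showid, stem, extension)
        for stem in stems
        for extension in ("srt", "dfxp")
    ]


if __name__ == "__main__":
    for url in candidate_subtitle_urls(
        "sleepy-hollow", "SleepyHollow", "3AWL18", "660599363942"
    ):
        print(url)
```

Each candidate would then be requested in turn, and the first successful response saved as the subtitle file.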
### General rules
Each service has its own way of fetching the subtitles from the server. We can work out the method by following these steps -
* The easiest way is to first open the Developer Tools in Chrome/Firefox and check the XHR requests. Generally we find the subtitle URLs here.
* The next step is to find a general pattern in the subtitle URLs of that particular service.
* If a pattern is found, we can most likely request the subtitle page by forming the URLs from the required parameters.
* Generally, the parameters can be found in the HTML page source. We need to search for them and query the URL.
* Sometimes the required parameters for the URL are found in other links, in JSON format. A quick check of the fetched JSON resources will show whether they are available.
* For services such as Netflix, the parameters contain some kind of hashing that is difficult to reverse. In such cases we can use a Selenium browser and search the fetched resources for keywords like **.srt**, **.dfxp**, **cc**, **sub**.
* By checking multiple videos we can find common sub-strings in the subtitle URLs. These common sub-strings (which have to be unique to the subtitle URLs) can be used for querying the resources from the Selenium browser.
* In most cases, the subtitle URL is fetched only if the user is logged in. So we first need to set up the login and then go to the video URL in the WebDriver.
* The subtitles can then be downloaded from the URLs.
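When exploring a new service, the keyword search from the steps above can be tried on a list of fetched resource URLs. The sketch below is only an aid for that exploration; the function name is hypothetical, and short keywords such as "cc" will also match unrelated URLs, so the results still need a manual look.

```python
KEYWORDS = (".srt", ".dfxp", "cc", "sub")


def likely_subtitle_urls(urls, keywords=KEYWORDS):
    """Return the URLs that contain at least one of the keywords."""
    return [
        url for url in urls
        if any(keyword in url.lower() for keyword in keywords)
    ]


if __name__ == "__main__":
    fetched = [
        "http://example.com/player.js",
        "http://example.com/captions/video.dfxp",
        "http://example.com/video.mp4",
    ]
    print(likely_subtitle_urls(fetched))
```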
If you are a developer and want to add support for new services or fix bugs, please feel free to send a pull request or contact me for further assistance.