Web scraping comes in when we need to collect information from web pages automatically, using a script instead of copying it by hand. It is also the fallback when a website offers no web API for sharing its data with our app and we still want to extract that data.
This blog focuses on building a simple web scraper that collects general album information from Raaga, an Indian music streaming site. As an example, it takes the “Yuvans-Romantic-Hits” album at “https://www.raaga.com/tamil/album/Yuvans-Romantic-Hits-songs-TC0000280” and collects the song title, singer's name, and lyricist for each track.
Technologies we used:
NodeJS: JavaScript runtime built on Chrome’s V8 JavaScript engine
ExpressJS: Fast, unopinionated, minimalist web framework for Node.js.
Request: Helps us to make HTTP calls
Cheerio: Implementation of core jQuery specifically for the server (to traverse the DOM)
This Node application uses the dependencies specified in “package.json”, which we can install using “npm install”:
    {
        "name": "node-web-scrape-blog",
        "version": "1.0.0",
        "description": "Scrape Album information",
        "main": "index.js",
        "author": "Annamalai",
        "dependencies": {
            "express": "latest",
            "request": "latest",
            "cheerio": "latest"
        }
    }
In this tutorial, we will make a single request to Raaga and collect the following information:
- Title of the song
- Singer's name
- Lyricist
After compiling this information, the output will be saved to a JSON file in our project directory. The application focuses on the functional part, so no UI is involved at this point.
How Can We Implement A Web Scraper In Our Application?
To accomplish this, we follow the basic procedure below:
1) Launch the web server
2) Hit the URL (http://localhost:3000/webscrape) in a browser or REST API client
3) The scraper makes a request to the website we want to scrape
4) The request captures the HTML of the website and passes it along to our server
5) We traverse the DOM and extract the information we want
6) Next, we format the extracted data into the structure we need
7) Finally, we save this formatted data into a JSON file on our machine
The entire application logic lives in the index.js file:
    const express = require('express');
    const fs = require('fs');
    const request = require('request');
    const cheerio = require('cheerio');

    const app = express();

    app.get('/webscrape', function (req, res) {
        // Let's scrape the album
        const url = 'https://www.raaga.com/tamil/album/Yuvans-Romantic-Hits-songs-TC0000280';

        request(url, function (error, response, html) {
            if (error) {
                // Report the failure instead of leaving the client waiting
                return res.status(502).send('Request to Raaga failed: ' + error.message);
            }

            const $ = cheerio.load(html);
            const result = [];

            // Each matched element holds the details of one track
            $('.new-album-track-row .new-album-track-details').each(function () {
                const json = { title: '', singers: '', lyricist: '' };

                $(this).find('.new-track-name .history-ajaxy').each(function () {
                    json.title = $(this).text().trim();
                });

                // The first link is the singer, the second is the lyricist
                $(this).find('.new-singers-name .history-ajaxy').each(function (index) {
                    const text = $(this).text().trim();
                    if (index === 0) json.singers = text;
                    if (index === 1) json.lyricist = text;
                });

                result.push(json);
            });

            // Respond only after the file has actually been written
            fs.writeFile('raaga_output.json', JSON.stringify(result, null, 4), function (err) {
                if (err) {
                    return res.status(500).send('Could not write raaga_output.json: ' + err.message);
                }
                console.log('File successfully written! - Check your project directory for the raaga_output.json file');
                res.send('File successfully written! - Check your project directory for the raaga_output.json file');
            });
        });
    });

    app.listen(3000, function () {
        console.log('Web Scrape happens on port 3000');
    });

    module.exports = app;
Request Process
To make the request to the external website, we define the ‘/webscrape’ route. When it is hit, the handler calls the Raaga URL using the ‘request’ package and collects the HTML content.
The response HTML is parsed with ‘cheerio’ (a jQuery-like parser for server-side applications). Once the DOM is loaded, the data is extracted to build the required output.
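The ‘request’ package has been deprecated since 2020, so on Node.js 18 or later the download step could also be done with the built-in ‘fetch’. The sketch below is an alternative to the ‘request’ call in index.js, not part of the original application. The helper name ‘fetchHtml’ and its ‘fetchImpl’ parameter are hypothetical; the parameter is there only so the function can be tried out without a network call.

```javascript
// Hypothetical helper: download a page and return its HTML as a string.
// Assumes Node.js 18+, where fetch is available as a global.
async function fetchHtml(url, fetchImpl = globalThis.fetch) {
    const response = await fetchImpl(url);
    if (!response.ok) {
        // Treat non-2xx responses as failures rather than parsing an error page
        throw new Error('Request failed with status ' + response.status);
    }
    return response.text();
}
```

Inside an async route handler, this could be used as ‘const html = await fetchHtml(url);’ and the result passed to ‘cheerio.load(html)’ as before.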
Iterating Process In Node.js
Now we are ready to traverse the DOM and extract the information. First, open the Raaga album URL ‘https://raaga.com/tamil/album/Yuvans-Romantic-Hits-songs-TC0000280’ in Chrome Developer Tools and inspect the song title element.
To find the selector for a specific piece of information, click on that element in the inspector to see its classes. Here I have chosen the selector ‘.new-album-track-row .new-album-track-details’, which matches the details block of each track.
Within it, I have used these sub-classes to collect the data:
new-track-name – to collect the song title
new-singers-name – to collect the singer and lyricist information
The ‘new-singers-name’ class appears twice in each track: the first occurrence holds the singer information and the second holds the lyricist. To pick out the right one, we iterate with the ‘each’ function and use the index it passes to the callback. The results are collected in a JSON array, which is then stored in a JSON file (‘raaga_output.json’) as the final output.
Sample Response:
    [
        { "title": "High On Love", "singers": "Sid Sriram", "lyricist": "Niranjan Bharathi" },
        { "title": "Neee", "singers": "Yuvan Shankar Raja", "lyricist": "" },
        { "title": "Azhagae", "singers": "Arun Kamath", "lyricist": "Jonita Gandhi" },
        { "title": "Mazhai Megam", "singers": "Yuvan Shankar Raja", "lyricist": "Priya Jerson" },
        { "title": "Yaaro Ucchikilai Meley", "singers": "Yuvan Shankar Raja", "lyricist": "Rita" },
        { "title": "Paarai Mele", "singers": "Yuvan Shankar Raja", "lyricist": "Snehan" },
        { "title": "En Anbae En Anbae", "singers": "Shankar Mahadevan", "lyricist": "" },
        { "title": "Aiyaiyo", "singers": "Manicka Vinayagam", "lyricist": "Krishnaraj" },
        { "title": "Manmadhanae", "singers": "Sadhana Sargam", "lyricist": "Snehan" }
    ]
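Because the output is plain JSON, later scripts can load it back with the built-in ‘fs’ module. The sketch below is a hypothetical follow-up, not part of the original application: it writes two of the sample entries to a temporary file, reads them back, and lists the tracks whose lyricist is empty. The names ‘saveTracks’, ‘loadTracks’, and ‘missingLyricist’ are introduced here for illustration.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Write the tracks as pretty-printed JSON (4-space indent, as in index.js)
function saveTracks(file, tracks) {
    fs.writeFileSync(file, JSON.stringify(tracks, null, 4));
}

// Read the file back and parse it into an array of track objects
function loadTracks(file) {
    return JSON.parse(fs.readFileSync(file, 'utf8'));
}

// Titles of tracks for which no lyricist was found
function missingLyricist(tracks) {
    return tracks.filter((t) => t.lyricist === '').map((t) => t.title);
}

// Two entries taken from the sample response
const sample = [
    { title: 'High On Love', singers: 'Sid Sriram', lyricist: 'Niranjan Bharathi' },
    { title: 'Neee', singers: 'Yuvan Shankar Raja', lyricist: '' }
];

// A temporary file is used so the real raaga_output.json is not overwritten
const file = path.join(os.tmpdir(), 'raaga_output_example.json');
saveTracks(file, sample);
console.log(missingLyricist(loadTracks(file))); // prints [ 'Neee' ]
```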
That's it! We have successfully extracted the required data from the page. You can apply the same technique in Node.js to scrape data from other websites. Try it out and share your results with us. If you get stuck anywhere along the way, post your questions in the comment box; we're glad to help you out.
For more inquiries, reach us via info@agiratech.com