Sitemap generators let webmasters generate sitemaps for their websites instead of preparing them manually in a spreadsheet or writing a script. There are many ways to generate a sitemap for a website. For example, if you have a WordPress site, many sitemap-generating plugins are available.
I was working on a client project built with Ruby on Rails and had to generate a sitemap for it. Generating a sitemap is beneficial, and generating one with Ruby on Rails is a breeze for developers like us. Here I have simplified the process and walk through the step-by-step procedure for generating a sitemap and uploading it to Amazon S3. I hope this article helps when you come across a similar situation.
Before we dive into the process of generating a sitemap, let's understand what a sitemap actually does:
What is a sitemap?
A sitemap follows a protocol that tells search engine bots which URLs on your site are available for crawling, which helps get them indexed properly and positioned better. It shows how the website is organized, how each page relates to the rest of the site's content, and how pages are navigated from one level of the hierarchy to the next. Using sitemaps, webmasters can include information about each URL, such as when it was last updated, how frequently it changes, and how important it is relative to other URLs on the site. This makes the crawling process more insightful.
A sitemap normally looks like the example below. If you need more details, please check sitemaps.org.
For example:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.agiratech.co.uk/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.9</priority>
  </url>
  <!-- More URL definitions -->
</urlset>
The file starts with the sitemap schema definition (shortened here), followed by all the URLs to be mapped and indexed.
We can automate this process with the help of the sitemap_generator gem. You can also build the file manually using an XML builder, or hand-craft the XML.
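For comparison, here is a minimal sketch of the manual route in plain Ruby. The build_sitemap helper and the second URL are made up for illustration; this is not how the gem works internally, only one way the XML could be assembled by hand.

```ruby
require 'cgi'
require 'date'

# Hypothetical helper: builds a sitemap XML string from an array of hashes.
# Each entry needs :loc; :lastmod, :changefreq and :priority are optional.
def build_sitemap(entries)
  lines = ['<?xml version="1.0" encoding="UTF-8"?>',
           '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
  entries.each do |entry|
    lines << '  <url>'
    # Escape &, <, > and quotes so the URL stays valid inside XML.
    lines << "    <loc>#{CGI.escapeHTML(entry.fetch(:loc))}</loc>"
    lines << "    <lastmod>#{entry[:lastmod].iso8601}</lastmod>" if entry[:lastmod]
    lines << "    <changefreq>#{entry[:changefreq]}</changefreq>" if entry[:changefreq]
    lines << "    <priority>#{entry[:priority]}</priority>" if entry[:priority]
    lines << '  </url>'
  end
  lines << '</urlset>'
  lines.join("\n")
end

xml = build_sitemap([
  { loc: 'https://www.agiratech.co.uk/', lastmod: Date.new(2005, 1, 1),
    changefreq: 'weekly', priority: 0.9 },
  # Illustrative URL with a query string, to show the escaping.
  { loc: 'https://www.agiratech.co.uk/blogs?page=1&sort=new' }
])
puts xml
```

Even this small version has to deal with escaping and optional tags, which is a good reason to let the gem handle it.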
Using the gem
This gem is beneficial since it follows the Sitemap 0.9 protocol. Apart from regular links, it also supports image, video and geo sitemaps.
First, start by adding this to the Gemfile:
gem 'sitemap_generator'
Once you have run bundle install, run the rake task below to create a default config/sitemap.rb file that you can edit:
rake sitemap:install
Simple Example
Here is a simple example
# Set the host name for URL creation
SitemapGenerator::Sitemap.default_host = "http://www.agiratech.com"

# Pick a safe place to write the files
SitemapGenerator::Sitemap.public_path = 'tmp/sitemaps/'

SitemapGenerator::Sitemap.create do
  add clients_path, priority: 0.9
  add team_path, priority: 0.8
  add about_path, priority: 1.0
  add contact_path
  add blogs_path, changefreq: 'weekly'

  # One entry per blog post, addressed by its slug
  Blog.find_each do |blog|
    add blog_path(blog.slug), lastmod: blog.updated_at, priority: 0.7, changefreq: 'never'
  end
end
There are a few things to note here:
- Set default_host to your root website URL. The search engines reading your sitemap need to know what website they are dealing with.
- Set public_path to tmp/sitemaps so that the sitemap files are written there before uploading.
- Add URLs inside the create block; see below for more details.
Adding URLs
Call add in the block passed to create to add a path to your sitemap.
The blogs_path entry has changefreq set to weekly, as we want to tell crawlers and indexers how often that index page is likely to change. If we were to publish a new blog post every day, we could set it to daily.
For about_path, we've used the priority parameter and set it to 1.0, as we want indexers and crawlers to treat it as the most important page on the site. Note that priority only ranks a URL relative to the other URLs on the same site.
The last addition is more interesting, as it relates to indexing dynamic content. On our Blog model, we use the slug in the URL, so instead of having
https://www.agiratech.co.uk/blogs/1
we have
https://www.agiratech.co.uk/blogs/sitemap-generation
To get the blogs indexed correctly, we need to add the URL for each blog, built from its slug.
Additionally, we’ve set the changefreq to never, as once a blog is published, it’s unlikely to be changed.
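The changefreq and priority options used above accept a limited set of values under the sitemaps.org protocol. As a rough sketch, a check like the following could catch typos before the sitemap is generated. The helper name sitemap_option_errors is hypothetical and not part of the gem.

```ruby
# Values allowed for <changefreq> by the sitemaps.org protocol.
VALID_CHANGEFREQS = %w[always hourly daily weekly monthly yearly never].freeze

# Hypothetical helper: returns a list of problems with the options for one URL.
def sitemap_option_errors(options)
  errors = []
  freq = options[:changefreq]
  if freq && !VALID_CHANGEFREQS.include?(freq.to_s)
    errors << "changefreq must be one of #{VALID_CHANGEFREQS.join(', ')}"
  end
  priority = options[:priority]
  # The protocol defines priority as a number from 0.0 to 1.0.
  if priority && !(0.0..1.0).cover?(priority)
    errors << 'priority must be between 0.0 and 1.0'
  end
  errors
end

valid = sitemap_option_errors(priority: 0.7, changefreq: 'never')
invalid = sitemap_option_errors(priority: 1.5, changefreq: 'fortnightly')
p valid
p invalid
```

When priority is omitted, as with contact_path in the example, the protocol's default of 0.5 applies.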
Generating the sitemaps using Rails
The gem provides a series of rake tasks to create your sitemap:
rake sitemap:create
The above task generates the compressed XML file in the folder specified by public_path.
rake sitemap:refresh
The above task does the same as the previous one, but it also pings the Google and Bing search engines so they know to fetch your newly created sitemap and update their indexed information about the site. You can ping other search engines as well, as stated in the docs.
Finally, you should set a cron job on your server to call rake sitemap:refresh as often as needed.
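To check what sitemap:create actually wrote, it can help to decompress the file and list its URLs. The following is a small sketch using only Ruby's standard library. The sitemap_locations helper is hypothetical, and the demo writes a throwaway file so that it runs outside a Rails app.

```ruby
require 'zlib'
require 'tmpdir'

# Hypothetical helper: returns every <loc> value found in a gzipped sitemap.
def sitemap_locations(path)
  xml = Zlib.gunzip(File.binread(path))
  xml.scan(%r{<loc>(.*?)</loc>}m).flatten
end

# Demo with a temporary file standing in for tmp/sitemaps/sitemap.xml.gz.
locations = Dir.mktmpdir do |dir|
  path = File.join(dir, 'sitemap.xml.gz')
  xml = '<urlset><url><loc>https://www.agiratech.co.uk/</loc></url></urlset>'
  File.binwrite(path, Zlib.gzip(xml))
  sitemap_locations(path)
end
puts locations
```

In the application itself, the path would be the file under the configured public_path, for example tmp/sitemaps/sitemap.xml.gz.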
Uploading the sitemaps to S3
Normally, with the default configuration on a VPS, search engines should have no difficulty fetching the sitemap from your public folder, since the file would be reachable at, following our example, https://www.agiratech.co.uk/sitemap.xml.gz.
However, if our application is hosted on Heroku, we face two problems due to its ephemeral filesystem:
- We can't write in the public folder. That's why we used the tmp folder in our sitemap configuration file above.
- We can't guarantee how long what we save in the tmp folder will remain there.
To get around this, we need to host the generated sitemap somewhere else and then allow the search engines to access it. The sitemap_generator gem offers ways to save the generated file on S3 using Fog or CarrierWave, so if you already use either of those in your application, you can have a look at the gem's wiki page. However, installing Fog or CarrierWave just for this can be overkill, so here's a way to do it that depends only on the aws-sdk gem.
Once we have the aws-sdk gem installed, we also need an Amazon S3 bucket and the proper credentials set in the corresponding Heroku configuration panel and/or your local environment for tests:
- An S3 Access Key Id: ENV['S3_ACCESS_KEY_ID']
- An S3 Secret Access Key: ENV['S3_SECRET_ACCESS_KEY']
- The name of the bucket to use: ENV['S3_BUCKET']
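As a sketch, these variables could be read in one place with fetch, so that a missing value raises an error early instead of failing halfway through an upload. The helper name s3_sitemap_settings, the S3_REGION variable and its us-east-1 default are assumptions made for illustration; the list above does not name a region variable.

```ruby
# Hypothetical helper: collects the S3 settings from an ENV-like object.
# fetch raises KeyError when a required variable is missing.
def s3_sitemap_settings(env = ENV)
  {
    access_key_id: env.fetch('S3_ACCESS_KEY_ID'),
    secret_access_key: env.fetch('S3_SECRET_ACCESS_KEY'),
    bucket: env.fetch('S3_BUCKET'),
    # Assumption: S3_REGION is a made-up name with a made-up default.
    region: env.fetch('S3_REGION', 'us-east-1')
  }
end

# A plain hash stands in for ENV here so the sketch runs anywhere.
settings = s3_sitemap_settings(
  'S3_ACCESS_KEY_ID' => 'example-key-id',
  'S3_SECRET_ACCESS_KEY' => 'example-secret',
  'S3_BUCKET' => 'example-bucket'
)
puts settings[:bucket]
```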
Once these are set in settings.yml, we need a rake task like the following.
For example:
namespace 'sitemap' do
  desc 'Upload the sitemap files to S3'
  task :upload_to_s3 => :environment do
    Aws.config.update({
      :region => Settings.sitemaps.aws.region,
      :credentials => Aws::Credentials.new(Settings.sitemaps.aws.access_key_id, Settings.sitemaps.aws.access_key_secret)
    })
    # Reuse one client for all uploads
    s3 = Aws::S3::Client.new
    sitemap_dir = File.join(Rails.root, "tmp/sitemaps/")
    Dir.entries(sitemap_dir).each do |file_name|
      next unless file_name.include?('sitemap.xml.gz')
      # Read in binary mode, since the file is gzip-compressed
      file = File.binread(File.join(sitemap_dir, file_name))
      # Store under sitemaps/ so the key matches the URL we ping later
      key = "sitemaps/#{file_name}"
      s3.put_object(:bucket => Settings.sitemaps.aws.bucket, :key => key, :body => file, :acl => 'public-read')
      puts "Saved to S3: #{Settings.sitemaps.aws.bucket}/#{key}"
    end
  end
end
Using the above task, we write the files to our remote bucket under a sitemaps folder; the bucket should be configured as writable in your AWS panel.
Finally, we need a rake task that we can schedule with cron and that takes care of everything: creating the sitemap, uploading it to S3 and pinging the search engines:
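To try the upload logic without touching AWS, one option is to pass the client in as an argument. The sketch below is an assumption about how the task could be restructured, not the code used in the project. upload_sitemaps and FakeS3Client are hypothetical names, and the glob pattern is deliberately a little broader than the check in the task so that numbered files such as sitemap1.xml.gz would also match.

```ruby
require 'tmpdir'

# Hypothetical helper: uploads every sitemap file in `dir` through `client`,
# which only needs to respond to put_object the way Aws::S3::Client does.
def upload_sitemaps(dir, client:, bucket:, prefix: 'sitemaps/')
  Dir.glob(File.join(dir, '*sitemap*.xml.gz')).sort.map do |path|
    key = "#{prefix}#{File.basename(path)}"
    client.put_object(bucket: bucket, key: key,
                      body: File.binread(path), acl: 'public-read')
    key
  end
end

# Stand-in client that records calls instead of contacting AWS.
class FakeS3Client
  attr_reader :calls

  def initialize
    @calls = []
  end

  def put_object(params)
    @calls << params
  end
end

fake = FakeS3Client.new
keys = Dir.mktmpdir do |dir|
  File.binwrite(File.join(dir, 'sitemap.xml.gz'), 'gz-bytes')
  File.binwrite(File.join(dir, 'notes.txt'), 'ignored')
  upload_sitemaps(dir, client: fake, bucket: 'example-bucket')
end
p keys
```

Inside the rake task, the same helper could be called with Aws::S3::Client.new as the client.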
Rake::Task["sitemap:create"].enhance do
  if Rails.env.production? && Settings.sitemaps.ping_enabled?
    Rake::Task["sitemap:upload_to_s3"].invoke
    SitemapGenerator::Sitemap.ping_search_engines(:sitemap_index_url => "https://#{Settings.sitemaps.aws.bucket}.s3.amazonaws.com/sitemaps/sitemap.xml.gz")
  end
end
We are extending the default rake task using enhance. Note that in the last call we send the search engines the URL where they can find our sitemap, since the file is not on our own server.
Configure sitemap in robots.txt
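robots.txt is a standard file that websites use to communicate with web crawlers and other web robots. In your public/robots.txt, set Sitemap to the URL of your remote sitemap. Because robots.txt is a static file, write your actual bucket name in place of the #{Settings.sitemaps.aws.bucket} placeholder: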
Sitemap: https://#{Settings.sitemaps.aws.bucket}.s3.amazonaws.com/sitemaps/sitemap.xml.gz
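One way to fill in the bucket name is to render the file contents from Ruby. This is only a sketch: the helpers s3_sitemap_url and robots_txt and the example-bucket name are hypothetical, and the allow-everything rule is an assumption about what the rest of the file contains.

```ruby
# Hypothetical helper: builds the public URL of the uploaded sitemap index.
def s3_sitemap_url(bucket, key = 'sitemaps/sitemap.xml.gz')
  "https://#{bucket}.s3.amazonaws.com/#{key}"
end

# Hypothetical helper: renders a minimal robots.txt that allows all crawlers
# and points them at the remote sitemap.
def robots_txt(bucket)
  <<~ROBOTS
    User-agent: *
    Disallow:

    Sitemap: #{s3_sitemap_url(bucket)}
  ROBOTS
end

puts robots_txt('example-bucket')
```

The resulting text could be written to public/robots.txt before deploying, or served from a small controller action.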
Schedule sitemap in cron
With the help of a scheduler or cron, we can automate the above rake task using the command below:
rake sitemap:refresh
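If the server runs cron, the whenever gem is one way to keep this schedule in the repository. The fragment below is a configuration sketch that assumes whenever is installed; the file location is that gem's convention and the daily 5:00 am slot is an arbitrary choice.

```ruby
# config/schedule.rb (read by the whenever gem, not run directly)
# Assumption: refreshing once a day at 5:00 am is frequent enough.
every 1.day, at: '5:00 am' do
  rake 'sitemap:refresh'
end
```

On Heroku, where cron is not available, a scheduler add-on can run the same rake command instead.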
Conclusion
Finally, sitemaps are particularly beneficial in the following cases:
- If an area of a website is not reachable through the browsable interface.
- Search engines normally don't process Ajax, Flash or Silverlight content, so if a webmaster uses this kind of content, having a sitemap can be beneficial.
- If the site is huge, web crawlers may sometimes overlook new or recently updated content. Also, if you have many pages on your website, there is a chance that they are not well linked to each other. A sitemap is beneficial in these cases.
To conclude, I hope this post is informative and helpful to you. With Ruby on Rails experience, generating this sitemap took me just a few minutes. Our team at Agira Technologies has worked on different projects using Ruby on Rails. Follow us to learn more about our Ruby on Rails work.