What are Search Engine Spiders and Why Should I Care?

Search engine spiders, also referred to as search engine robots or web crawlers, are software bots created by the individual search engine companies to scan the contents of a website. This includes file types such as htm, html, css, php, jpeg, gif, mp3 and mov. After a spider has scanned a website, it reports that information back to the search engine's index. When the information reaches the index, it is placed in a queue. In most cases, any new or updated content that you create will not be available immediately in the search results.
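
To give a rough feel for what "scanning" a page involves, here is a minimal Python sketch of one step a spider performs: pulling the links out of a page so it knows where to go next. The names LinkCollector and extract_links and the example.com addresses are made up for illustration; real search engine spiders are far more elaborate than this.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href="..."> found on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # address of the page being scanned
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links such as /about.html into full URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the list of links a crawler would queue up from this page."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page content, used only to demonstrate the idea.
page = '<a href="/about.html">About</a> <a href="https://example.org/">Elsewhere</a>'
print(extract_links(page, "https://example.com/index.html"))
```

A spider repeats this step for every link it finds, which is how it works its way through a whole website.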

While the information for your website sits in the queue, it is checked for quality. As you can imagine, the search engines want to ensure that your content is worth sharing with others. When a web page is found to be “Not in Compliance”, that page, or even the entire site, can be removed from the index. Typically, new content finds its way into the search results within a week or so. If it’s taking longer than a week for your website changes to show up, you may need to add more quality content to your page or post.

It’s best to keep the search engines away from a page, post or website until it’s ready for the public. To block the search engines, you simply need to edit or create a robots.txt file. This simple text file, which needs to reside in the root of your website so that it can be reached at /robots.txt, gives instructions to web crawlers. Below is an example of a robots.txt file that allows full access to all content on your web server.

User-agent: *
Disallow:

As you can tell, there’s not much to this. The first line of the robots.txt file says which search engines the rules apply to. The * is a wildcard and signifies “all”. The next line tells the search engines what you’d like them to stay out of. In the example above, there is nothing after the word Disallow: so nothing is off limits. This implies that you have nothing to hide.

The next example shows a robots.txt file that is requesting all search engines to stay away and not index the homepage or any other page for that matter.

User-agent: *
Disallow: /

The / character here stands for the root of your website. Because every address on your site starts with /, this rule covers “all” of it. There are lots of ways to modify a robots.txt file, but in the end it comes down to your specific requirements. You can block only certain search engines, or keep only certain parts of your website out of the search engines, simply by using the robots.txt file.
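
If you’d like to check what a robots.txt file actually permits before you upload it, Python’s built-in urllib.robotparser module can read the rules for you. The sketch below is only an illustration: the helper name can_crawl, the crawler name ExampleBot and the example.com addresses are all made up, and real search engines may interpret some rules slightly differently.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The two examples from the article.
allow_all = "User-agent: *\nDisallow:\n"
block_all = "User-agent: *\nDisallow: /\n"

# Block a single (hypothetical) crawler and let everyone else in.
block_one_bot = "User-agent: ExampleBot\nDisallow: /\n\nUser-agent: *\nDisallow:\n"

# Keep every crawler out of one folder only.
block_folder = "User-agent: *\nDisallow: /example-directory/\n"

page = "https://example.com/index.html"
print(can_crawl(allow_all, "ExampleBot", page))      # True
print(can_crawl(block_all, "ExampleBot", page))      # False
print(can_crawl(block_one_bot, "ExampleBot", page))  # False
print(can_crawl(block_one_bot, "OtherBot", page))    # True
print(can_crawl(block_folder, "OtherBot", "https://example.com/example-directory/photo.jpg"))  # False
print(can_crawl(block_folder, "OtherBot", page))     # True
```

Keep in mind that robots.txt is a request, as noted above. Well-behaved crawlers follow it, but it does not password-protect anything.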

In addition to the robots.txt file, another way to ensure the search engines find exactly what you want them to see is to use an XML site map. Even though the search engine spiders are good at locating all types of content, including images and videos, on occasion they may miss something. A site map acts as the search engines’ guide through your posts, pages, categories, tags and just about everything else. For those who use WordPress, there are free plugins such as “WordPress SEO by Yoast” that help to automatically generate the XML files needed for the crawlers. For those who create websites using html, php or asp technologies, use an XML site map generator website. These can easily be found by searching for the term “xml site generator”.
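
For the curious, an XML site map is simply a list of page addresses wrapped in a few standard tags. The Python sketch below builds a very small one using the sitemaps.org format. The function name build_sitemap and the example.com addresses are made up for illustration, and plugins and generator sites usually add extra details, such as last-modified dates, that are left out here.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a minimal XML site map (as text) listing the given page URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url  # special characters are escaped automatically
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical pages, used only to show the output format.
sitemap = build_sitemap([
    "https://example.com/",
    "https://example.com/about.html",
])
print(sitemap)
```

The resulting text is typically saved as sitemap.xml in the root of the website, next to robots.txt.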

Other helpful tips like these can be found in a new SEO training system called SEO Earthquake. This groundbreaking training system gives you the knowledge you need to take your website to the next level. Learn more than just search engine optimization. Learn how to run your website like a boss! The entire training system is 100% newbie friendly and comes with a “No Questions Asked – Money Back Guarantee”. If you are looking to increase your online leads, customer inquiries and sales, this training system is for you! Head over to the SEO Earthquake website today.


What are your thoughts, opinions or questions about this article?

3 comments on “What are Search Engine Spiders and Why Should I Care?”
  1. Stephanie says:

    How would I block only certain areas of my site from the crawlers? I’d like to block my images from being indexed.

    • To hide a specific directory on your web server, copy the following code into your robots.txt file. Be sure to change “example-directory” to match the folder you’d like to be hidden.

      User-agent: *
      Disallow: /example-directory/

      To learn more about the robots.txt file, click here.

  2. Rob Turner says:

    Thanks for the info.
