





How and Why to Build a Robots.txt


 

 

Some of you have asked, "How do I keep 'search engine A' from indexing pages designed for 'search engine B'?" The answer is to use a robots.txt file. There are also other reasons for wanting to keep search engines from indexing some or all pages on a site. Therefore, I've put together this detailed article to show you how to do that and how to avoid the common mistakes that are made all too often.

If you create different versions of essentially the same doorway page and every search engine indexes every copy of the page, then you could, in theory, get in trouble for spamming. AltaVista in particular is known to dislike duplicate or near duplicate content. Therefore, if you create pages that are too similar, you run the risk of being red-flagged.

In practice, many people don't worry about having too many duplicate pages indexed by a single search engine because they are not creating huge numbers of similar pages. In fact, I spoke to the CEOs of three search engine positioning companies. Each said they did not use robots.txt files, although their reasons varied.

If your pages vary enough in the size and number of words, then you should not need to worry about being red-flagged. If you focus primarily on optimizing existing pages on your site that have unique content rather than building a lot of new pages that are similar, you'll also avoid any potential problems.

Other people simply submit pages designed for a particular search engine to just the engine(s) to which the page applies. This can be the simplest method to avoid spamming the search engines. This can work IF there are no other links to that page from the rest of your site. However, if another search engine spider manages to find a link to that page, it could index it even though you never submitted the page to that engine.

Despite this, two of the consulting companies I spoke to said submitting hallway pages that pointed to just the pages built for that engine worked well for them and they'd done it successfully for years. If you're unfamiliar with the hallway page concept, see:

http://www.webposition.com/mp-0799.htm#ONE

However, if you want to create a lot of doorway pages, targeted for a large number of engines, where many will be very similar to each other, you should consider using a robots.txt file. This file can tell the search engine spiders which pages they are not allowed to index. That way you can build pages for search engine A and tell search engine B to ignore them. The search engines like this because it keeps them from indexing pages that don't apply to them. Therefore, it benefits the search engines and the people who use them, and it keeps you from being labeled as a spammer.

I have seen some people debate whether the search engines even honor robots.txt files since this is a purely voluntary feature of the Web. However, the search engines have historically been challenged by companies - and in the courts - about indexing copyrighted materials without the permission of the copyright holder. The search engines' most prominent argument for being able to index copyrighted material without permission is that the Web site owner always has the option to exclude their indexing by creating a robots.txt file.

Therefore, it's unlikely the search engines would intentionally ignore the robots.txt, since doing so could get them into unnecessary legal problems. In theory, they might spider the page and then drop it after checking the robots.txt file. This may explain reports I've heard from a couple of people who claim the spider ignored their robots.txt file because they saw it opening the page in their log file. Another explanation is that the Webmaster used the wrong syntax when creating the robots.txt. Therefore, always double-check your work.

I'll try to point out common errors in this article and give plenty of examples. Please don't be intimidated. It really is not as hard as it looks. There is also a method that lets you set the file up once and then never have to touch it again. I'll explain this method near the end of this article.

 

To create a robots.txt file, open Windows Notepad or any other editor that can save plain ASCII .txt files. Use the following syntax to exclude a file name from a particular search engine spider:

User-agent: {SpiderNameHere}
Disallow: {FilenameHere}

Note: For the purpose of this article, the term spider and search engine may be used interchangeably.

For example, to tell Excite's spider, called ArchitextSpider, not to index the files orderform.html, product1.html, and product2.html, create a robots.txt file as follows:

User-agent: ArchitextSpider
Disallow: /orderform.html
Disallow: /product1.html
Disallow: /product2.html
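
One way to sanity-check a file like the one above before uploading it is to run it through a robots.txt parser. The sketch below uses the parser in Python's standard library. It only shows how that parser reads the rules, which is not necessarily how Excite or any other engine reads them. The helper name is_allowed and the page /index.html are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The example file from the article, as a string.
ROBOTS_TXT = """\
User-agent: ArchitextSpider
Disallow: /orderform.html
Disallow: /product1.html
Disallow: /product2.html
"""


def is_allowed(robots_txt, agent, path):
    """Return True if `agent` may fetch `path` under the given robots.txt text.

    Hypothetical helper: it reflects the behavior of Python's own parser only.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)
```

Calling is_allowed(ROBOTS_TXT, "ArchitextSpider", "/orderform.html") reports that the page is blocked for that spider, while the same call with "Scooter" reports it as allowed, since the record names only ArchitextSpider.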

The original robots.txt standard says that field names such as "User-agent:" are case-insensitive, so "User-Agent:" should be accepted as well. Whether every spider follows the standard on this point, I cannot say for certain. To be safe, write the field names exactly as shown and keep file names in the same case they have on your server. In addition, make sure you include a forward slash before the file name if the file is in the root directory.

The User-agent line is the identifier for the search engine you wish to target. It is like a "code name" for the search engine's spider that goes around and indexes pages on the Web. It may be similar to the name of the search engine or it may be completely different. (I'll list the official User-agent names for the major engines later in this article).

Once you create your robots.txt, you would then upload this text file to the root directory of your Web site. Although robots.txt is a voluntary protocol, most major search engines will honor it. If you do not have your own domain name but instead use a subdirectory off of your host's domain, then your robots.txt may not be recognized in theory since standard practice is to look only at the root directory of the domain. This is just one more reason to invest in your own domain name!
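
To see why a site living in a subdirectory of its host's domain has a problem, it helps to work out where a spider would look for the file. The sketch below assumes the standard practice described above, where only the root of the host is checked. The function name robots_url and the example.com addresses are placeholders, not anything a search engine publishes.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address conventionally checked for `page_url`.

    Assumes the spider looks only at the root of the host, so the page's
    own directory plays no part in the result.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
```

For a page at http://www.example.com/members/mysite/page.html this gives http://www.example.com/robots.txt, a file that belongs to the host rather than to you. A page on its own machine name, such as http://myproduct.example.com/index.html, gives http://myproduct.example.com/robots.txt instead.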

You can add additional lines to exclude pages from other engines by specifying the User-agent line again in the same file, followed by more Disallow lines. Each Disallow statement applies to the most recently specified User-agent.

If you want to exclude an entire directory, use this syntax:

User-agent: ArchitextSpider
Disallow: /mydirectory/

A common mistake is to include the asterisk after the directory name to indicate that you want to exclude all files in that directory. However, the proper syntax is to NOT include any asterisks in the Disallow statement. According to the robots.txt specifications, it is implied that the above statement will disallow all files in "mydirectory."

To disallow a file named product.htm in the "mydirectory" subdirectory, do this:

User-agent: ArchitextSpider
Disallow: /mydirectory/product.htm
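
The directory rule works without an asterisk because a Disallow value is treated as a prefix of the path. The sketch below illustrates this with Python's standard library parser, which is only a stand-in for a real spider. The helper blocked_paths and the sample page names are invented for the example.

```python
from urllib.robotparser import RobotFileParser

DIRECTORY_RULE = """\
User-agent: ArchitextSpider
Disallow: /mydirectory/
"""

# Hypothetical pages used to probe the rule.
PATHS = [
    "/mydirectory/product.htm",
    "/mydirectory/sub/page.htm",
    "/mydirectory.htm",
    "/index.htm",
]


def blocked_paths(robots_txt, agent, paths):
    """Return the subset of `paths` that `agent` may not fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in paths if not parser.can_fetch(agent, path)]
```

With this parser, everything under /mydirectory/ is blocked for ArchitextSpider, including pages in deeper subdirectories, while /mydirectory.htm is not, because it does not begin with the full value "/mydirectory/".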

You can exclude pages from ALL spiders with this User-agent:

User-agent: *

In the case of the User-agent line, you CAN use the asterisk as a wildcard.

To disallow all pages on your Web site for the specified spider use:

Disallow: /

To reiterate, you use only a forward slash to indicate you want to disallow your entire site. Do NOT use an asterisk here. It's important that you use the proper syntax. If you misspell something, it may not work and you won't know it until it's too late! Certain search engines may handle common syntax variations without problems, but that doesn't guarantee that they will all tolerate variances in the syntax. Therefore, play it safe. If at some point you do find that your syntax was wrong, don't panic. Correct the problem and then re-submit. The search engine will then re-spider the site and drop the pages that you excluded.
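
Since a misspelled line fails silently, a small checker can catch the mistakes described so far before the file goes live. The following is a minimal sketch, not a full validator. It knows only the two fields covered in this article, so it will flag any extension fields a particular engine may support. The function name lint_robots and the warning messages are made up for illustration.

```python
# Fields covered in this article; anything else is reported as unknown.
KNOWN_FIELDS = {"user-agent", "disallow"}


def lint_robots(text):
    """Return a list of (line_number, message) warnings for common mistakes."""
    warnings = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments
        if not line:
            continue  # blank lines separate records and are fine
        if ":" not in line:
            warnings.append((number, "missing ':' separator"))
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower() not in KNOWN_FIELDS:
            warnings.append((number, f"unknown field {field!r}"))
        elif field.lower() == "disallow":
            if "*" in value:
                warnings.append((number, "asterisk in Disallow value"))
            elif value and not value.startswith("/"):
                warnings.append((number, "Disallow value lacks leading '/'"))
    return warnings
```

Run against a file containing "Disalow: /private/", "Disallow: /mydirectory/*", and "Disallow: orderform.html", it reports one warning per line: a misspelled field, an asterisk, and a missing forward slash.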

If you wish to include comments in your robots.txt file, you should precede them with a # sign like this:

# Here are my comments about this entry.

Each set of disallow statements should be separated by a blank line. For example, you might have something like the following to exclude different files from different spiders:

User-agent: ArchitextSpider
Disallow: /mydirectory/product.htm
Disallow: /mydirectory/product2.htm

User-agent: Slurp
Disallow: /mydirectory/product3.htm
Disallow: /mydirectory/product4.htm


The blank line between the two groups is important to group things into "records."

If, on the other hand, you wanted to exclude the same set of files for more than one spider, you could do something like this:

User-agent: ArchitextSpider
User-agent: Slurp
Disallow: /mydirectory/product.htm
Disallow: /mydirectory/product2.htm
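
The difference between the two layouts, separate records versus one record shared by two spiders, can be checked with a parser. The sketch below feeds both examples to Python's standard library parser and lists what each spider is blocked from. It shows that parser's reading only, and the helper name blocked_for is an assumption of this example.

```python
from urllib.robotparser import RobotFileParser

# Two records separated by a blank line: each spider has its own rules.
SEPARATE_RECORDS = """\
User-agent: ArchitextSpider
Disallow: /mydirectory/product.htm
Disallow: /mydirectory/product2.htm

User-agent: Slurp
Disallow: /mydirectory/product3.htm
Disallow: /mydirectory/product4.htm
"""

# One record with two User-agent lines: both spiders share the rules.
SHARED_RECORD = """\
User-agent: ArchitextSpider
User-agent: Slurp
Disallow: /mydirectory/product.htm
Disallow: /mydirectory/product2.htm
"""

PAGES = [
    "/mydirectory/product.htm",
    "/mydirectory/product2.htm",
    "/mydirectory/product3.htm",
    "/mydirectory/product4.htm",
]


def blocked_for(robots_txt, agent, pages=PAGES):
    """Return the pages that `agent` may not fetch under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [page for page in pages if not parser.can_fetch(agent, page)]
```

With separate records, ArchitextSpider is kept away from the first two pages and Slurp from the last two. With the shared record, both spiders are kept away from the same two pages.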

Side note about subdirectories: Some Webmasters like to organize their doorway pages into different subdirectories according to which search engine they are optimized for. However, some engines are suspected of assigning lower rankings to pages appearing in subdirectories versus the root directory of a Web site. If they perceive that those pages belong to a Web site that shares a domain with its host, they could discriminate against those pages as being potentially of a lesser quality. I asked three search engine consultants their opinion of subdirectories. The general feeling was that keeping pages in the root directory was probably better, but they'd not seen evidence that subdirectories caused problems.

If you were still concerned about being penalized for keeping pages in subdirectories and wished to use them, you could ask your hosting service to give you "machine names" like myproduct.mydomain.com that you could submit. The myproduct.mydomain.com URL could then be configured by your hosting service to point to your "myproduct" subdirectory or whatever directory you desired. That way no discrimination could occur by the search engine since they would not see the subdirectory in the URL. In addition, you could include keywords in that machine name which may also improve your rankings. (Note: A machine name is normally just "www." prefixed at the start of your domain name. However, rather than "www." it could be any name you desire and it could point to any location on any physical machine).

We are often asked about the proper names for the User-agent. The name of the agent does not always correspond to the name of the search engine. Therefore, you can't just put in "AltaVista" in the User-agent and expect AltaVista to exclude your designated pages. Don't ask me why it can't be that simple. Perhaps it's a job security plan for professional Webmasters :-)

In any case, there's a lot of confusion in newsgroup forums and on the Web about what the proper agent names should be. The confusion derives from Webmasters reading their server log files and noticing all kinds of complicated agent names being logged such as Scooter/2.0 G.R.A.B. X2.0, or Slurp/2.0. However, the agent names listed in your log are not necessarily what you are expected to use in your robots.txt file.

The reason is very logical when you think about it. Names like Slurp/2.0 in a robots.txt are not very useful if the search engine updates their agent software and decides to start using Slurp/3.0 as their new name next month. Would it make sense to expect millions of Webmasters to know this and to all update their robots.txt files to the new name? Would they expect people to update the file EVERY TIME any search engine updated their agent version number and do it precisely when the name change occurred? It's not likely.

In reality, the name that needs to appear in the robots.txt file is whatever name the search engine spider is programmed to look for. Therefore, the best source of information for this name is not your log files but the help files on the search engine itself. In theory, a search engine could look for a wide variety of name variations. However, in general they will simply look for the lowest common denominator such as "Scooter" rather than "Scooter/2.0". If the search engine is smart they will allow you to use Scooter/2.0 too, but that is not guaranteed. Therefore, if you've already set up a robots.txt on your site, double-check the syntax and the agent names against the list below. The original standard recommends that spiders match these names without regard to case, but to be safe, type them exactly as shown.
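
The risk of putting a versioned name in the file can be shown with a parser that matches on the bare name. The sketch below uses Python's standard library parser, which drops the version from the visiting agent before comparing. Real spiders may behave differently, so treat it as an illustration of the principle, not a description of AltaVista. The helper name rule_applies and the page /page.html are invented for the example.

```python
from urllib.robotparser import RobotFileParser


def rule_applies(agent_in_file, visiting_agent):
    """Return True if a record named `agent_in_file` blocks `visiting_agent`.

    Builds a two-line robots.txt that disallows the whole site for
    `agent_in_file`, then asks Python's parser whether the visiting agent
    is affected. Other parsers may match names differently.
    """
    parser = RobotFileParser()
    parser.parse([f"User-agent: {agent_in_file}", "Disallow: /"])
    return not parser.can_fetch(visiting_agent, "/page.html")
```

A record named "Scooter" applies to visitors identifying as Scooter/2.0 and Scooter/3.0 alike. A record named "Scooter/2.0" does not apply to a visitor identifying as Scooter/3.0 under this parser, which is exactly the maintenance problem described above.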

Here are the User-Agent names that we have compiled. Most of these came directly from the search engines' own help files, or when not available, from other respected sources:

Search Engine: User-Agent
AltaVista: Scooter
Infoseek: Infoseek
Hotbot: Slurp
AOL: Slurp
Excite: ArchitextSpider
Google: Googlebot
Goto: Slurp
Lycos: Lycos
MSN: Slurp
Netscape: Googlebot
NorthernLight: Gulliver
WebCrawler: ArchitextSpider
Iwon: Slurp
Fast: Fast
DirectHit: Grabber
Yahoo Web Pages: Googlebot
Looksmart Web Pages: Slurp

You'll notice that many of the engines use the "Slurp" agent which is the Inktomi spider used on HotBot and other Inktomi related sites. Unfortunately, I'm not aware of a way you can exclude pages from the HotBot spider and not exclude them from all other Inktomi sites. As far as I can tell, they use the same spider to index the pages and thereby recognize only one User-agent string in the robots.txt file. (If I am wrong, please reply to this e-mail and let me know how this is done!)
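
To see at a glance which engines cannot be told apart in a robots.txt file, the list above can be grouped by agent name. The sketch below copies the table as it appears in this article. The variable and function names are made up for illustration, and the data is only as current as the list itself.

```python
from collections import defaultdict

# Engine -> User-agent, copied from the list in this article.
ENGINE_AGENTS = {
    "AltaVista": "Scooter",
    "Infoseek": "Infoseek",
    "Hotbot": "Slurp",
    "AOL": "Slurp",
    "Excite": "ArchitextSpider",
    "Google": "Googlebot",
    "Goto": "Slurp",
    "Lycos": "Lycos",
    "MSN": "Slurp",
    "Netscape": "Googlebot",
    "NorthernLight": "Gulliver",
    "WebCrawler": "ArchitextSpider",
    "Iwon": "Slurp",
    "Fast": "Fast",
    "DirectHit": "Grabber",
    "Yahoo Web Pages": "Googlebot",
    "Looksmart Web Pages": "Slurp",
}


def engines_by_agent(engine_agents):
    """Group engine names by the User-agent they share."""
    groups = defaultdict(list)
    for engine, agent in engine_agents.items():
        groups[agent].append(engine)
    return dict(groups)
```

Any agent with more than one engine in its group marks a set of engines that a single Disallow record will affect together. In this list that is Slurp with six engines, Googlebot with three, and ArchitextSpider with two.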

The individual Inktomi sites tend to rank the pages differently, although they will often be rather similar. Normally you can create a handful of pages that will rank well on most of the Inktomi powered sites, so the duplicated content issue does not normally become a big problem with Inktomi.

If you're now scratching your head on how this all comes together in relation to the doorway pages you created, check out the two detailed examples of a robots.txt file we've put together.

The first one shows how you can disallow INDIVIDUAL files:

http://www.marketposition.com/example/robots1.txt

The second example shows how you can group your doorway pages into DIRECTORIES and disallow the entire directory:

http://www.marketposition.com/example/robots2.txt

The advantage to method #1 is that it can be more flexible for working with a small number of files already on your site, and it "might" be a little safer. Some people believe that locating your doorways in the root directory rather than a subdirectory can give you a ranking advantage. The theory is that the search engines might discriminate against sites that don't have their own domain name, so pages submitted in a subdirectory could be perceived as sharing a domain with their host.

The disadvantage to method #1 is that if you have very many doorway pages then the size of your robots.txt file could be enormous. This runs the risk that a search engine might have problems with a robots.txt file that exceeds a certain reasonable size. It might also slow down the spider from accessing your site if it must read in an extremely large robots.txt file. Lastly, a robots.txt with a lot of entries in it could be a red-flag in itself to a search engine. This is all speculation, but it is enough that I would avoid excluding a lot of files individually if you don't have to.


Example method #2 is to organize your doorway pages into subdirectories for each search engine. The advantage to method #2 is that it is much easier to track your doorway pages if they are organized in separate subdirectories. In addition, the size of your robots.txt will be relatively small. You'll also not need to update the file every time you upload new doorway pages. Once the robots.txt is set up with method #2, all you have to do is upload it to the appropriate directory, submit and you're done!
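
As a rough sketch of method #2, a robots.txt of this kind can be generated from a list of spiders and the directory built for each. The code below is an illustration under assumed directory names. It is not a reproduction of the example file linked above, and the function name build_robots is made up. Each spider gets one record that disallows every doorway directory except its own.

```python
def build_robots(agent_dirs):
    """Build robots.txt text from a mapping of User-agent -> its own directory.

    Each spider is told to skip every directory in the mapping except the
    one built for it. Directory names are given without slashes.
    """
    records = []
    for agent in agent_dirs:
        lines = [f"User-agent: {agent}"]
        for other, directory in agent_dirs.items():
            if other != agent:
                lines.append(f"Disallow: /{directory}/")
        if len(lines) == 1:
            # A record needs at least one Disallow line; an empty value
            # means nothing is excluded for this spider.
            lines.append("Disallow:")
        records.append("\n".join(lines))
    # Records are separated by a blank line.
    return "\n\n".join(records) + "\n"
```

For example, build_robots({"Scooter": "altavista", "ArchitextSpider": "excite"}) produces one record telling Scooter to skip /excite/ and another telling ArchitextSpider to skip /altavista/. Spiders that are not named in the mapping are not restricted at all, so pages in those directories could still be indexed by engines you did not list.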

So do the engines discriminate against files in subdirectories? The consultants I talked to did not think so. Based on these conversations, if you properly design a hallway page in your ROOT directory that links to the doorway pages in your subdirectory, and submit that hallway page, then you will be fine. This demonstrates to the engine that the pages are most likely sub-pages of the main site. In addition, it would be dangerous for the search engines to penalize pages in subdirectories since most large Web sites must organize their pages into subdirectories to avoid complete chaos. As an added precaution, you could assign machine names to subdirectories as I mentioned earlier in this article. If you have any experience, comments, or observations on this issue, please let me know by replying to this e-mail.

My conclusion: If all your pages have good content and are fairly unique, don't worry about robots.txt files. If you focus only on optimizing existing pages on your site, don't worry about a robots.txt. If, however, you decide you need to experiment with more than a handful of pages that are rather similar, consider making use of the robots.txt file, particularly with AltaVista. Use example method #1 if you are only dealing with a small number of pages or special scenarios. Otherwise, organize your files into directories and use example method #2.

This article is copyrighted and has been reprinted with permission from FirstPlace Software, the makers of WebPosition Gold. FirstPlace Software helped define the SEO industry with the introduction of the first product to track your rankings on the major search engines and to help you improve those rankings. A free trial of WebPosition Gold is available from their Web site.




COPYRIGHT © 2002 EbizStartUps.com. All rights reserved.