About GazoPabot: GazoPa crawler

1. Why are you crawling my site?

GazoPabot is GazoPa's web image indexing robot. The crawler collects images from the Web to build a searchable index for search services that use GazoPa. It discovers these images by crawling the web pages that contain them.

As part of the crawling effort, GazoPabot honors robots.txt so that it does not crawl or index pages whose content you do not want included in GazoPa Search. If robots.txt disallows a page, GazoPa does not read or use the contents of that page.

Note: The URL of a disallowed page may be included in GazoPa as a "thin" document with no image content. Links and reference text from other public web pages may provide identifiable information about a URL and may be indexed as part of web search coverage.

2. How do I prevent my site or certain subdirectories from being crawled?

GazoPabot obeys the Robots Exclusion Standard (RES), specifically the 1996 version.

GazoPabot obeys the first entry in the robots.txt file whose User-agent contains "GazoPabot".

  • If there is no such entry, it obeys the first entry with a User-agent of "*".
  • If it cannot retrieve a robots.txt file, it assumes there are no restrictions for GazoPabot. It keeps trying to retrieve the file and obeys it once it becomes available.
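
The selection rule above can be sketched in Python. This is only an illustration, not GazoPabot's actual code: the `select_record` function and the dictionary layout of a parsed entry are hypothetical, and the case-insensitive comparison is an assumption the text does not confirm.

```python
from typing import Dict, List, Optional


def select_record(records: List[Dict], bot_name: str = "GazoPabot") -> Optional[Dict]:
    """Return the robots.txt entry to obey, or None if no entry applies.

    Each entry is assumed (hypothetically) to look like:
        {"agents": ["GazoPabot"], "rules": [("Disallow", "/cgi-bin/")]}
    """
    # First choice: the first entry whose User-agent contains the bot name.
    # Comparing case-insensitively is an assumption, not documented behavior.
    for record in records:
        if any(bot_name.lower() in agent.lower() for agent in record["agents"]):
            return record
    # Fallback: the first entry addressed to every robot.
    for record in records:
        if "*" in record["agents"]:
            return record
    # No entry applies (or no robots.txt was retrieved): no restrictions.
    return None
```

Returning `None` stands for the "no restrictions" case, which also covers a robots.txt file that could not be retrieved.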

Disallowed documents, including slash "/" (the home page of the site), are not crawled, nor are links in those documents followed. GazoPabot does read the home page at each site and uses it internally, but if the home page is disallowed, it is neither indexed nor followed. If robots.txt disallows a page, GazoPa will not read or use the contents of that page.

Note: The URL of a disallowed page might be included in GazoPa Search as a "thin" document with no text content. Links and reference text from other public web pages may provide identifiable information about a URL and may be included as part of web search coverage.

Example robots.txt:
User-agent: GazoPabot
Disallow: /cgi-bin/
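
To check what a simple rule like this blocks, one option is Python's standard `urllib.robotparser` module. It is a stand-in for illustration, not GazoPabot's own parser, and it does not implement the '*' and '$' symbols. The `is_allowed` helper and the example.com URLs are assumptions made for this sketch.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from above.
ROBOTS_TXT = """\
User-agent: GazoPabot
Disallow: /cgi-bin/
"""


def is_allowed(url: str, agent: str = "GazoPabot") -> bool:
    """Report whether `agent` may fetch `url` under ROBOTS_TXT."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


print(is_allowed("http://example.com/cgi-bin/search"))
print(is_allowed("http://example.com/images/cat.jpg"))
```

The first URL falls under /cgi-bin/ and is refused; the second is not covered by any Disallow line and is permitted.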

Additional Symbols

Additional symbols allowed in the robots.txt directives include:

'*' - matches a sequence of characters
'$' - anchors at the end of the URL string

Using Wildcard Match: '*'

A '*' in robots directives is used to wildcard match a sequence of characters in your URL. You can use this symbol in any part of the URL string that you provide in the robots directive.

Example of '*':
User-agent: GazoPabot
Allow: /public*/
Disallow: /*_print*.html
Disallow: /*?sessionid

The robots directives above:

  1. Allow crawling of all directories whose names begin with "public".
     Example: /public_html/ or /public_graphs/
  2. Disallow crawling of URLs whose path contains "_print" followed by ".html".
     Example: /card_print.html or /store_print/product.html
  3. Disallow crawling of files with "?sessionid" in their URL string.
     Example: /cart.php?sessionid=342bca31
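
The wildcard behavior described above can be approximated by translating a directive path into a regular expression. This is a minimal sketch under the assumption that each directive is a prefix match with '*' standing for any run of characters; `wildcard_to_regex` and `matches` are hypothetical names, and the '$' symbol is deliberately not handled here.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path containing '*' into a compiled regex."""
    # Escape every literal piece, then join the pieces with '.*',
    # so each '*' matches any sequence of characters (including none).
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body)


def matches(pattern: str, url_path: str) -> bool:
    """Return True if url_path starts with something matching pattern."""
    # re.match anchors only at the start, which gives prefix semantics.
    return wildcard_to_regex(pattern).match(url_path) is not None


print(matches("/public*/", "/public_html/"))
print(matches("/*_print*.html", "/store_print/product.html"))
print(matches("/*?sessionid", "/cart.php?sessionid=342bca31"))
```

Because the match is a prefix match, a trailing '*' changes nothing in this sketch: "/private*" and "/private" accept exactly the same URLs.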

Note: A trailing '*' is not needed, since GazoPabot already treats every directive as a prefix match.

In the example below, both 'Disallow' directives are equivalent:

User-agent: GazoPabot
Disallow: /private*
Disallow: /private
    

Using '$'

A '$' in robots directives anchors the match to the end of the URL string. Without this symbol, GazoPabot treats each directive as a prefix and matches any URL that begins with it.

Example of '$':
User-agent: GazoPabot
Disallow: /*.gif$
Allow: /*?$
    

The robots directives above:

  1. Disallow all files ending in '.gif' in your entire site.
     Note: Omitting the '$' would disallow all files containing '.gif' in their file path.
  2. Allow all files ending in '?' to be included. This would not allow files that just contain '?' somewhere in the URL string.
     Note: The '$' symbol only makes sense at the end of the string. Hence, when GazoPabot encounters a '$' symbol, it assumes the directive terminates there and ignores any characters after that symbol.
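
The '$' rules above, including the note that characters after '$' are ignored, can be sketched as follows. This is an illustrative approximation rather than GazoPabot's implementation; `compile_directive` and `directive_matches` are hypothetical names.

```python
import re


def compile_directive(pattern: str) -> re.Pattern:
    """Compile a robots.txt path that may contain '*' and '$'."""
    anchored = "$" in pattern
    if anchored:
        # The directive is taken to end at the first '$';
        # anything after it is dropped.
        pattern = pattern[: pattern.index("$")]
    # Each '*' becomes '.*'; all other characters are matched literally.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    # \Z pins the match to the end of the URL string. Without it the
    # directive remains a plain prefix match.
    return re.compile(body + (r"\Z" if anchored else ""))


def directive_matches(pattern: str, url_path: str) -> bool:
    """Return True if url_path is covered by the directive pattern."""
    return compile_directive(pattern).match(url_path) is not None


print(directive_matches("/*.gif$", "/images/logo.gif"))
print(directive_matches("/*.gif$", "/images/logo.gif?size=2"))
print(directive_matches("/*.gif", "/images/logo.gif?size=2"))
```

With the '$', only URLs that end in '.gif' match; without it, any URL whose path contains '.gif' matches.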

Using Allow:

The 'Allow' directive is supported, as shown in the examples above.