Tuesday, January 22, 2008

10 ways to increase pages indexed

For a while now webmasters have fretted over why all of the pages of their website are not indexed. As usual there doesn't seem to be any definite answer. But some things are definite, if not automatic, and some things seem like pretty darn good guesses.


Editor's Note: If you know a good way to increase the number, or are certain of (or can guess at) a way to get all of a website's pages crawled, then please join the conversation by letting us know in the comments section.

So, we scoured the forums, blogs, and Google's own guidelines for increasing the number of pages Google indexes, and came up with our (and the community's) best guesses. The running consensus is that a webmaster shouldn't expect to get all of their pages crawled and indexed, but there are ways to increase the number.

PageRank

It depends a lot on PageRank. The higher your PageRank, the more of your pages will be indexed. PageRank isn't a blanket number for all your pages; each page has its own PageRank. A high PageRank gives the Googlebot more of a reason to return. Matt Cutts confirms, too, that a higher PageRank means a deeper crawl.
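
For the curious, PageRank is computed per page from the link graph. Here is a minimal sketch of the classic power-iteration idea, just to illustrate why each page carries its own score; the tiny link graph and damping factor are made up, and this is not Google's actual implementation:

    # Toy PageRank by power iteration -- illustrative only, not Google's real algorithm.
    # The link graph and damping factor below are invented for this example.
    links = {
        "home": ["about", "products"],
        "about": ["home"],
        "products": ["home", "about"],
    }

    damping = 0.85
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}

    for _ in range(50):  # iterate until the scores settle
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank

    print(rank)  # each page ends up with its own score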

Links

Give the Googlebot something to follow. Links (especially deep links) from a high PageRank site are golden as the trust is already established.

Internal links can help, too. Link to important pages from your homepage. On content pages link to relevant content on other pages.

Sitemap

A lot of buzz around this one. Some report that a clear, well-structured Sitemap helped get all of their pages indexed. Google's Webmaster guidelines recommend submitting a Sitemap file, too:

· Tell us all about your pages by submitting a Sitemap file; help us learn which pages are most important to you and how often those pages change.

That page has other advice for improving crawlability, like fixing violations and validating robots.txt.

Some recommend having a Sitemap for every category or section of a site.
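
For illustration, a Sitemap is just an XML list of URLs in the sitemaps.org format. The short sketch below writes one with Python's standard library; the domain and page paths are placeholders, not a real site:

    # Sketch: write a minimal sitemap.xml for a handful of pages.
    # www.example.com and the paths are placeholders.
    from xml.sax.saxutils import escape

    pages = ["/", "/about", "/products", "/blog/latest-post"]

    entries = "\n".join(
        "  <url><loc>http://www.example.com%s</loc><changefreq>weekly</changefreq></url>" % escape(p)
        for p in pages
    )

    sitemap = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + entries + "\n</urlset>\n"
    )

    with open("sitemap.xml", "w") as f:
        f.write(sitemap)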

Speed

A recent O'Reilly report indicated that page load time and the ease with which the Googlebot can crawl a page may affect how many pages are indexed. The logic is that the faster the Googlebot can crawl, the greater the number of pages that can be indexed.

This could involve simplifying the structures and/or navigation of the site. The spiders have difficulty with Flash and Ajax. A text version should be added in those instances.
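
A rough way to get a feel for this is simply to time a plain HTTP fetch of a page, which approximates what a crawler has to wait for. A quick sketch; the URL is a placeholder:

    # Rough check of page fetch time -- a stand-in for "how quickly can a crawler pull this page?"
    # The URL is a placeholder.
    import time
    import urllib.request

    url = "http://www.example.com/"
    start = time.time()
    html = urllib.request.urlopen(url).read()
    elapsed = time.time() - start
    print("%s: %d bytes in %.2f seconds" % (url, len(html), elapsed))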


Google's crawl caching proxy

Matt Cutts provides diagrams of how Google's crawl caching proxy works at his blog. The proxy was part of the Big Daddy update to make the engine faster. Any one of the indexes may crawl a site and send the information to a remote server; the remaining indexes (like the blog index or the AdSense index) then read from that server instead of their bots physically visiting your site. They all use the mirror instead.
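
In rough terms: whichever service fetches a page first leaves a cached copy that the others can reuse. Here is a toy sketch of that pattern, not Google's implementation; the class and URL are invented for illustration:

    # Toy crawl cache: the first service to ask for a URL triggers a real fetch,
    # later services reuse the stored copy instead of hitting the site again.
    # fetch_from_site() is a stand-in for an actual crawler request.
    class CrawlCache:
        def __init__(self):
            self.store = {}

        def fetch_from_site(self, url):
            print("actually crawling", url)
            return "<html>contents of %s</html>" % url

        def get(self, url):
            if url not in self.store:
                self.store[url] = self.fetch_from_site(url)
            return self.store[url]

    cache = CrawlCache()
    for index in ["web index", "blog index", "AdSense"]:
        page = cache.get("http://www.example.com/")  # only the first call crawls the site
        print(index, "got", len(page), "bytes")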

Verify

Verify the site with Google using the Webmaster tools.

Content, content, content

Make sure content is original. If a page is a verbatim copy of another page, the Googlebot may skip it. Update frequently; this will keep the content fresh. Pages with an older timestamp might be viewed as static, outdated, or already indexed.

Staggered launch

Launching a huge number of pages at once could send off spam signals. In one forum, it is suggested that a webmaster launch a maximum of 5,000 pages per week.

Size matters

If you want tens of millions of pages indexed, your site will probably have to be on an Amazon.com or Microsoft.com level.

Know how your site is found, and tell Google

Find the top queries that lead to your site and remember that anchor text helps in links. Use Google's tools to see which of your pages are indexed, and if there are violations of some kind. Specify your preferred domain so Google knows what to index.

Thursday, January 17, 2008

Yahoo finds fault with Google's secret sauce


As complex as Google's PageRank may be, search experts at Yahoo seem to think it's not complex enough. Based on patent filings, Yahoo is dabbling in ranking algorithms that incorporate more user behavior data in advance of the company's next run at toppling Google's hallowed relevance.


Editor's Note: Yahoo, as usual, is fairly confident in its ability to create a search algorithm that meets or exceeds the quality of Google's. Lots of search players have felt the same and have yet to deliver. Do you think Yahoo will ever catch Google, or is it just too late to take down the Mountain View Monolith? Let us know in the comments section.

Seeing will be believing when it happens, of course, as Google is highly secretive about how its search engine calculates PageRank. If history is any indication, they're already way ahead on behavioral factoring.

Nonetheless, Yahoo can afford the best search engineers in the business (if they can get them before Google does, anyway) and the patent filings shed some light on how PageRank is currently calculated and ways it might be improved in the future.

Bill Slawski, Director of Search Marketing at KeyRelevance, goes into painstaking detail of Yahoo's user data challenges at his SEObytheSea blog. Patent language, especially when dealing with algorithms, can be confusing and dense, so we'll just highlight a few interesting points and leave the lexicographical deciphering to you.

Some Yahoo assumptions about PageRank and flaws associated:
  • Internal and external links are often weighed equally even though internal links can be less reliable and more self-promotional. Some links, like disclaimer links, are rarely followed.
  • PageRank ignores that webpages are often purchased and repurposed, and that they decay or become less valuable over time at variable rates.
  • Current calculations, like TrustRank, are engineered more to combat webspam than to reflect actual user behavior.
  • Sometimes PageRank deals with links in bulk, aggregating according to host or domain, also known as blocked PageRank.

What Yahoo plans to do about it:
  • Measure link weight – influenced by the frequency with which users actually follow a link (see the sketch after this list)
  • Note when links are ignored and users leave (teleport) to another page of their choosing
  • Calculate the probability that a user stops and reads a webpage rather than views it and moves on
  • Incorporate user data into the algorithm – User Sensitive PageRank could reflect "the navigational behavior of the user population with regard to documents, pages, sites, and domains visited, and links selected"
  • Personalize PageRank based on demographic information (age, gender, income, user location)
  • Emphasize recent information
  • Weigh anchor text more heavily – the patent filing calls anchor text "one of the most useful features used in ranking retrieved Web search results"
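
To make that concrete, here is a small sketch of how observed click-through counts could skew a PageRank-style random walk, covering the link-weight and teleport ideas above. The graph, click counts, and damping value are invented, and the patent filings do not spell out a formula; this is only a guess at the general shape:

    # Sketch: PageRank-style scores where each outgoing link is weighted by how often
    # users were observed to follow it, plus a "teleport" term for users who leave.
    # The graph, click counts, and damping value are invented for illustration.
    clicks = {
        "home":       {"products": 80, "disclaimer": 2},
        "products":   {"home": 30, "checkout": 50},
        "checkout":   {"home": 10},
        "disclaimer": {},
    }

    damping = 0.85
    pages = list(clicks)
    rank = {p: 1.0 / len(pages) for p in pages}

    for _ in range(50):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}  # teleport share
        for page, outlinks in clicks.items():
            total = sum(outlinks.values())
            if total == 0:
                # no followed links recorded: treat the visitor as teleporting anywhere
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
                continue
            for target, count in outlinks.items():
                # links users actually follow carry more of the page's weight
                new_rank[target] += damping * rank[page] * count / total
        rank = new_rank

    print(sorted(rank.items(), key=lambda kv: -kv[1]))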

Sunday, January 13, 2008

January Newsflash

  • Python has been declared programming language of 2007. It was a close finish, but in the end Python appeared to have the largest increase in ratings in one year's time (2.04%). There is no clear reason why Python made this huge jump in 2007. Last month Python surpassed Perl for the first time in history, which is an indication that Python has become the "de facto" glue language at system level. It is especially beloved by system administrators and build managers. Chances are high that Python's star will rise further in 2008, thanks to the upcoming release of Python 3.

  • A couple of interesting trends can be derived from the 2007 data. First of all, languages without automated garbage collection are losing ground rapidly. The most prominent examples of languages with explicit memory management, C and C++, both lost about 2% in one year. Another trend is that the battle between scripting languages seems to be going on in the background. There is a continuous flow of new scripting languages. In 2006, Ruby entered the main scene, followed this year by Lua. In the top 50, Groovy and Factor are new kids on the block. None of these new scripting languages seems to stay permanently; each is simply replaced by a successor.

  • What were the big movers and shakers in 2007? The big winners are Lua (from 46 to 16), Groovy (from 66 to 31), Focus (from 78 to 41), and Factor (new at 45). The most prominent shakers are ABAP (from 15 to 29) and IDL (from 23 to 48).

  • What is to be expected in 2008? And, what became of the forecasts for 2007? At the beginning of 2007, I thought C# and D would become the winners and Perl and Delphi the losers. C# was indeed one of the big winners, and Perl one of the big losers. But the forecasts for D and Delphi were completely wrong. There has been no breakthrough for D. On the other hand, Delphi reclaimed a top 10 position... What about 2008? C, C++ and Perl will continue to fall. C and C++ because they have no automated garbage collection. C++ will get an extra push down because Microsoft is not actively supporting the language anymore. Perl is just dead. Java and C# will eventually be the two most popular languages, so I expect them to rise further in 2008. What new languages will enter the top 20 in 2008 is a wild guess, but I think ActionScript and Groovy are really serious candidates.

  • Nguyen Quang Chien suggested renaming the OCaml entry to Caml. This has been done. Thanks Nguyen!

  • In the tables below, some long-term trends for categories of languages are listed. The tables show that dynamically typed object-oriented languages are still becoming more popular.

    Category                     Ratings January 2008   Delta January 2007
    Object-Oriented Languages    56.1%                  +4.0%
    Procedural Languages         40.9%                  -3.6%
    Functional Languages          1.9%                  +0.2%
    Logical Languages             1.1%                  -0.6%


    Category                       Ratings January 2008   Delta January 2007
    Statically Typed Languages     56.2%                  -1.5%
    Dynamically Typed Languages    43.8%                  +1.5%

TIOBE declares Python as programming language of 2007!!

The TIOBE Programming Community index gives an indication of the popularity of programming languages. The index is updated once a month. The ratings are based on the world-wide availability of skilled engineers, courses and third party vendors. The popular search engines Google, MSN, Yahoo!, and YouTube are used to calculate the ratings. Observe that the TIOBE index is not about the best programming language or the language in which most lines of code have been written.
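
TIOBE's exact formula isn't given here, but the general idea of turning search-engine hit counts into a normalized rating can be sketched roughly as below. The hit counts and the query template are invented for illustration and are not TIOBE's actual data or method:

    # Sketch of a popularity index: count search hits for '<language> programming'
    # across a few engines and normalize to percentages.
    # The hit counts are invented; this is not TIOBE's actual formula or data.
    hits = {
        "Java":   {"Google": 21000000, "Yahoo!": 18000000},
        "C":      {"Google": 14000000, "Yahoo!": 12500000},
        "Python": {"Google":  5500000, "Yahoo!":  5000000},
    }

    totals = {lang: sum(per_engine.values()) for lang, per_engine in hits.items()}
    grand_total = sum(totals.values())

    for lang, total in sorted(totals.items(), key=lambda kv: -kv[1]):
        print("%-8s %6.3f%%" % (lang, 100.0 * total / grand_total))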

The index can be used to check whether your programming skills are still up to date or to make a strategic decision about what programming language should be adopted when starting to build a new software system. The definition of the TIOBE index can be found here.

Position Jan 2008   Position Jan 2007   Programming Language   Ratings Jan 2008   Delta Jan 2007   Status
1 1 Java 20.849% +1.69% A
2 2 C 13.916% -1.89% A
3 4 (Visual) Basic 10.963% +1.84% A
4 5 PHP 9.195% +1.25% A
5 3 C++ 8.730% -1.70% A
6 8 Python 5.538% +2.04% A
7 6 Perl 5.247% -0.99% A
8 7 C# 4.856% +1.34% A
9 12 Delphi 3.335% +1.00% A
10 9 JavaScript 3.203% +0.36% A
11 10 Ruby 2.345% -0.17% A
12 13 PL/SQL 1.230% -0.34% A
13 11 SAS 1.204% -1.14% A
14 14 D 1.172% -0.16% A
15 18 COBOL 0.932% +0.30% A
16 46 Lua 0.579% +0.48% A--
17 22 FoxPro/xBase 0.506% +0.05% B
18 19 Pascal 0.456% -0.11% B
19 16 Lisp/Scheme 0.413% -0.26% A--
20 27 Logo 0.386% +0.07% B