
 



 

 

New Google patent proves "sandbox" exists

Information retrieval based on historical data
United States Patent Application 20050071741
March 31, 2005

 

I have read that this patent confirms the “sandbox theory”, which I feel is quite untrue. Using historical data as one of the variables in website scoring is new for Google’s algorithm, but it is not new as a method*. Remember that the patent looks really long, but lawyers like to cover everything, so there’s a lot of repetition.

 

“A system identifies a document and obtains one or more types of history data associated with the document. The system may generate a score for the document based, at least in part, on the one or more types of history data.”

 

Links:
This shows that documents are not subject to this before they have at least 1 link and a certain number of pages (now you have to find out how many; this easily stops large spam sites getting into the index). New sites entering the index will not be subject to these rules, as that would obviously be unfair. Without this, yes, you would be talking about very unfair treatment for new websites.

 

Claim 22 (23, 25), claim 23 (24), claim 1 (26, 31, 32), claim 6 (27) and claim 26 (28)
This means that links are still taken into account, but this time Google will also take into account how often they get updated. You can see how spammy sites can suffer from this. If many of the links to your document disappear, it might be useful to find out why. Is your site now unpopular due to something that changed? It is also important to consider the pattern of change in links over time. Weights get assigned according to how fresh these links are. If something on your site changes and suddenly lots of other sources link to it, in a news article or something, then it indicates that you’re producing interesting content. If a rush of links, all with the same or similar anchor text, suddenly appears, then it’s easier to establish why. The authority and the other factors behind the reliability and “interest score” of the sites linking to you have an effect on your own site, through the score assigned to the incoming links. If the links are old, a lot less weight is assigned. Anchor text also has to be related to your document. If the anchor text no longer corresponds to the content, then the weight changes (understandably).
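The patent does not publish a formula for any of this, so the following is only a rough sketch of how link age could reduce the weight of a link. The exponential decay, the `half_life_days` value and both function names are my own assumptions for illustration, not anything taken from the patent.

```python
from datetime import date
from typing import Iterable


def link_weight(first_seen: date, today: date, half_life_days: float = 180.0) -> float:
    """Hypothetical freshness weight: 1.0 for a brand-new link, halving every half_life_days."""
    # Links dated in the future are treated as brand new rather than given a weight above 1.
    age_days = max((today - first_seen).days, 0)
    return 0.5 ** (age_days / half_life_days)


def fresh_link_score(first_seen_dates: Iterable[date], today: date) -> float:
    """Sum the freshness weights of all inbound links (an assumed way to combine them)."""
    return sum(link_weight(first_seen, today) for first_seen in first_seen_dates)
```

With these assumed numbers, a link first seen 180 days ago counts for half as much as one found today, so a site whose links are all old scores lower than one that keeps attracting new ones.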

 

Content:
This is to assess how fresh the content is on the site. This prevents static or abandoned websites from climbing up the scoring system.

 

The frequency at which the content of the document changes:
This ensures that sites do not grow at an unreasonable (unnatural) rate during a certain time, compared to their normal rate of change and number of changes (a news site, for example, will be different to a personal website). It’s a good way to identify which sites share the same characteristics and which group they belong to.


Claim 8 (9), on changes to a document’s content, describes pretty standard ways of measuring new content and its availability over time. They will also look at parts of a document. Again, news sites for example will have just their news section changing, not the rest of their site: claim 8 (10).

 

Method 6 (11) basically tells you that they will be able to place a date on the last time content changed on your website. How do they propose to do that, do you think?

 

This tells you that as well as parts of your pages, they will also take into account the number of pages changed and assess it in the same way.

 

Relevancy and fresh content are important.

 

Document Division:
The importance of the different parts of a document is also taken into consideration. If you change your menu, it’s unlikely to be considered a valuable change, is it?
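To make the idea concrete, here is a small sketch of weighting a change by the part of the document it happens in. The section names and the weights in `SECTION_WEIGHTS` are made up for the example; the patent gives no such numbers.

```python
from typing import Mapping

# Hypothetical importance of each part of a page (assumed values, not from the patent).
SECTION_WEIGHTS = {"body": 1.0, "news": 0.8, "menu": 0.1, "footer": 0.05}


def weighted_change(changed_fraction: Mapping[str, float]) -> float:
    """Combine per-section change fractions (0.0 to 1.0) into a single change score."""
    # Sections that are not listed in SECTION_WEIGHTS contribute nothing.
    return sum(
        SECTION_WEIGHTS.get(section, 0.0) * fraction
        for section, fraction in changed_fraction.items()
    )
```

Under these assumed weights, rewriting the whole menu, `weighted_change({"menu": 1.0})`, scores 0.1, while changing half of the body, `weighted_change({"body": 0.5})`, scores 0.5.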

 

Inclusion in search results:
“Information relating to how often the document is selected when the document is included in a set of search results”

 

This is surely the most interesting claim, don’t you think? The history variable will include the number of times your documents appear in search results, and the number of times people choose to go there. Perhaps ranking becomes very important now.

 

You get more points the more popular your document proves to be.
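The simplest way to picture “how often the document is selected” is the share of appearances in search results that end in a click. The function below is only an illustration of that ratio; its name and the handling of zero appearances are my assumptions, and the patent does not say this is how the figure is calculated.

```python
def selection_rate(impressions: int, clicks: int) -> float:
    """Fraction of search-result appearances that led to a click (hypothetical measure)."""
    if impressions <= 0:
        # A document that has never been shown has no selection history yet.
        return 0.0
    return clicks / impressions
```

A document shown 200 times and clicked 30 times would get a rate of 0.15, and one that has never appeared in results simply has no rate to speak of.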

 

These claims ensure that you keep users interested by providing lots of new and interesting content on a regular basis, and a good reason for them to return. Syndication helps with this.

 

(Search terms and queries) This ensures that documents purporting to be associated with certain keywords, when in fact they are not, do not get included in results for those keywords in future. It ensures that websites stick to their subject matter, thus making classification a lot easier.

 

Stale documents:
This makes sure that abandoned websites are discarded. It’s a good and simple method for clean up.

Claim 19 (20, 21)


If the documents which are considered stale are in fact still popular, they are still considered useful and won’t be trashed.
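As a sketch of that rule, a document could be demoted only when it is both stale and unpopular. The thresholds `stale_after_days` and `popular_rate`, and the function name, are invented for the example and do not come from the patent.

```python
def should_demote(
    days_since_update: int,
    click_rate: float,
    stale_after_days: int = 365,
    popular_rate: float = 0.05,
) -> bool:
    """Demote a document only if it is stale AND users no longer select it (assumed rule)."""
    is_stale = days_since_update > stale_after_days
    is_popular = click_rate >= popular_rate
    return is_stale and not is_popular
```

So an old reference page that people still click on is left alone, while an abandoned page nobody selects is the one that drops.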

 

Traffic
Claim 1 (34, 35)
This means that traffic patterns will also get analysed. The traffic to your document gets taken into consideration, as well as how much time people spend on your site. If they leave straight away for the most part, then ouch for you.

 

Domains:
They check domains to establish whether a domain is legitimate and whether it is still current. The domain name server is used to check this (ICANN?).

 

Ranking:
Prior ranking is taken into account also. The pattern of your ranking is also considered important, as are the search terms used to access your document. If users can contribute to your content, then it’s considered that they found it interesting enough. Users bookmarking your site or putting it in their favourites, and its presence in temp files and cache files, indicate that your site is considered good by users.

 

Penalizations include:
Short-lived links
Link sources being stale (stale sites would be penalized, remember; there is a boost if the opposite is true)
Link churn above a certain threshold
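To picture the link churn penalty in the list above, here is one possible way of measuring churn: compare the links on a linking page in two snapshots and see what share of them was added or removed. This definition, the 0.5 threshold and the function names are my assumptions for illustration; the patent does not spell out a formula.

```python
from typing import AbstractSet


def link_churn(before: AbstractSet[str], after: AbstractSet[str]) -> float:
    """Share of links that were added or removed between two snapshots (0.0 to 1.0)."""
    all_links = before | after
    if not all_links:
        return 0.0
    changed = before ^ after  # links present in only one of the two snapshots
    return len(changed) / len(all_links)


def churn_exceeds_threshold(
    before: AbstractSet[str], after: AbstractSet[str], threshold: float = 0.5
) -> bool:
    """True when churn is above the (assumed) threshold that would trigger a penalty."""
    return link_churn(before, after) > threshold
```

For example, a page that linked to a, b, c and d and now links to a, b and e has three changed links out of five seen in total, a churn of 0.6, which is above the assumed threshold.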


I fail to see where the “sandbox” is evidenced. Maybe people see this and believe that using historical data means that new sites with no history will be caught by it. The patent does clearly say that sites won’t be part of this until there is a certain amount of data to work on. By the time there is enough, your site will have been around for a little while, if you are doing the right things.

 

Also, none of this is implemented yet, is it? So how does this explain the “sandbox” and the amount of time people believe it has been around?

 

I think it’s a great patent, well done Google. You still rock. I knew you were going to throw something up. This definitely makes them much more likely to provide great results. Let’s still keep in mind that MSN search is a baby just out of beta though.

 

I also wanted to add that everyone has immediately assumed that the patent and the methods presented in it refer explicitly to web search. Digital libraries use these methods, and it seems really probable to me that Google may use them for Google Scholar.

 

* Web Structure, Age and Page Quality - Baeza-Yates, Saint-Jean, Castillo (2002)

"This paper is aimed at the study of quantitative measures of the relation between Web structure, age, and quality of Web pages. Quality is studied from different link-based metrics and their relationship with the structure of the Web and the last modification time of a page"

"as expected PageRank is biased against new pages..."

"we also gathered time information (last modified date) for each page as informed by the web servers."... "here we focus on web page age, that is, the time elapsed after the last modification. As the web is young, we use months as a unit, and our study considers only the last 3 years as most websites are that young."

"the low correlation between PageRank and authority is surprising because both ranks are based on incoming links..."

"Notice the correlation between hub/authority, which is relatively low but with a higher value for pages about 8 months old..."

 

Source: Search-Science - Blog MSN

Date: April 2005

 

 

 
