
Search Engines vs. SEO Spam: Statistical Methods

By: Oleg Ishenko
Date: Dec 1st 2006

High placement in a search engine is critical for the success of any online business. Pages that appear higher in the results for queries relevant to a site's business get more targeted traffic. To gain this competitive advantage, Internet companies employ various SEO techniques to optimize the factors that search engines use to rank results.

In the best case, SEO specialists create relevant, well-structured, keyword-rich pages, which not only please the eye of a search engine crawler but also have value to the human visitor. Unfortunately, it takes months for this strategic approach to produce tangible results, and many search engine optimizers resort to so-called "black-hat" SEO instead.

'Black Hat' SEO and Search Engine Spam

The oldest and simplest "black-hat" SEO strategy is to add a variety of popular keywords to web pages so that they rank high for popular queries. This behavior is easily detected, since such pages generally include unrelated keywords and lack topical focus. With the introduction of term vector analysis, search engines became immune to this sort of manipulation. However, "black-hat" SEO went one step further and created the so-called "doorway" pages: tightly focused pages consisting of a bunch of keywords relevant to a single topic. In terms of keyword density such pages are able to rank high in search results, but they are never seen by human visitors, who are redirected to the page intended to receive the traffic.

Another trend is the abuse of link-popularity-based ranking algorithms, such as PageRank, with the help of dynamically generated pages. Each such page receives the minimum guaranteed PageRank, and the small endorsements from thousands of these pages add up to a sizeable PageRank for the target page. Search engines constantly improve their algorithms to minimize the effect of "black-hat" SEO techniques, but SEOs persistently respond with new, more sophisticated and technically advanced tricks, so the process resembles an arms race.

"Black-hat" SEO is responsible for an immense amount of search engine spam: pages and links created solely to mislead search engines and boost rankings for client web sites. To weed out web spam, search engines can use statistical methods that compute distributions for a variety of page properties. The outlier values in these distributions can be associated with web spam. The ability to identify web spam is extremely valuable to search engines, not just because it allows them to exclude spam pages from their indices but also because the identified pages can be used to train more sophisticated machine learning algorithms capable of battling web spam with higher precision.

Using Statistics to Detect Search Engine Spam

An example of the application of statistical methods to web spam detection is presented in the paper "Spam, Damn Spam, and Statistics" by Dennis Fetterly, Mark Manasse and Marc Najork from Microsoft [1]. They used two sets of pages downloaded from the Internet. The first set (Set 1) was crawled repeatedly from November 2002 to February 2003 and consisted of 150 million URLs. For each page the researchers recorded the HTTP status, the time of download, the document length, the number of non-markup words, and a vector indicating the changes in page content between downloads. A sample of this set (751 pages) was inspected manually and 61 spam pages were discovered, or 8.1% of the sample, with a margin of error of 1.95% at 95% confidence.

The second set (Set 2) was crawled between July and September 2002 and comprised 429 million pages and 38 million HTTP redirects. For this set the following properties were recorded: the URL and the URLs of outgoing links for each page, and the source and target URL for each HTTP redirect. 535 pages were manually inspected and 37 of them were identified as spam (6.9%).
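The sampling figures above can be checked with a few lines of arithmetic. The sketch below assumes the margin of error was obtained with the usual normal approximation to the binomial (z = 1.96 for 95% confidence); the article does not say which method the researchers used, and the function name is only illustrative.

```python
import math

def spam_fraction_with_margin(spam_count, sample_size, z=1.96):
    """Return (fraction, margin of error) for a manually inspected sample.

    Assumption: normal approximation to the binomial, z = 1.96 for 95%.
    """
    p = spam_count / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, margin

# Sample sizes and spam counts as quoted in the text above.
p1, m1 = spam_fraction_with_margin(61, 751)
p2, m2 = spam_fraction_with_margin(37, 535)
print(f"Set 1 sample: {p1:.1%} spam, margin {m1:.2%}")
print(f"Set 2 sample: {p2:.1%} spam, margin {m2:.2%}")
```

Under that assumption the first sample gives 8.1% with a margin of about 1.95%, which matches the quoted figures.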

The research concentrates on studying the following properties of web pages:

- URL properties, including length and percentage of non-alphabetical characters (dashes, digits, dots etc.).

- Host name resolutions.

- Linkage properties.

- Content properties.

- Content evolution properties.

- Clustering properties.

URL Properties

Search engine optimizers often use numerous automatically generated pages to funnel their individually low PageRank to a single target page. Since the pages are machine generated, we can expect their URLs to look different from those created by humans. The assumption is that these URLs are longer and include more non-alphabetical characters such as dashes, slashes or digits. When searching for spam pages we should consider the host component only, not the entire URL down to the page name.

Manual inspection of the 100 longest host names revealed that 80 of them belonged to adult sites and 11 referred to financial and credit-related sites. Therefore, in order to produce a spam identification rule, the length property has to be combined with the percentage of non-alphabetical characters. In the given set, 0.173% of URLs are at least 45 characters long and contain at least 6 dots, 5 dashes or 10 digits, and the vast majority of these pages appear to be spam. By changing the threshold values we can change the number of pages flagged as spam and the number of false positives.
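A rule of this kind is easy to express in code. The following is only a sketch: it assumes the thresholds quoted above are applied to the host component of the URL (as the previous paragraph suggests), and the function name and example URLs are hypothetical.

```python
from urllib.parse import urlsplit

def looks_machine_generated(url, min_length=45, min_dots=6,
                            min_dashes=5, min_digits=10):
    """Flag a URL whose host name is long and rich in non-alphabetical characters.

    Assumption: the rule is evaluated on the host component only.
    """
    host = urlsplit(url).hostname or ""
    if len(host) < min_length:
        return False
    dots = host.count(".")
    dashes = host.count("-")
    digits = sum(ch.isdigit() for ch in host)
    return dots >= min_dots or dashes >= min_dashes or digits >= min_digits

# Hypothetical URLs for illustration only.
for url in ("http://www.example.com/page.html",
            "http://cheap-loans-best-rates-credit-cards-online-now.example.com/"):
    print(url, looks_machine_generated(url))
```

Raising or lowering the keyword arguments corresponds to moving the thresholds mentioned above, trading flagged spam against false positives.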

Host Name Resolutions

One can notice that Google, given a query q, tends to rank a page higher if the host component of the page's URL contains keywords from q. To exploit this, search engine optimizers stuff pages with URLs containing popular keywords and key phrases and set up DNS servers to resolve these host names to a single IP. Generally, SEOs generate a large number of host names in order to rank for a wide variety of popular queries.

This behavior can also be detected relatively easily by observing the number of host names that resolve to a single IP. In Set 2, 1,864,807 IP addresses are mapped to only one host name, and 599,632 IPs to 2 host names. There are also some extreme cases with hundreds of thousands of host names mapped to a single IP, and the record-breaking IP is referred to by 8,967,154 host names.

To flag pages as spam, a threshold of 10,000 name resolutions was chosen. About 3.46% of the pages in Set 2 are served from IP addresses referred to by 10,000 or more host names, and manual inspection of a sample of them showed that, with very few exceptions, they were spam. A lower threshold (1,000 name resolutions, or 7.08% of the pages in the set) produces an unacceptable number of false positives.
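Counting host names per IP address is a simple grouping operation. The sketch below assumes the crawler has already produced (host name, IP) pairs; the function name and the sample data are hypothetical, and the demo uses a tiny threshold so that the effect is visible.

```python
from collections import defaultdict

def ips_over_threshold(resolutions, threshold=10_000):
    """Return {ip: number of distinct host names} for IPs at or above the threshold.

    `resolutions` is an iterable of (host_name, ip) pairs (assumed input format).
    """
    hosts_by_ip = defaultdict(set)
    for host, ip in resolutions:
        hosts_by_ip[ip].add(host.lower())
    return {ip: len(hosts) for ip, hosts in hosts_by_ip.items()
            if len(hosts) >= threshold}

# Hypothetical data: three host names share one address.
sample = [
    ("cheap-flights.example", "192.0.2.1"),
    ("cheap-hotels.example", "192.0.2.1"),
    ("cheap-cars.example", "192.0.2.1"),
    ("www.example.org", "192.0.2.2"),
]
print(ips_over_threshold(sample, threshold=3))
```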

Linkage Properties

The Web, consisting of interlinked pages, has the structure of a graph. In graph terminology the number of outgoing links of a page is referred to as its out-degree, while its in-degree is the number of links pointing to the page. By analyzing the out-degree and in-degree values it is also possible to detect spam pages, which show up as outliers in the corresponding distributions.

In Set 2, for example, there are 158,290 pages with an out-degree of 1301, while according to the overall trend only about 1,700 such pages are expected. Overall, 0.05% of the pages in Set 2 have out-degrees that are at least three times more common than the Zipfian distribution would suggest, and according to the manual inspection of a cross section, almost all of them are spam.

The distribution of in-degrees is calculated similarly. For example, 369,457 pages have an in-degree of 1001, while according to the trend only about 2,000 such pages are expected. Overall, 0.19% of the pages in Set 2 have in-degrees that are at least three times more common than the Zipfian distribution would suggest, and the majority of them are spam.
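One way to look for such outliers is to fit a straight line to the degree histogram on a log-log scale and flag degree values whose observed page count is at least three times the fitted count. This is a minimal sketch under that assumption: the article does not describe how the researchers fitted the trend, and the function names and the synthetic data are hypothetical.

```python
import math
from collections import Counter

def fit_power_law(histogram):
    """Least-squares fit of log(count) = intercept + slope * log(degree)."""
    xs = [math.log(degree) for degree in histogram]
    ys = [math.log(count) for count in histogram.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def overrepresented_degrees(degrees, factor=3.0):
    """Return {degree: (observed, expected)} where observed >= factor * expected."""
    histogram = Counter(d for d in degrees if d > 0)
    slope, intercept = fit_power_law(histogram)
    flagged = {}
    for degree, observed in histogram.items():
        expected = math.exp(intercept + slope * math.log(degree))
        if observed >= factor * expected:
            flagged[degree] = (observed, expected)
    return flagged

# Synthetic data: counts fall off as 1/degree^2, plus a spike at degree 40.
degrees = []
for degree in range(1, 51):
    degrees.extend([degree] * round(10000 / degree ** 2))
degrees.extend([40] * 500)
print(sorted(overrepresented_degrees(degrees)))
```

Note that the spike itself slightly distorts the fit; a production system would probably use a more robust estimate of the trend.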

Content Properties

Despite the recent measures taken by search engines to diminish the effect of keyword stuffing, this technique is still used by some SEOs, who generate pages filled with meaningless keywords to promote their AdSense pages. Quite often such pages are based on a single template and even have the same number of words, which makes them especially easy to detect using statistical methods.

For Set 1 the number of non-markup words in each page was recorded, so we can plot the variance of the word count of the pages downloaded from a given host name. The variance is plotted on the x-axis and the word count is shown on the y-axis; both axes are drawn on a logarithmic scale. Points on the left side of the graph, marked in blue, represent cases where at least 10 pages from a given host have the same word count. There are 944 such hosts (0.21% of the pages in Set 1). A random sample of 200 of these pages was examined manually: 35% were spam, 3.5% contained no text and 41.5% were soft errors (a page with a message indicating that the resource is not currently available, despite the HTTP status code 200 "OK").
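The word-count check can be sketched as follows. The code reads "the same word count" as zero variance across at least 10 pages of a host, which is an interpretation of the description above rather than the researchers' exact criterion; the function name and sample data are hypothetical.

```python
from collections import defaultdict
from statistics import pvariance

def templated_hosts(pages, min_pages=10):
    """Return hosts with at least `min_pages` pages whose word counts never vary.

    `pages` is an iterable of (host_name, non_markup_word_count) pairs
    (assumed input format).
    """
    counts_by_host = defaultdict(list)
    for host, word_count in pages:
        counts_by_host[host].append(word_count)
    return sorted(host for host, counts in counts_by_host.items()
                  if len(counts) >= min_pages and pvariance(counts) == 0)

# Hypothetical data: one host serves ten pages of exactly 120 words each.
pages = ([("spam.example", 120)] * 10
         + [("blog.example", n) for n in range(100, 112)]
         + [("tiny.example", 50)] * 3)
print(templated_hosts(pages))
```

As the manual inspection above shows, hosts flagged this way still need a second look, because soft error pages share the same signature.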

Content Evolution

The natural evolution of content on the Web is slow. Over a period of a week, 65% of all pages will not change at all, while only 0.8% will change completely. In contrast, many spam pages are generated on the fly in response to an HTTP request, independent of the requested URL, and change completely on every download. Therefore, by looking into extreme cases of content mutation, search engines are able to detect web spam.

The outliers represent IPs serving pages that change completely every week. Set 1 contains 367 such servers with 1,409,353 pages (roughly 0.9% of the 150 million URLs in the set). Manual examination of a sample of 106 pages showed that 103 (97.2%) were spam, 2 were soft errors and 1 adult page was counted as a false positive.
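A rough sketch of this check is shown below. It measures change as the share of words not common to two consecutive downloads, which is a simplification chosen for illustration and not the change vector the researchers recorded; the function names and the sample data are hypothetical.

```python
from collections import defaultdict

def change_fraction(old_text, new_text):
    """Share of words not common to both versions (0 = identical, 1 = nothing shared)."""
    old_words, new_words = set(old_text.split()), set(new_text.split())
    union = old_words | new_words
    if not union:
        return 0.0
    return 1 - len(old_words & new_words) / len(union)

def always_changing_ips(downloads, min_change=1.0):
    """Return IPs whose pages changed by at least `min_change` on every download.

    `downloads` is an iterable of (ip, old_text, new_text) tuples for
    consecutive weekly downloads of the same page (assumed input format).
    """
    changes_by_ip = defaultdict(list)
    for ip, old_text, new_text in downloads:
        changes_by_ip[ip].append(change_fraction(old_text, new_text))
    return sorted(ip for ip, changes in changes_by_ip.items()
                  if min(changes) >= min_change)

# Hypothetical data: the first server returns unrelated text every week.
downloads = [
    ("192.0.2.10", "buy cheap pills", "win casino bonus"),
    ("192.0.2.10", "loan credit debt", "hotel flights deals"),
    ("192.0.2.20", "our company news", "our company news"),
    ("192.0.2.20", "contact us today", "contact us now"),
]
print(always_changing_ips(downloads))
```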

Clustering Properties

Automatically generated spam pages tend to look very similar. In fact, as noted above, most of them are based on the same template and have only minor differences (such as varying keywords inserted into the template). Pages with such properties can be detected by applying cluster analysis to the samples.

To form clusters of similar pages, the "shingling" algorithm described by Broder et al. [2] is used. Figure 7 shows the distribution of the cluster sizes of near-duplicate pages in Set 1. The horizontal axis shows the size of the cluster (the number of pages in the near-equivalence class), and the vertical axis shows how many such clusters Set 1 contains.

The outliers can be put into two groups. The first group did not contain any spam pages; pages in this group are more related to the duplicate content issue. The second group, in contrast, is populated predominantly by spam documents. 15 of the 20 largest clusters were spam, containing 2,080,112 pages (1.38% of all pages in Set 1).
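The idea behind shingling can be illustrated on a small scale. The sketch below represents each page by its set of word shingles, measures similarity as the overlap of two shingle sets (intersection divided by union), and groups pages whose similarity exceeds a threshold. It compares all pairs directly, which only works for small inputs; the shingle size, the threshold, the function names and the sample pages are assumptions made for illustration.

```python
from itertools import combinations

def shingles(text, size=3):
    """Set of contiguous word sequences ("shingles") of the given size."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def resemblance(a, b):
    """Overlap of two shingle sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def cluster_near_duplicates(pages, threshold=0.5, size=3):
    """Group page names whose shingle sets resemble each other (union-find)."""
    names = list(pages)
    sets = {name: shingles(pages[name], size) for name in names}
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if resemblance(sets[a], sets[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted((sorted(c) for c in clusters.values()), key=len, reverse=True)

# Hypothetical pages: three are built from one template with a varying keyword.
pages = {
    "p1": "best cheap hotel deals in paris book now and save big today",
    "p2": "best cheap hotel deals in london book now and save big today",
    "p3": "best cheap hotel deals in rome book now and save big today",
    "p4": "the quick brown fox jumps over the lazy dog near the river bank",
}
print(cluster_near_duplicates(pages))
```

Unusually large clusters produced this way would then be candidates for manual inspection, as described above.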

To Sum Up

The methods described above are examples of a fairly simple statistical approach to spam detection. Real-life algorithms are much more sophisticated and are based on machine learning technologies, which allow search engines to detect and battle spam with relatively high efficiency at an acceptable rate of false positives. Applying spam detection techniques enables search engines to produce more relevant results and ensures fairer competition, based on the quality of web resources rather than on technical tricks.

References:

1. Dennis Fetterly, Mark Manasse, Marc Najork. "Spam, Damn Spam, and Statistics: Using statistical analysis to locate spam web pages" (2004). Microsoft Research.

2. A. Broder, S. Glassman, M. Manasse, and G. Zweig. "Syntactic Clustering of the Web". In 6th International World Wide Web Conference, April 1997.

About The Author-- Oleg Ishenko, MCSE, MCDBA, BSc


