From Corpora to Matching

Making effective use of the Internet increasingly depends on building better and more intelligent applications and search engines. Here is a brief introduction to how search engines work:

01) Define the corpus, search space/data;
02) Separate the corpus into documents;
03) Generate features for each document;
04) Generate a representation of each document;
05) Study the feature/vector space;
06) Cluster documents;
07) Reduce dimensionality;
08) Accept input queries;
09) Compute the cosine of the angle between the query vector and each document vector;
10) Find the sought document vector (column), i.e. the closest match;
11) Output the results to the user in some way.

Each document in a corpus (database) is described by a set of keywords called index terms. We assign weights to index terms according to their relevance (frequency of occurrence, for instance). This is how we create the index that we can then search.

Corpus preparation:
Web pages of interest are analysed and cleaned by removing hypertext tags and any other markup. Pages are then broken down into documents, and each document is scanned for words/terms of interest: those which make a document unique, not standard words.
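
The cleaning step can be sketched as follows. This is only an illustration using the Python standard library: the stop-word list, the sample page and the names TextExtractor, clean_page and tokenize are assumptions made for the example, not part of any particular search engine.

```python
import re
from html.parser import HTMLParser

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "for", "on"}


class TextExtractor(HTMLParser):
    """Collects the text of a page, skipping script and style blocks."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_page(html):
    """Strip the markup from a page and return its text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def tokenize(text):
    """Lower-case the text, split it into words and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


# Made-up sample page, for illustration only.
page = "<html><body><h1>The Wandle</h1><p>Mills on the river.</p></body></html>"
tokens = tokenize(clean_page(page))
```

Here tokens holds the words left once the tags and the standard words have been removed. Stemming, mentioned later in this article, is not attempted in this sketch.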

Extract terms of interest:
Bear in mind that terms of interest must be invariant, that is, characteristic of a document rather than generic and easy to find in any corpus/document. The idea is to find a signature for each document.
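
One simple way to approximate this, offered here as an assumption rather than as the method any given engine uses, is to discard terms that appear in too many documents. The function name select_terms, the 0.5 threshold and the sample documents are all hypothetical.

```python
from collections import Counter


def select_terms(documents, max_doc_fraction=0.5):
    """Keep the terms that occur in at most max_doc_fraction of the documents."""
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    limit = max_doc_fraction * len(documents)
    return sorted(t for t, df in doc_freq.items() if df <= limit)


# Made-up tokenised documents, for illustration only.
docs = [
    ["river", "mill", "history"],
    ["river", "museum", "history"],
    ["river", "valley", "heritage"],
    ["river", "mill", "wheel"],
]
terms = select_terms(docs)
```

In this sample "river" occurs in every document, so it is dropped as too generic, while the rarer terms are kept as candidate signature terms.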

Build term-by-document matrix:
The search space is defined by N dimensions, one per chosen term/feature, and each document is a point in this N-term space. This allows conceptual/semantic searches.

Each document becomes a column vector and each row represents a term, so each row gives the frequency of a term across the analysed corpus. At first we simply build the matrix by counting the terms in each document.
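
A minimal sketch of this counting step is shown below. The name build_term_document_matrix and the two sample documents are assumptions made for the example, and the matrix is held as a plain list of rows.

```python
from collections import Counter


def build_term_document_matrix(documents):
    """Return (terms, matrix), where matrix[i][j] counts term i in document j."""
    terms = sorted({t for tokens in documents for t in tokens})
    counts = [Counter(tokens) for tokens in documents]
    # One row per term, one column per document; missing terms count as 0.
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


# Made-up tokenised documents, for illustration only.
docs = [
    ["mill", "river", "mill"],
    ["museum", "river"],
]
terms, matrix = build_term_document_matrix(docs)
```

Reading down a column of matrix gives one document; reading along a row gives the counts of one term across the corpus.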

Compress the matrix:
There are two basic techniques/methods: Compressed Row Storage (scans the matrix row by row) and Compressed Column Storage (scans the matrix column by column). Both use three arrays.
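
The row-by-row variant can be sketched as below. The three arrays hold the non-zero values, the column index of each value, and the position at which each row starts; the names values, col_index, row_ptr, to_crs and crs_get are chosen for this example only.

```python
def to_crs(matrix):
    """Convert a dense matrix (a list of rows) to Compressed Row Storage."""
    values, col_index, row_ptr = [], [], [0]
    for row in matrix:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)       # the non-zero entry itself
                col_index.append(j)    # the column it came from
        row_ptr.append(len(values))    # where the next row starts
    return values, col_index, row_ptr


def crs_get(crs, i, j):
    """Look up entry (i, j); entries that were not stored are zero."""
    values, col_index, row_ptr = crs
    for k in range(row_ptr[i], row_ptr[i + 1]):
        if col_index[k] == j:
            return values[k]
    return 0


# Made-up matrix, for illustration only.
dense = [
    [2, 0, 0],
    [0, 1, 0],
    [1, 1, 3],
]
crs = to_crs(dense)
```

Compressed Column Storage works the same way with the roles of rows and columns swapped. The saving matters because most terms do not occur in most documents, so a term-by-document matrix is largely zeros.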

Normalise the matrix:
Normalisation implies transforming column vectors into unit vectors, i.e. vectors of unit length.

Unit document vectors contain the frequency of terms; the normalisation is applied because the semantic content of a document is generally determined by the relative frequency of terms.
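
As a sketch, each column can be divided by its Euclidean length. The function name normalise_columns and the sample counts are assumptions for the example, and the Euclidean norm is assumed as the measure of length.

```python
import math


def normalise_columns(matrix):
    """Scale each column of a dense matrix (a list of rows) to unit length."""
    n_cols = len(matrix[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_cols)]
    # An all-zero column has no direction, so it is left as zeros.
    return [
        [row[j] / norms[j] if norms[j] else 0.0 for j in range(n_cols)]
        for row in matrix
    ]


# Made-up term counts, for illustration only.
counts = [
    [3, 0],
    [4, 2],
]
unit = normalise_columns(counts)
```

After this step a long document and a short one with the same mix of terms end up as the same unit vector, which is the point of the normalisation.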

Singular Value Decomposition:
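This decomposes a matrix into three matrices. Two of them hold the singular vectors, which define the new dimensions; they are identical, and are eigenvectors, only when the matrix is symmetric and positive semi-definite, as the product of the term-by-document matrix with its own transpose is. The third is diagonal and holds the singular values (the square roots of the eigenvalues of that product), that is, the spread of the corpus along these new dimensions.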
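
To make the idea concrete without a linear-algebra library, the sketch below estimates only the single most important new dimension and the spread along it. It uses power iteration, which is a simpler technique than a full decomposition and is swapped in here purely for illustration; the name leading_singular_triplet and the sample matrix are assumptions.

```python
import math


def leading_singular_triplet(a, iterations=100):
    """Estimate the largest singular value of `a` and its singular vectors
    by power iteration on A^T A."""
    n_rows, n_cols = len(a), len(a[0])
    v = [1.0] * n_cols  # arbitrary non-zero start vector
    for _ in range(iterations):
        # u = A v
        u = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        # w = A^T u = (A^T A) v
        w = [sum(a[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # The length of A v is the singular value; its direction is the left vector.
    u = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    if sigma:
        u = [x / sigma for x in u]
    return sigma, u, v


# Made-up matrix whose dominant direction is the first axis.
a = [
    [3.0, 0.0],
    [0.0, 1.0],
]
sigma, u, v = leading_singular_triplet(a)
```

The square of sigma is the largest eigenvalue of the matrix multiplied by its own transpose. A production system would normally rely on a tested numerical library for the full decomposition rather than on this loop.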

A geometric interpretation:
The corpus is first formatted and stemmed, and is then stored in a compact term-by-document matrix. Each column of this matrix is then normalised to give the relative frequency of terms in a document.

The term-by-document matrix is then decomposed to calculate eigenvalues and eigenvectors. The eigenvectors represent a new Cartesian coordinate frame spanning the same search space, but they indicate the most important dimensions/axes along which the documents mainly lie. The eigenvalues quantify the spread of documents along these new axes.

Queries:
Queries must be based on the features/terms defined in the term-by-document matrix. Matching in a vector space such as this is implemented by multiplying the query vector by the term-by-document matrix, i.e. matching a query vector q against the documents (columns) of the matrix.
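
A minimal sketch of the matching step follows. It scores each document by the cosine of the angle between its column and the query vector; the name cosine_scores, the sample matrix and the one-term query are assumptions for the example.

```python
import math


def cosine_scores(matrix, query):
    """Cosine of the angle between `query` and each column of `matrix`."""
    n_cols = len(matrix[0])
    q_norm = math.sqrt(sum(x * x for x in query))
    scores = []
    for j in range(n_cols):
        col = [row[j] for row in matrix]
        d_norm = math.sqrt(sum(x * x for x in col))
        dot = sum(q * d for q, d in zip(query, col))
        # A zero-length query or document cannot match anything.
        scores.append(dot / (q_norm * d_norm) if q_norm and d_norm else 0.0)
    return scores


# Rows: "mill", "museum", "river"; columns: two made-up documents.
matrix = [
    [2, 0],
    [0, 1],
    [1, 1],
]
query = [1, 0, 0]  # a query for the single term "mill"
scores = cosine_scores(matrix, query)
best = max(range(len(scores)), key=scores.__getitem__)
```

Here best is the index of the column closest to the query. If the columns and the query are already unit vectors, the cosine reduces to the plain product of the query with the matrix.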

I am the website administrator of the Wandle Industrial Museum (http://www.wandle.org), established in 1983 by local people determined to ensure that the history of the valley was no longer neglected and that awareness of its heritage was enhanced for the use and benefit of the community.
