As Chris Silver Smith explains at Search Engine Land, Microsoft’s Bing Business Portal features a beta interface through which you handle your business details. It replaces the Bing Local Listing Center. If you want some information about the service before adding your business, you can read the FAQ. To start using the interface, you’ll need to find your listing. Bing Business Portal will ask you to enter your business name (required), address, city, state, and zip code, and then press the search button to see if Bing already has information about your business. If you enter only your busine…
SEO Chat – Search Engine Optimization Tutorials
Posted by rolfbroer
This post was originally in YOUmoz, and was promoted to the main blog because it provides great value and interest to our community. The author’s views are entirely his or her own and may not reflect the views of SEOmoz, Inc.
A good search engine does not attempt to return the pages that best match the input query. A good search engine tries to answer the underlying question. Once you are aware of this, you'll understand why Google (and other search engines) use a complex algorithm to determine what results they should return. The factors in the algorithm include "hard factors" such as the number of backlinks to a page and perhaps some social recommendations through likes and +1's. These are usually external influences. You also have the factors on the page itself: the way a page is built and its various page elements play a role in the algorithm. But only by analyzing both the on-site and off-site factors is it possible for Google to determine which pages will answer the question behind the query. For this, Google will have to analyze the text on a page.
In this article I will elaborate on the problems a search engine faces and some possible solutions. By the end of this article we won't have revealed Google’s algorithm (unfortunately), but we’ll be one step closer to understanding some of the advice we often give as SEOs. There will be some formulas, but do not panic. This article isn’t just about those formulas. The article comes with an Excel file. Oh, and the best thing: I will use some Dutch delights to illustrate the problems.

Behold: Croquets are the elongated and bitterballen are the round ones
True OR False
Search engines have evolved tremendously in recent years, but at first they could only deal with Boolean operators. In simple terms, a term was either included in a document or not. Something was true or false, 1 or 0. Additionally, you could use operators such as AND, OR and NOT to search for documents that contain multiple terms or to exclude terms. This sounds fairly simple, but it does have some problems. Suppose we have two documents, which consist of the following texts:
Doc1:
“And our restaurant in New York serves croquets and bitterballen.”
Doc2:
“In the Netherlands you retrieve croquets and frikandellen from the wall.”

Oops, almost forgot to show you the frikandellen
If we were to build a search engine, the first step is tokenization of the text. We want to be able to quickly determine which documents contain a term. This is easier if we put all tokens in a database. A token is any single term in a text, so how many tokens does Doc1 contain?
The moment you started to answer this question for yourself, you probably thought about the definition of a "term". Actually, in the example "New York" should be recognized as one term. How we can determine that the two individual words are actually one term is outside the scope of this article, so for the moment we treat each separate word as a separate token. So we have 10 tokens in Doc1 and 11 tokens in Doc2. To avoid duplication of information in our database, we will store types and not tokens.
Types are the unique tokens in a text. In the example, Doc1 contains the token "and" twice. In this example I ignore the fact that “and” appears once with and once without a capital letter. As with the determination of a term, there are techniques to determine whether something actually needs to be capitalized. In this case, we assume that we can store it without a capital and that “And” and “and” are the same type.
By storing all the types in the database together with the documents where we can find them, we’re able to search the database with the help of Booleans. The search "croquets" will return both Doc1 and Doc2. The search for "croquets AND bitterballen" will only return Doc1. The problem with this method is that you are likely to get too many or too few results. In addition, it lacks the ability to rank the results. If we want to improve our method, we have to determine what we can use other than the presence or absence of a term in a document. Which on-page factors would you use to organize the results if you were Google?
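The tokenize, store types, and search steps can be sketched in a few lines of Python. This is only an illustration under the simplifications made above (every separate word is a token, capitals are ignored); the function names and the in-memory dictionary standing in for the database are assumptions, not how any real search engine is built.

```python
import re
from collections import defaultdict

DOCS = {
    "Doc1": "And our restaurant in New York serves croquets and bitterballen.",
    "Doc2": "In the Netherlands you retrieve croquets and frikandellen from the wall.",
}

def tokenize(text):
    """Split text into lowercase word tokens; every separate word is one token."""
    return re.findall(r"\w+", text.lower())

def build_index(docs):
    """Map each type (unique token) to the set of documents that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for token in tokenize(text):
            index[token].add(name)
    return index

def search_and(index, *terms):
    """Boolean AND: return the documents that contain every term."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()

index = build_index(DOCS)
print(len(tokenize(DOCS["Doc1"])), "tokens in Doc1")
print(sorted(search_and(index, "croquets")))
print(sorted(search_and(index, "croquets", "bitterballen")))
```

Note that the index only records presence or absence, which is exactly why it cannot order the results.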
Zone Indexes
A relatively simple method is to use zone indexes. A web page can be divided into different zones. Think of a title, description, author and body. By adding a weight to each zone in a document, we’re able to calculate a simple score for each document. This is one of the first on-page methods search engines used to determine the subject of a page. Scoring with zone indexes works as follows:
Suppose we add the following weights to each zone:
| Zone | Weight |
| title | 0.4 |
| description | 0.1 |
| content | 0.5 |
We perform the following search query:
“croquets AND bitterballen”
And we have a document with the following zones:
| Zone | Content | Boolean | Score |
| title | New York Café | 0 | 0 |
| description | Café with delicious croquets and bitterballen | 1 | 0.1 |
| content | Our restaurant in New York serves croquets and bitterballen | 1 | 0.5 |
| Total | 0.6 |
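The same calculation as a small Python sketch, using the weights and the example document from the tables above. The function name and the whitespace-based word matching are assumptions made for illustration.

```python
ZONE_WEIGHTS = {"title": 0.4, "description": 0.1, "content": 0.5}

document = {
    "title": "New York Café",
    "description": "Café with delicious croquets and bitterballen",
    "content": "Our restaurant in New York serves croquets and bitterballen",
}

def zone_score(doc, terms, weights):
    """Add up the weight of every zone that contains all query terms."""
    score = 0.0
    for zone, weight in weights.items():
        words = set(doc.get(zone, "").lower().split())
        if all(term.lower() in words for term in terms):
            score += weight
    return score

score = zone_score(document, ["croquets", "bitterballen"], ZONE_WEIGHTS)
print(round(score, 2))
```

Because each zone contributes either its full weight or nothing, only a handful of distinct scores are possible.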
Because at some point everyone started abusing the weights assigned to, for example, the description, it became more important for Google to split the body into different zones and assign a different weight to each individual zone in the body.
This is quite difficult because the web contains a variety of documents with different structures. The interpretation of an XML document by a machine is quite simple. Interpreting an HTML document is harder for a machine. The structure and tags are much more limited, which makes the analysis more difficult. Of course there will be HTML5 in the near future and Google supports microformats, but it still has its limitations. For example, if you know that Google assigns more weight to content within the <content> tag and less to content in the <footer> tag, you’ll never use the <footer> tag.
To determine the context of a page, Google will have to divide a web page into blocks. This way Google can judge which blocks on a page are important and which are not. One of the methods that can be used is the text / code ratio. A block on a page that contains much more text than HTML code probably contains the main content of the page. A block that contains many links and a lot of HTML code but little content is probably the menu. This is why choosing the right WYSIWYG editor is very important. Some of these editors produce a lot of unnecessary HTML code.
The use of text / code ratio is just one of the methods which a search engine can use to divide a page into blocks. Bill Slawski talked about identifying blocks earlier this year.
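To make the text / code ratio concrete, here is a minimal Python sketch that compares two blocks. The ratio definition (visible text characters divided by all characters in the block) and the two example blocks are assumptions for illustration; how a search engine actually segments a page is not known.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects the visible text of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def text_code_ratio(html):
    """Share of the characters in a block that are text rather than markup."""
    if not html:
        return 0.0
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks).strip()
    return len(text) / len(html)

# Hypothetical blocks: a navigation menu and a paragraph of body content.
menu = '<ul><li><a href="/">Home</a></li><li><a href="/menu">Menu</a></li></ul>'
body = "<p>Our restaurant in New York serves croquets and bitterballen.</p>"

print(f"menu: {text_code_ratio(menu):.2f}")
print(f"body: {text_code_ratio(body):.2f}")
```

The menu block scores much lower than the body block, which is the signal described above.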
The advantage of the zone indexes method is that you can quite simply calculate a score for each document. A disadvantage, of course, is that many documents can get the same score.
Term frequency
When I asked you to think of on-page factors you would use to determine the relevance of a document, you probably thought about the frequency of the query terms. It is a logical step to give more weight to documents that use the search terms more often.
Some SEO agencies stick to the story of using the keywords at a certain percentage in the text. We all know that isn’t true, but let me show you why. I'll try to explain it on the basis of the following examples. Some formulas will come up, but as I said, it is the outline of the story that matters.
The numbers in the table below are the number of occurrences of a word in the document (also called term frequency or tf). So which document has a better score for the query: croquets and bitterballen?
| | croquets | and | café | bitterballen | Amsterdam | … |
| Doc1 | 8 | 10 | 3 | 2 | 0 | |
| Doc2 | 1 | 20 | 3 | 9 | 2 | |
| DocN | … | … | … | … | … | |
| Query | 1 | 1 | 0 | 1 | 0 | |
The score for both documents would be as follows:
score(“croquets and bitterballen”, Doc1) = 8 + 10 + 2 = 20
score(“croquets and bitterballen”, Doc2) = 1 + 20 + 9 = 30
In this case Document 2 is more closely related to the query. In this example the term “and” gains the most weight, but is this fair? It is a stop word, and we would like to give it only a little value. We can achieve this by using inverse document frequency (idf), which is the opposite of document frequency (df); multiplying it by the term frequency gives the tf-idf weight. Document frequency is the number of documents in which a term occurs. Inverse document frequency is, well, the opposite. As the number of documents in which a term occurs grows, idf will shrink.
You can calculate idf by dividing the total number of documents in your corpus by the number of documents containing the term and then taking the logarithm of that quotient.
Suppose the idf values of our query terms are as follows:
Idf(croquets) = 5
Idf(and) = 0.01
Idf(bitterballen) = 2
Then you get the following scores:
score(“croquets and bitterballen”, Doc1) = 8*5 + 10*0.01 + 2*2 = 44.1
score(“croquets and bitterballen”, Doc2) = 1*5 + 20*0.01 + 9*2 = 23.2
Now Doc1 has a better score. But we still don’t take document length into account. One document can contain much more content than another document without being more relevant. A long document gains a higher score quite easily with this method.
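The tf-idf scores above can be reproduced with a short Python sketch. The term frequencies and idf values are the ones from the example; the function names are made up for illustration, and the base-10 logarithm in `idf` is an assumption, since the text only says "the logarithm".

```python
import math

def idf(total_docs, doc_freq):
    """Inverse document frequency: log of corpus size over document frequency.
    Base 10 is an assumption; any base gives the same ordering."""
    return math.log10(total_docs / doc_freq)

# Term frequencies from the table and the idf values assumed in the text.
TF = {
    "Doc1": {"croquets": 8, "and": 10, "bitterballen": 2},
    "Doc2": {"croquets": 1, "and": 20, "bitterballen": 9},
}
IDF = {"croquets": 5, "and": 0.01, "bitterballen": 2}

def tf_idf_score(query_terms, tf, idf_values):
    """Sum tf * idf over the query terms."""
    return sum(tf.get(term, 0) * idf_values.get(term, 0) for term in query_terms)

query = ["croquets", "and", "bitterballen"]
for name, tf in TF.items():
    print(name, round(tf_idf_score(query, tf, IDF), 2))
```

A term that occurs in nearly every document gets an idf close to zero, so the stop word "and" barely contributes no matter how often it appears.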
Vector model
We can solve this by looking at the cosine similarity of a document. An exact explanation of the theory behind this method is outside the scope of this article, but you can think of it as a kind of harmonic mean between the query terms in the document. I made an Excel file, so you can play with it yourself. There is an explanation in the file itself. You need the following metrics:
- Query terms – each separate term in the query.
- Document frequency – how many documents known to Google contain that term?
- Term frequency – the frequency for each separate query term in the document (add this Focus Keyword widget made by Sander Tamaëla to your bookmarks, very helpful for this part)
Here's an example where I actually used the model. The website had a page that was designed to rank for "fiets kopen" which is Dutch for “buying bikes”. The problem was that the wrong page (the homepage) was ranking for the query.
For the formula, we include the previously mentioned inverse document frequency (idf). For this we need the total number of documents in Google's index. We assume N = 10.4 billion.
An explanation of the table below:
- tf = term frequency
- df = document frequency
- idf = inverse document frequency
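- Wt,q = weight for term in query
- Wf = squared term frequency in the document (tf²)
- Wt,d = weight for term in document (tf divided by the square root of the summed Wf values)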
- Product = Wt,q * Wt,d
- Score = Sum of the products
The main page, which was ranking: http://www.fietsentoko.nl/
| term | tf (query) | df | idf | Wt,q | tf (document) | Wf | Wt,d | Product |
| Fiets | 1 | 25.500.000 | 3.610493159 | 3.610493159 | 21 | 441 | 0.70711 | 2.55302 |
| Kopen | 1 | 118.000.000 | 2.945151332 | 2.9452 | 21 | 441 | 0.70711 | 2.08258 |
| Score | | | | | | | | 4.6356 |
The page I wanted to rank: http://www.fietsentoko.nl/fietsen/
| term | tf (query) | df | idf | Wt,q | tf (document) | Wf | Wt,d | Product |
| Fiets | 1 | 25.500.000 | 3.610493159 | 3.610493159 | 22 | 484 | 0.61782 | 2.23063 |
| Kopen | 1 | 118.000.000 | 2.945151332 | 2.945151332 | 28 | 784 | 0.78631 | 2.31584 |
| Score | | | | | | | | 4.54647 |
Although the second document contains the query terms more often, the score of the document for the query was lower (higher is better). This was because of the lack of balance between the query terms. Following this calculation, I changed the text on the page: I increased the use of the term “fietsen” and decreased the use of “kopen”, which is a more generic term in the search engine and has less weight. This changed the score as follows:
| term | tf (query) | df | idf | Wt,q | tf (document) | Wf | Wt,d | Product |
| Fiets | 1 | 25.500.000 | 3.610493159 | 3.610493159 | 28 | 784 | 0.78631 | 2.83897 |
| Kopen | 1 | 118.000.000 | 2.945151332 | 2.945151332 | 22 | 484 | 0.61782 | 1.81960 |
| Score | | | | | | | | 4.6586 |
After a few days, Google crawled the page and the document I changed started to rank for the term. We can conclude that the number of times you use a term is not necessarily important. It is important to find the right balance between the terms you want to rank for.
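The scores in the three tables can be reproduced with a few lines of Python. This is a sketch under assumptions: the idf values are copied from the tables rather than recomputed, the names are made up for illustration, and the document weights are normalized over the two query terms only, as the tables do.

```python
import math

# idf values as listed in the tables above.
IDF = {"fiets": 3.610493159, "kopen": 2.945151332}

def vector_score(tf_doc, idf_values):
    """Sum of Wt,q * Wt,d over the query terms (query tf is 1 for each term)."""
    # Length normalization: square root of the summed squared term frequencies.
    norm = math.sqrt(sum(tf ** 2 for tf in tf_doc.values()))
    if norm == 0:
        return 0.0
    score = 0.0
    for term, idf in idf_values.items():
        w_query = 1 * idf                   # Wt,q = tf in query * idf
        w_doc = tf_doc.get(term, 0) / norm  # Wt,d = tf / sqrt(sum of Wf)
        score += w_query * w_doc
    return score

pages = {
    "homepage": {"fiets": 21, "kopen": 21},
    "target page, before": {"fiets": 22, "kopen": 28},
    "target page, after": {"fiets": 28, "kopen": 22},
}
for name, tf in pages.items():
    print(f"{name}: {vector_score(tf, IDF):.4f}")
```

Because of the normalization, doubling every term frequency on a page leaves its score unchanged; only the balance between the terms matters, and shifting it toward the term with the higher idf raises the score.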
Speed up the process
Performing this calculation for each document that matches the search query costs a lot of processing power. You can fix this by adding some static values to determine for which documents you want to calculate the score. PageRank, for example, is a good static value. When you first calculate the score for the pages that match the query and have a high PageRank, you have a good chance of finding some documents which would end up in the top 10 of the results anyway.
Another possibility is the use of champion lists. For each term, take only the top N documents with the best score for that term. If you then have a multi-term query, you can intersect those lists to find documents that contain all query terms and probably have a high score. Only if there are too few documents containing all terms do you search in all documents. So you’re not going to rank just by having the best vector score; you have to have your static scores right as well.
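Champion lists can be sketched as follows. The per-term scores, the document names and the `min_results` threshold are hypothetical values chosen for illustration.

```python
def build_champion_lists(term_scores, n):
    """For each term keep only the n documents with the best score for that term."""
    champions = {}
    for term, scores in term_scores.items():
        best = sorted(scores, key=scores.get, reverse=True)[:n]
        champions[term] = set(best)
    return champions

def candidates(query_terms, champions, term_scores, min_results):
    """Intersect the champion lists; fall back to all documents if too few remain."""
    lists = [champions.get(term, set()) for term in query_terms]
    found = set.intersection(*lists) if lists else set()
    if len(found) >= min_results:
        return found
    full = [set(term_scores.get(term, {})) for term in query_terms]
    return set.intersection(*full) if full else set()

# Hypothetical per-term scores (for example tf-idf weights) for four documents.
TERM_SCORES = {
    "croquets": {"A": 0.9, "B": 0.7, "C": 0.2, "D": 0.1},
    "bitterballen": {"A": 0.8, "C": 0.6, "B": 0.3, "D": 0.2},
}
champions = build_champion_lists(TERM_SCORES, n=2)
query = ["croquets", "bitterballen"]
print(sorted(candidates(query, champions, TERM_SCORES, min_results=1)))
print(sorted(candidates(query, champions, TERM_SCORES, min_results=2)))
```

The expensive vector score then only has to be calculated for the returned candidates instead of for every matching document.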
Relevance feedback
Relevance feedback is assigning more or less value to a term in a query, based on the relevance of a document. Using relevance feedback, a search engine can change the user query without telling the user.
The first step here is to determine whether a document is relevant or not. Although there are search engines where you can specify whether a result or a document is relevant, for a long time Google did not have such a function. Their first attempt was adding the favorite star to the search results. Now they are trying it with the Google +1 button. If enough people start pushing the button at a certain result, Google will start considering the document relevant for that query.
Another method is to look at the current pages that rank well. These will be considered relevant. The danger of this method is topic drift. If you're looking for bitterballen and croquets, and the best ranking pages are all snack bars in Amsterdam, the danger is that you will assign value to Amsterdam and end up with just snack bars in Amsterdam in the results.
Another option for Google is simply to use data mining. They can also look at the CTR of different pages. Pages with a higher CTR and a lower than average bounce rate can be considered relevant. Pages with a very high bounce rate will be considered irrelevant.
An example of how we can use this data for adjusting the query term weights is Rocchio's feedback formula. It comes down to adjusting the value of each term in the query and possibly adding additional query terms. The formula for this is as follows:
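In its standard textbook form, which matches the arithmetic in the worked table that follows, the formula reads as below. Here q0 is the original query vector, D_r and D_nr are the sets of relevant and irrelevant documents, and alpha, beta and gamma are the weights; the symbol names are the conventional ones and are assumed here.

```latex
\vec{q}_{\text{new}} = \alpha\,\vec{q}_{0}
  + \beta\,\frac{1}{|D_r|}\sum_{\vec{d}\in D_r}\vec{d}
  - \gamma\,\frac{1}{|D_{nr}|}\sum_{\vec{d}\in D_{nr}}\vec{d}
```

In words: keep the original query, add the average of the relevant documents, and subtract the average of the irrelevant ones.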

The table below is a visual representation of this formula. Suppose we apply the following values:
Query terms: +1 (alpha)
Relevant terms: +1 (beta)
Irrelevant terms: -0.5 (gamma)
We have the following query:
“croquets and bitterballen”
The relevance of the following documents is as follows:
Doc1 : relevant
Doc2 : relevant
Doc3 : not relevant
| Terms | Q | Doc1 | Doc2 | Doc3 | Weight new query |
| croquets | 1 | 1 | 1 | 0 | 1 + 1 – 0 = 2 |
| and | 1 | 1 | 0 | 1 | 1 + 0.5 – 0.5 = 1 |
| bitterballen | 1 | 0 | 0 | 0 | 1 + 0 – 0 = 1 |
| café | 0 | 0 | 1 | 0 | 0 + 0.5 – 0 = 0.5 |
| Amsterdam | 0 | 0 | 0 | 1 | 0 + 0 – 0.5 = -0.5, set to 0 |
The new query is as follows:
croquets(2) and(1) bitterballen(1) café(0.5)
The value for each term is the weight that it gets in your query. We can use those weights in our vector calculations. Although the term Amsterdam was given a score of -0.5, we adjust negative values back to 0. In this way we do not exclude terms from the search results. And although café did not appear in the original query, it was added and was given a weight in the new query.
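The table above can be recomputed with a short Python sketch. The document vectors are the 0/1 values from the table; the function name and the dictionary representation are assumptions made for illustration.

```python
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=1.0, gamma=0.5):
    """Reweight the query terms from relevant and irrelevant document vectors."""
    terms = set(query)
    for doc in relevant + irrelevant:
        terms.update(doc)
    new_query = {}
    for term in terms:
        # Average weight of the term in the relevant and the irrelevant documents.
        rel = sum(d.get(term, 0) for d in relevant) / len(relevant) if relevant else 0.0
        irr = sum(d.get(term, 0) for d in irrelevant) / len(irrelevant) if irrelevant else 0.0
        weight = alpha * query.get(term, 0) + beta * rel - gamma * irr
        new_query[term] = max(weight, 0.0)  # negative weights are set back to 0
    return new_query

query = {"croquets": 1, "and": 1, "bitterballen": 1}
doc1 = {"croquets": 1, "and": 1}   # relevant
doc2 = {"croquets": 1, "café": 1}  # relevant
doc3 = {"and": 1, "Amsterdam": 1}  # not relevant

new_query = rocchio(query, [doc1, doc2], [doc3])
for term, weight in sorted(new_query.items()):
    print(term, weight)
```

Terms that end up with weight 0, such as Amsterdam here, simply contribute nothing to the new query.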
If Google uses this kind of relevance feedback, you could look at pages that already rank for a particular query. By using the same vocabulary, you can make sure that you get the most out of it.
Takeaways
In short, we’ve considered one of the options for assigning a value to a document based on the content of the page. Although the vector method is fairly accurate, it is certainly not the only method to calculate relevance. There are many adjustments to the model, and it remains only a part of the complete algorithm of search engines like Google. We have taken a look at relevance feedback as well. *cough* panda *cough*. I hope I’ve given you some insight into the methods search engines can use other than external factors. Now it's time to discuss this and to go play with the Excel file.
Posted by @aaron_wheeler
Howdy mozfans! This week’s Whiteboard Friday features the return of Danny Dover, our lead SEO here at SEOmoz. He’s going to be discussing the basics of local SEO, a rapidly developing, important niche in SEO land that involves a complex amalgamation of many data sources and metrics. Hey, sounds a lot like the regular SEO we know and love! Take a look at what’s on Danny’s whiteboard here below the video.
Danny’s Whiteboard:
SEO Local: Behind the Scenes:
- Most important: accessibility and content
- Second most important: keyword research and targeting
- Third most important: links
- Fourth most important: social
SEO Local-Specific Features/Considerations
- Search engine page
- Local directory submissions
- Yahoo Local
- Yelp
- Citysearch
- Urbanspoon
- Trip Advisor
- Judysbook
- Insider Pages
- Niche Data Sources
- Links
- Addresses
- Categories
- Reviews
Other Metrics Worth Considering
- Title (Business Name)
- Photos
- Social
Video Transcription
Coming soon…
Video transcription by SpeechPad.com
Follow Danny on Twitter! Even more to your benefit, follow SEOmoz! You know what? Why don’tcha follow me too: Aaron Wheeler.
If you have any tips or tricks that you’ve learned along the way, we’d love to hear about it in the comments below. Post your comment and be heard!
Tags: Basics, Friday, Local, Whiteboard
Posted by Aaron Wheeler
Video SEO isn’t something we always think about when optimizing, but we really should. In this week’s Whiteboard Friday, Danny Dover reviews some of the video SEO basics that every SEO should know about. After all, it’s a largely untapped market, unlike the Canadian maple tree market. Which is very tapped. (The Canadian maple tree video market, however, is quite untapped, but based on my scientific and extremely boring research in YouTube, I don’t recommend you pursue that market at all).
Anyways, we have a very special visitor this week, what with all of Danny’s meta discussions this month. Great Scott! That’s what happens when you get all meta and self-referential on us, Danny.
Video Transcription
Hello, everybody. My name is Danny Dover. I work here at SEOmoz doing SEO. For today’s Whiteboard Friday we’re going to be talking about video SEO. Now, last week I mentioned that was the most meta video we’d ever done. It was optimizing SEO resources, right? Now, this one is a video on video SEO. So this one, this one is the new champion of the most meta video that we have ever done here, and possibly the most meta video that you have ever seen. If there is some kind of disruption in the space-time continuum, totally my fault. I apologize.
–1.21 Gigawatts!?!–
That was unexpected. That was Doc from Back to the Future. A poor impression of it. Totally derailing my Whiteboard Friday. You’re killing me.
All right. Now, video SEO, huge opportunity here. This is more of a serious thing. Video SEO has low competition. You see in the universal results that video thumbnails show up about a third of the way from the top, right. You’re seeing little thumbnails. A lot of times it’s YouTube, but you also see Vimeo and lots of other video providers showing up. You are seeing those in lots and lots of SERPs, and increasingly so, actually. There is a huge demand from people because, you know, Google is doing A/B testing or multivariate testing. They’re seeing people are clicking on those. But, at the same time, you’ll have low competition. You’ll see a lot of times for very high competition keywords that have video results that the video results will just be kind of mediocre. They just kind of showed up there. Part of that is because it is new. Not a lot of people are optimizing for video, which is becoming extremely important. So, a lot of opportunity there.
The other part of this, I guess I can only talk for the United States, where I live, but the way that people are starting to consume media is changing drastically. We’ve all seen YouTube. We’ve all seen Vimeo. Now the devices people are using and the places they are watching video are different. You have things like the iPhone, the iPad, and the iPayWayTooMuchForGadgets and I am an Apple fanboy, kind of thing. You’re seeing these all over the place. There is the Android model, the operating system that is running lots and lots of things. You’re seeing the way that people are consuming media very differently. The market is growing. Based on that, the demand is high but the competition is really low. Lots of opportunity. This smells like money to me. This is huge. This is a big deal.
How do you take advantage of this? Well, there are different metrics the search engines use to look at video content. When the search engines crawl normal content, they can get some kind of idea of what text is trying to say by using their natural language processing algorithms. They can get some idea of what this text says just simply because they put so much time and so much energy into developing these algorithms to get some kind of semantic feeling for what text means. Now, this doesn’t translate directly into video because, part of the reason at least, is video files are much bigger. It takes a lot more processing to get an understanding of it. It is a lot more zeros and ones. With this in mind, Google and the other search engines let you provide meta information about a video.
The two most important ones here are the title of the video — what do you title your video. That’s probably what people are going to search for, right. If it is the shoes video on YouTube or whatever it may be on YouTube. Those are a lot of times what people are searching for. That information turns out to be very important for video SEO.
Likewise, the description is also very important because it gives you more than whatever may be the character limit, probably around 140, I would guess for the title. But it gives you more text to describe it in more depth. This helps the search engines understand the video without having to go through all the intensive video processing.
Now, as video SEO is maturing, we’re starting to see more and more metrics start to affect the algorithm. So, let me be totally straightforward with this. This is just my speculation. I have not done tests on these ones. But they seem very likely to be impacting the video search results. My guess would be that they’ll be more impactful going forward. So, they are something to start paying attention to now.
The first one I see here is engagement stats. The most obvious one here is views. How many times is a video viewed? I know that when I go to YouTube and I search for something, after I look at the text, the title and the description, I then look at the views. Has this been watched 30 times or has it been watched 10 million times? It seems very, very likely to me that click-through rates are going to correlate with high view rates also. So, I think views are becoming increasingly important and are something that you should keep an eye on.
Number two is ratings. So, on YouTube they offer a five-point scale. On things like Vimeo and other things, they use a thumb up and a thumb down. That’s more similar to the Reddit system. These are actual humans who are giving their opinions and their expertise on video content. This is very helpful because search engines are designed to provide results for humans. Any input you can get from humans is helpful for getting output for humans. This is something that Google figured out very early and is something that is very important.
Number three, comments. What could be more human than commenting on videos? In YouTube’s case, it is some of the lowest thresholds of intelligence we’ve ever seen on the Internet, which is really saying something. You have 4chan, below that you have YouTube comments. It is kind of rough, right. But this is a metric of actual human beings engaging with content and with the author or producer of the video. This seems like a very important metric to me. I don’t think it is the content of the comments, because they are awful. But I think it is the volume of it and the kind of themes that people are talking about. Are they saying, "this is awesome" or "this sucks?" I think that does have some kind of impact on it.
The last one is social metrics. Really, I think this is universal. It is not just the video vertical; I think it is the other verticals as well. By social metrics, I mean things like the amount of tweets or what people are saying in tweets, Delicious popular saves, or submissions to Reddit or Digg or any of those other things. How are people talking about this with their friends? So, you have things like the QDF algorithm, which is Google’s Query Deserves Freshness algorithm. What this does is it will artificially inflate the ability for something to rank based on temporal metrics. So, if lots and lots of people are linking to something or tweeting about it, then it can artificially rank higher than things that normally wouldn’t just because it is very important. You see this a lot of times with natural disasters. Things will just rise to the top when normally they wouldn’t. Michael Jackson stuff. We saw lots and lots of QDF stuff really blowing, making things rank when normally there was no way they would. This is something to keep in mind also. These social metrics.
Now, duration. I think this is the last one. This one is more about the extremes, finding the outlier. If a video is three seconds long, it is probably not something that Google, Bing, or Yahoo will want to rank highly. At the same time, if it is something that is multiple hours long, they might want to rank it, but it is probably not what people are going to look for when they are doing video. One of the things about video and content on the Internet in general is that people want to consume it quickly. They like bulleted lists. They like quick pictures, infographic types of things, and they like short videos. I should probably take my own advice and get to the end here. So, I’ll try to do that.
The last one we have for you is tactics. I have expressed that there is a huge opportunity here. I have talked about some of the metrics that are important. Now, tactics, the search engines have given you several tools on how to do this. Video sitemaps are not new, because video sitemaps have existed for a while, but the protocol was recently revamped by the major search engines and the people who are involved with that protocol. They’ve added a couple of things that are interesting. They’ve added the location of the thumbnail of the video. They’ve added things like if it is family friendly or not. They’ve added the URL of where the video is embedded. So, from an SEO perspective, this is really interesting. We don’t want links going to YouTube anymore because YouTube has plenty of links. Instead, with the new video sitemap, you can provide the URL of where it is embedded and then when the search engines index that content they’ll link back to you. So, it’s not so much that you get a link from it per se, but you get the click-through. So, someone clicking on the SERP, clicking that thumbnail, is going to go to your blog, where you embedded the video, rather than to the hosting provider. This is a big win for us SEOs and for us content producers.
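As an illustration of the video sitemap fields mentioned in the transcript (thumbnail location, family-friendly flag, and the URL of the page where the video is embedded), here is a minimal Python sketch that generates one entry. The element names follow Google's video sitemap schema as commonly documented and should be checked against the current documentation; the example.com URLs, the function name and the texts are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
VIDEO_NS = "http://www.google.com/schemas/sitemap-video/1.1"
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("video", VIDEO_NS)

def video_sitemap(page_url, thumbnail_url, title, description, player_url,
                  family_friendly=True):
    """Build a one-entry video sitemap and return it as an XML string."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
    # <loc> is the page on your own site where the video is embedded.
    ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
    video = ET.SubElement(url, f"{{{VIDEO_NS}}}video")
    fields = {
        "thumbnail_loc": thumbnail_url,
        "title": title,
        "description": description,
        "player_loc": player_url,
        "family_friendly": "yes" if family_friendly else "no",
    }
    for tag, value in fields.items():
        ET.SubElement(video, f"{{{VIDEO_NS}}}{tag}").text = value
    return ET.tostring(urlset, encoding="unicode")

sitemap_xml = video_sitemap(
    page_url="https://www.example.com/blog/video-seo",
    thumbnail_url="https://www.example.com/thumbs/video-seo.jpg",
    title="Video SEO basics",
    description="A short overview of the basics of video SEO.",
    player_url="https://www.example.com/player?video=video-seo",
)
print(sitemap_xml)
```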
The other one is transcriptions. So, what could be easier than just going back and using the old tactics you already have for creating content? With transcriptions, you take video, you take the audio from the video, and you turn it into plain text. This is something that the search engines can then use and interpret just like they do a normal web page. This is important for search engines, but it is also important for human beings as well. People with hearing impairments who can’t hear this video right now can then go through and read it. They can understand it that way. International people who are speaking different languages can then go through the content and read at their own pace. They can do whatever tools they need to translate it. It helps spread it more. It is both good for humans and for users, which is a win-win and that’s always the situation I try to get when I do SEO.
I recommend that you always try to go for those win-wins, because ultimately what the search engines are doing is chasing after the idea of getting the best information to human beings. I think that’s what it really comes down to, crafting your content for human beings. It is harder to do with video SEO, but it is becoming more and more possible to do it.
I appreciate your time today. I will see you next week.
Video transcription by SpeechPad.com
Follow Danny on Twitter! Even more to your benefit, follow SEOmoz! Alternatively, you can always follow me, Aaron.
If you have any tips or advice that you’ve learned along the way, or if you came back from the future, we’d love to hear about it in the comments below. Post your comment and be heard!
Tags: Basics, Friday, Video, Whiteboard
