SEO Newsletter | Volume 61 | November 17, 2008

FEATURE: PubCon 2008 Breakdown and Wrap-Up
The Las Vegas Convention Center was filled to the brim last week with the best Internet marketing minds on the planet. PubCon 2008 kicked off on Tuesday with a scheduled 90 sessions, three keynotes, several parties and one very hardcore networking event -- PubCon Classic, a day of networking and drinking at Hofbrauhaus. This article covers all the trends that we saw while in Las Vegas.

BACK TO (not so) BASICS: Programming for SEOs: How to Easily Write Custom Data Extraction Scripts
Or, How to Use Python to Spider a Web Site

In the title of this article, the word "easily" might overstate things a little. However, as an SEO, there's nothing more frustrating than the mounds of semi-structured data we have to wade through to get to the bits useful to our job. Having access to the development systems for a site makes the job of an SEO somewhat simpler, but it is not often the case that an SEO will have such access. Usually our interface to a site is the same as anyone else's — a Web browser. Third-party site analysis tools are useful, as far as they go, but they're limited too — you're hemmed in by the vision of the tool maker.


Hot Topics

At PubCon this month, search engine representatives were eager to share the word about new features, products and updates. Many of them were highlighted during the Search Engine Smackdown session and many others were shared via official blogs. Here's a rundown of some of these notable updates.

Google

Google was pretty noisy, with a number of new services to announce. Google Flu Trends launched; the program tracks the number of searches being performed on flu-related terms to make remarkably accurate predictions about the spread of influenza. It highlighted the incredible usefulness of Google Trends, which also got plenty of chatter throughout the conference sessions. Gmail now offers voice and video chat, and Google announced that a new iPhone application would allow users to search by voice.

The former invite-only beta program Google Ad Planner was made available to all Google account holders and a number of new features were added to the program. The search engine published the Search Engine Optimisation Starter Guide. Google's Site Search tool got an upgrade.

Yahoo

Meanwhile, Yahoo dumped its on-site search tool, Yahoo Search Builder. At PubCon, Yahoo touted improvements to open developer platform SearchMonkey. A new gallery allows users to view experimental applications and newly released local search applications Citysearch and Zagat were made available. Yahoo's open search Web services platform BOSS received press at PubCon, as well. Yahoo Search Assist has seen a number of improvements in the year since its initial release, including availability across browsers and mobile devices.

Microsoft Live Search

Over at Live Search, Project Silk Road stole the show. The Live Search API for publishers and Web developers aims to provide tools and services to help drive traffic and increase engagement. Live Search Video was updated with enhanced browsing capabilities and Microsoft announced a distribution deal with Sun Microsystems for its MSN Toolbar.

Programming Note: The SEO Newsletter continues its monthly publishing schedule and will be published on or near the 15th of each month. Adopting the monthly publishing date allows us to maintain the integrity and authority of the newsletter, while accommodating heavier travel schedules and increasing internal demands. You'll still receive the same great SEO news and educational articles, just in a thicker, more comprehensive format. Look for a return to our semimonthly schedule in the new year.


Shuffles

According to a Challenger, Gray and Christmas report, 180,000 tech industry jobs are expected to be cut by the end of the year. More than 140,000 have been cut this year to date with 70,000 jobs lost in the tech sector during the third quarter.

Those job cuts include the layoff of 1,500 Yahoo employees, 10 percent of its workforce. Early this month, LinkedIn also cut 10 percent of its workforce, or about 36 jobs.

Veoh, a video-sharing Web site, said it was going to downsize 20 percent of its workforce, or about 20 jobs. After a cut of 10 percent of its workforce earlier this year, microprocessor supplier Advanced Micro Devices cut an additional 500 employees this month, or about 3 percent of its staff. Movable Type developer Six Apart cut 8 percent of its 200-person workforce. Social media darling and customer service powerhouse Zappos also cut 8 percent of its workforce.

Lifehacker said farewell to prolific tech blogger Tamar Weinberg. Andreas von Bechtolsheim, co-founder and chief architect of Sun Microsystems, left to join start-up Arista Networks as chairman and chief development officer. Arista also announced it had recruited its new CEO, Jayshree Ullal, former senior VP of Cisco Systems's Data Center.

Gawker Media merged Silicon Valley gossip blog Valleywag into its larger media gossip property, Gawker.com. Google added four new staff members to post on behalf of the search engine on Google Groups threads.


Sound Bytes

If you like what you read in the SEO Newsletter, there's more Internet marketing expertise where that came from. Check out SEM Synergy every Wednesday at 3:00 p.m. Eastern and Noon Pacific on WebmasterRadio.fm. Bruce Clay and the other hosts discuss industry news, SEO tactics and marketing trends, while expert guests share their insights on methods, best practices and upcoming events. Check out the show schedule below for a look at recent shows and upcoming topics.

November 5 (Listen Now)
  International SEO - Europe
  Motoko Hunt
  International SEO - Asia Pacific

November 12 (Listen Now)
  Optimising Landing Pages
  Lauren Vaccarello
  Conversion Tracking of an SEO Campaign

November 19 (Coming Soon)
  PubCon Recap
  Sage Lewis
  How to Get a Maps Listing

November 26
  White Space Opened to Wi-Fi
  Jon Kelly
  News

Got something to say? Contact the SEM Synergy team and share your thoughts, comments and questions. You might even hear your question answered on the show.


Shindigs

Tomorrow is the first day of Search Engine Academy South Carolina, which runs November 18-20. The following week, ad:Tech Shanghai will be held November 25-26.

In December, the Email Insider Summit will be December 7-10 in Park City, Utah. Sister conference Search Insider Summit will be December 10-13 in the same city. Wrapping up the year, SES Chicago will be held December 8-12.

A reminder: there will be no SEOToolSet Training in December. Join us in the New Year for our world-class search engine optimisation training when the SEOToolSet training and Advanced Certification courses return on January 12-16. Also, keep your eyes peeled for new East Coast training dates.


Attaboys

In a boon to Internet and computing companies as well as Internet users, earlier this month the FCC agreed to open the white space broadcast spectrum to wireless broadband services. Social bookmarking site Delicious turned five years old this month.

Mozilla's browser Firefox reached 20 percent market share this month. Around the same time, Firefox also released a new privacy feature in its 3.1 version. A newly released Firefox plugin, Glync, integrates into a Google Webmaster Center account to track links over time.

In response to conservation concerns, online auction site eBay announced that it would ban sales of ivory beginning January 1, 2009.

The inventor of everyone's favorite space age gliding device the Segway, Dean Kamen, has joined the effort to develop technologies and create sustainable energy sources in impoverished nations.


Word on the Wire

Rumors circulated early this month that MySpace Music would finally be naming Courtney Holt, MTV Networks's executive VP of digital music and media, to the position of CEO.

Net Neutrality legislation could be passed by Congress as soon as 2009. Google may be launching Friend Connect, a social feature sharing site, soon.

In response to a report released by UCLA which proposed that searching online stimulates the brain and improves brain function, Gord Hotchkiss wondered if online searching may actually cause the brain to remap itself.



If you have any questions or comments on any of the articles above or if you would like to suggest topics for future articles, please contact us at Bruce Clay, Australia


FEATURE: PubCon 2008 Breakdown and Wrap-Up

by Susan Esparza, November 2008

The Las Vegas Convention Center was filled to the brim last week with the best Internet marketing minds on the planet. PubCon 2008 kicked off on Tuesday with a scheduled 90 sessions, three keynotes, several parties and one very hardcore networking event -- PubCon Classic, a day of networking and drinking at Hofbrauhaus.

Despite a jump in the cost of attending -- ticket prices alone were up 30 percent over the year before, not to mention the difference in airfare, accommodations and cab fare in a year that saw just about every industry fighting an oncoming recession -- attendance grew 14 percent over the previous year. It was a crowd that came interested in the basics, drawn perhaps largely by the accountability and precise targeting that Internet marketing offers. According to chatter on the floor, 80 percent of attendees were new to the show. As a result, sessions tended to be at a 101 level, without a great deal of advanced content. However, don't let that fool you. PubCon was anything but lightweight.

Back to basics SEO

The focus on the fundamentals of Internet marketing wasn't just a message for the new attendees in the audience. Panelists made sure that it was obvious to everyone that getting back to basics with your search engine marketing campaign was critical. Even Top-Shelf Organic SEO, the first search engine optimisation-specific session and one that was slated to cover "mid-level to super-advanced organic SEO," began with Jill Whalen myth-busting common SEO misconceptions surrounding what SEO is and is not.

The emphasis on knowing your entry-level search marketing was backed up on Wednesday by Google's release of a 22-page-long starter guide for SEO. Though some in the industry took issue with the advice that Google handed out, it was nevertheless a clear sign that knowing at least the basics of SEO is critical. It is also, as Matt Cutts pointed out in the final session of the conference, a clear sign that Google does not hate SEO or webmasters.

It is critical that every marketer have a solid grasp on the fundamentals of search because the game is about to change again, warned our own Bruce Clay. He boldly declared that 2009 is going to see a huge shift toward several new ways of doing search: behavioral-based, intent-based and blended search results. In particular, he predicted that blended search would come into play even more strongly.

Engagement objects

When Google introduced Universal Search, they began to say that they judged rankings based on over 200 signals, a 70-point jump over their previous statement. It is Bruce's belief that those 70 elements have not yet been fully implemented and that in early 2009, webmasters are going to have to be ready for a sudden surge in the importance of engaging visitors through the use of images, videos, audio, maps, advanced technologies like Flash and AJAX and the integration of RSS feeds. Collectively these elements can be called "engagement objects."

A main message of the conference was the reminder of Google's mission statement: to organize the world's information and make it universally accessible and useful. And they have made huge strides in that direction. Earlier this year Google announced that they were able to pull the text out of Flash files, causing great excitement among Web designers and consternation among search marketers who were already having a hard time convincing clients that an all-Flash site was not optimal. What nearly everyone missed in the commotion was that this wasn't the first such development Google had made. In the last year, Google has been working very hard at getting the content out of video and audio files too, as well as doing image recognition on all those PDFs that are simply scans of catalog pages.

Traffic is the bottom line

Naturally with the implementation of intent-based searches, demographically and behaviorally targeted searches, not to mention the blending of myriad types of content including geographically targeted content, into what was previously "ten blue links," it is becoming less and less likely that search results will look the same to any two given people for the same query. How do search marketers measure success in a world where rankings no longer can be compared accurately? The answer has to be traffic and conversion.

If you are not running an analytics program on your site, taking a baseline, learning and understanding where you get your traffic and conversions and how much you're paying for each of those conversions, you are setting yourself up for failure in the future. Google Analytics is free and contains many of the features that a paid analytics service provides. Omniture services are more comprehensive still and come with Omniture's world class support. Either way, you have to understand that it no longer matters if your site is number one on Google for your keyword term when you do a search because what everyone else is seeing may not be the same.

Link building = social media

It may be that this move toward traffic instead of rankings is what drove the other main theme of this conference. Link building has always been a huge topic at Internet marketing conferences, but more than ever, this conference highlighted that link building is coming to mean learning to play in social media marketing. The 5 Bloggers and a Microphone session on Wednesday hammered home an even more surprising message: Twitter is the new black. If you're not participating on Twitter and monitoring your company name, you're missing a huge opportunity to engage and brand your company and services.

Overall, the message of this conference was to create "halo media." As defined by Shawn Rorick in the kick-off keynote on Tuesday morning, with halo media, you create a "circle of presence" around your company using multiple channels to reach the consumer. Halo media eliminates the problem of mismatching the message to the consumer's buying cycle by focusing on giving the customer ways to experience, reference, discuss and purchase when they are ready. The user isn't asked to "click and buy." This message was reinforced over and over again: do your fundamentals, target the long tails, increase engagement and make sure that when the customer comes looking for you, you're there to be found.

For more from PubCon, check out the Bruce Clay blog for liveblog coverage of 17 sessions from Las Vegas.


For permission to reprint or reuse any materials, please contact us. To learn more about our authors, please visit the Bruce Clay Authors page. Copyright 2008 Bruce Clay, Inc.

 



BACK TO (not so) BASICS: Programming for SEOs: How to Easily Write Custom Data Extraction Scripts


Or, How to Use Python to Spider a Web Site

By Mike Terry, November 2008

Introduction

In the title of this article, the word "easily" might overstate things a little. However, as an SEO, there's nothing more frustrating than the mounds of semi-structured data we have to wade through to get to the bits useful to our job. Having access to the development systems for a site makes the job of an SEO somewhat simpler, but it is not often the case that an SEO will have such access. Usually our interface to a site is the same as anyone else's — a Web browser. Third-party site analysis tools are useful, as far as they go, but they're limited too — you're hemmed in by the vision of the tool maker.

There is a better way, but it requires biting the bullet and learning some programming. It's not for the faint of heart, but for those who put in the time and effort to acquire a few basic skills, it pays off in spades. I'll show you what I've found to be the easiest approach to mining client sites for SEO-relevant data. If you borrow the approach outlined in this brief tutorial, you'll spend less time on the parts of automated site processing that are always identical and spend more time on just the parts that vary for your site and the data you want to get.

When you think about pulling data out of a site, it's often useful to pretend it's a flat list of pages. Thinking about the site as a graph of nodes with a complex link structure usually just confuses things; it's less intimidating to think of it as a list. Of course, thinking about site data this way is only useful if you have the right tools to let you abstract away the complicated part. That's what a good Web spider framework does for you. It scans through a site, keeps track of links, tosses out duplicates, gives you a few obvious configuration options, and notifies you of interesting events so you can respond to them with custom code.
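To make the "flat list of pages" idea concrete, here is a minimal sketch of what a spider framework does behind the scenes. It is not Ruya and not this article's code: it uses only the Python 3 standard library, and the names crawl, LinkCollector and FAKE_SITE are made up for illustration. The "site" is a small dictionary held in memory, so nothing is fetched over the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, on_page, max_pages=100):
    """Breadth-first crawl that calls on_page(url, html) once per unique URL.

    fetch(url) is supplied by the caller and returns HTML, or None to skip.
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited += 1
        on_page(url, html)  # the "event" where custom code runs
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links and drop #fragments before comparing.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:  # toss out duplicates
                seen.add(absolute)
                queue.append(absolute)


# Hypothetical three-page site held in memory (made-up URLs and markup).
FAKE_SITE = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/">Home</a> <a href="b.html#top">B</a>',
    "http://example.com/b.html": '<a href="/a.html">A</a>',
}

pages = []
crawl("http://example.com/", FAKE_SITE.get, lambda url, html: pages.append(url))
print(pages)
```

The callback sees each page exactly once, in the order the crawl discovers it, which is the flat list described above. A real framework adds the parts this sketch leaves out, such as scope rules, crawl delays and redirect handling.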

If you can think of a site as a list of pages, then you only need be concerned with how to filter that list. My favorite spider framework for this is a third-party module for Python called Ruya. I looked at spider frameworks in several languages: PHP, Perl, Python, Ruby, and Erlang. In my opinion, Ruya had the easiest, best interface for an SEO's needs. It's not the fastest or most powerful crawling library, but that's not what you're looking for when doing ad hoc analysis. Instead you want to optimise for development time.

Set-up

You'll have to have a development environment, which is beyond the scope of this article. On Windows, I prefer getting Python as a part of a Cygwin install and using Emacs for my code editor. Python comes with an interactive shell feature that allows you to play around with it and learn organically. Emacs can extend the interactive shell to allow a truly blissful incremental development environment.

Once you've installed Python, the next step is to install the necessary Python modules. To find out where you should install the modules on your system, type the following into an interactive shell:

import sys
sys.path

On my system, this is the result:

['', '/usr/share/emacs/22.1/etc', '/usr/lib/python25.zip', '/usr/lib/python2.5', '/usr/lib/python2.5/plat-cygwin', '/usr/lib/python2.5/lib-tk', '/usr/lib/python2.5/lib-dynload', '/usr/lib/python2.5/site-packages']
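If the list is long, a few lines of Python can pick out the likely candidates for you. This is only a convenience sketch; the helper name site_packages_dirs is made up, and the sample data is simply the output shown above.

```python
def site_packages_dirs(paths):
    """Return the entries of a sys.path-style list that end in 'site-packages'."""
    return [p for p in paths if p.rstrip("/\\").endswith("site-packages")]


# The sys.path output shown above, copied in as sample data.
sample_path = ['', '/usr/share/emacs/22.1/etc', '/usr/lib/python25.zip',
               '/usr/lib/python2.5', '/usr/lib/python2.5/plat-cygwin',
               '/usr/lib/python2.5/lib-tk', '/usr/lib/python2.5/lib-dynload',
               '/usr/lib/python2.5/site-packages']

print(site_packages_dirs(sample_path))
```

On your own system, pass sys.path instead of the sample list.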

For these instructions, I'm going to use /usr/lib/python2.5/site-packages to house my non-standard modules. We need to install four third-party modules into this directory. Modules provide libraries of code that extend the core functionality of the language. Download each of the following into the module directory found above:

  1. BeautifulSoup - This is a module for parsing "dirty" HTML. It's critical that you can parse malformed HTML, because that's what the Web primarily consists of.
  2. Ruya - This is a module for spidering Web sites.
  3. Kconv - A module required by Ruya that converts the source of downloaded pages to UTF-8. After uncompressing, you'll be left with a folder called kconvp. Pull the folder kconv out and into the module directory.
  4. htmldata - A module required by Ruya that, like BeautifulSoup, does "tag soup" parsing. After downloading, change its extension to '.py'.

To test that you've got the modules installed correctly, type the following into the Python shell:

import BeautifulSoup
import ruya
import kconv
import htmldata

If all is well, you'll get your prompt back silently. Otherwise, an error will be printed to the screen.
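The same check can be scripted so that it reports every missing module at once instead of stopping at the first error. The following is a small sketch; the helper name missing_modules is made up, and it relies only on the built-in __import__ function.

```python
def missing_modules(names):
    """Try to import each module name and return the ones that fail."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


# The four module names from the import test above.
required = ["BeautifulSoup", "ruya", "kconv", "htmldata"]
print(missing_modules(required))
```

An empty list means all four modules were found; anything else names the modules that still need to be installed.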

Simplest Example

The following is the simplest example possible.

#!/usr/bin/env python

import ruya, logging

def aftercrawl(caller, eventargs):
  page = eventargs.document

  print 'Url: ' + page.uri.url
  print 'Title: ' + page.title

if('__main__'== __name__):
  url = 'http://directory.google.com/Top/Computers/Internet/Web_Design_and_Development/Promotion/'
  page = ruya.Document(ruya.Uri(url))
  c = ruya.Config(ruya.Config.CrawlConfig(crawlscope=ruya.CrawlScope.SCOPE_PATH), ruya.Config.RedirectConfig(), logging.getLogger())
  spider = ruya.SingleDomainDelayCrawler(c)
  spider.bind('aftercrawl', aftercrawl, None)

  spider.crawl(page)

To execute on a Unix-like system (Linux, Cygwin, Mac) do the following from a shell or terminal:

  1. Save the file in /path/to/folder/file.py.
  2. Enter cd /path/to/folder into the shell and hit enter.
  3. Enter chmod 755 file.py into the shell and hit enter.
  4. Enter ./file.py into the shell and hit enter.

If all goes well, the URL and page title of every page below the Promotion directory in the Google Directory will be printed to the screen.

You can actually do a lot of common tasks with Ruya's built-in data structures and methods. Inside the aftercrawl function, you have access to all of the following variables:

  1. title
  2. description
  3. keywords
  4. lastmodified
  5. etag
  6. httpstatus
  7. contenttype
  8. contentencoding
  9. redirecturi
  10. redirects

In addition to those, you can get the HTML of the processed page. In the next example, we'll combine that fact with the powerful HTML parser BeautifulSoup to show how you can extract absolutely any interesting item from a set of pages.
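Before bringing in a parser, here is a sketch of the kind of filtering the built-in fields alone allow. It does not use Ruya: the Page tuple is a hypothetical stand-in for the page object, only its field names come from the list above, and the sample values are made up. It also assumes that httpstatus holds an integer, which may not match what Ruya returns.

```python
from collections import namedtuple

# Hypothetical stand-in for the page object; only the field names come from the list above.
Page = namedtuple("Page", "url title description httpstatus")


def needs_attention(page):
    """Return a list of reasons this page deserves a closer look (empty if none)."""
    reasons = []
    if page.httpstatus != 200:
        reasons.append("status %s" % page.httpstatus)
    if not (page.title or "").strip():
        reasons.append("missing title")
    if not (page.description or "").strip():
        reasons.append("missing description")
    return reasons


# Made-up sample data for illustration only.
sample_pages = [
    Page("http://example.com/", "Home", "Welcome to the site", 200),
    Page("http://example.com/old", "Old page", "", 200),
    Page("http://example.com/gone", "", None, 404),
]

for page in sample_pages:
    reasons = needs_attention(page)
    if reasons:
        print(page.url + "\t" + ", ".join(reasons))
```

Inside a real aftercrawl handler, the same checks would run against the page object for each crawled page.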

Extending the First Example, with BeautifulSoup and Commenting

This script descends through every category below Promotion in the Google Directory, printing a tab-delimited line for each directory link and the page it was found on:

#!/usr/bin/env python
# The line above is the "shebang" line: when running from the command line, it tells
# the system to use Python to interpret the code. It must be the first line of the file.

# Loads the built-in modules and 3rd party modules we installed earlier
import ruya, logging, re
from BeautifulSoup import BeautifulSoup

# This is an event handler. It's called automatically by Ruya whenever a new page
# has been crawled and gives us an opportunity to run custom code.
def aftercrawl(caller, eventargs):
  # Ruya has parsed the current page into a data structure. Here we get a reference
  # to it.
  page = eventargs.document
  # Get the URL of the current page.
  uri = page.uri.url
  # Use the fantastic BeautifulSoup module to parse the current page's source code into
  # a data structure we can use to easily gain access to all its elements.
  soup = BeautifulSoup(page.getPlainContent())
  # The next few lines are the hard part. Use BeautifulSoup's API to get each directory link on the page.
  # Keep only <a> tags that have an href, and whose href doesn't mention 'google'.
  def isDirectoryLink(tag):
    return tag.name == "a" and tag.get('href') and 'google' not in tag['href']
  tables = soup.findAll('table')
  eachTag = tables[-5].findAll(isDirectoryLink) + tables[-4].findAll(isDirectoryLink)
  # Output each link.
  for thisTag in eachTag:
    print uri + "\t" + thisTag['href']

# This is the first routine that's called when the script is run.
if('__main__'== __name__):
  # The starting URL.
  url = 'http://directory.google.com/Top/Computers/Internet/Web_Design_and_Development/Promotion/'
  # The necessary initialization stuff.
  page = ruya.Document(ruya.Uri(url))
  # This line configures and initializes Ruya. 'levels' is how deep to spider; the
  # default is 2. 'crawlscope' controls which pages in the "directory structure" to
  # follow. Set to SCOPE_PATH, the crawler only accesses links at and below the
  # starting folder.
  cfg = ruya.Config(ruya.Config.CrawlConfig(levels=10, crawlscope=ruya.CrawlScope.SCOPE_PATH, crawldelay=0), ruya.Config.RedirectConfig(), logging.getLogger())
  spider = ruya.SingleDomainDelayCrawler(cfg)
  # This is how you associate your custom code to Ruya's built-in events.
  spider.bind('aftercrawl', aftercrawl, None)

  print "Source Page" + '\t' + "Found URL"
  # And here we start the spider.
  spider.crawl(page)
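The script above builds each output line by joining strings with a tab. If a value could ever contain a tab or a quote, the standard csv module is a safer way to write the same report. The sketch below is Python 3 and separate from the Ruya script; the function name write_link_report, the sample rows and the links.tsv filename are all made up for illustration.

```python
import csv
import io


def write_link_report(out, rows):
    """Write (source page, found URL) pairs to out as tab-delimited text."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["Source Page", "Found URL"])
    for source_page, found_url in rows:
        writer.writerow([source_page, found_url])


# Made-up rows for illustration; a real run would collect them in aftercrawl.
rows = [
    ("http://example.com/dir/", "http://example.org/"),
    ("http://example.com/dir/sub/", "http://example.net/page"),
]

buffer = io.StringIO()  # swap for open("links.tsv", "w", newline="") to write a file
write_link_report(buffer, rows)
report = buffer.getvalue()
print(report, end="")
```

The resulting text opens cleanly in a spreadsheet, with one column per field.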

Summing Up

The examples given here are pretty simple, but that's part of the point. For an SEO's needs, programmatically crawling a site can be easy enough to save a lot of time under many circumstances. Another thing to note about the examples here is that they're necessarily drained of blood. In a real scenario, you'd be filtering for, e.g., every link in a section with the rel attribute set to "nofollow" or a list of pages whose Meta Descriptions contain one of the phrases from a predetermined list.
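As a rough illustration of those two filters, the sketch below pulls nofollow links and the Meta Description out of a single page and checks the description against a phrase list. It is an assumption-laden stand-in, not the article's Ruya and BeautifulSoup code: it uses only the Python 3 standard library, and the class name SeoFacts, the helper matching_phrases, the sample HTML and the phrase list are all made up.

```python
from html.parser import HTMLParser


class SeoFacts(HTMLParser):
    """Collects nofollow link targets and the meta description from one page."""

    def __init__(self):
        super().__init__()
        self.nofollow_links = []
        self.meta_description = ""

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            # rel can hold several space-separated tokens, e.g. "external nofollow".
            rel_tokens = (attrs.get("rel") or "").lower().split()
            if "nofollow" in rel_tokens and attrs.get("href"):
                self.nofollow_links.append(attrs["href"])
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.meta_description = attrs.get("content") or ""


def matching_phrases(description, phrases):
    """Return the phrases that occur in the description, ignoring case."""
    lowered = description.lower()
    return [p for p in phrases if p.lower() in lowered]


# Made-up page source and phrase list, for illustration only.
SAMPLE_HTML = """
<html><head>
<meta name="description" content="Cheap widgets and free shipping on every order.">
</head><body>
<a href="/about.html">About</a>
<a href="http://example.com/partner" rel="external nofollow">Partner</a>
<a href="/login" rel="nofollow">Log in</a>
</body></html>
"""

facts = SeoFacts()
facts.feed(SAMPLE_HTML)
print(facts.nofollow_links)
print(matching_phrases(facts.meta_description, ["free shipping", "best price"]))
```

Run once per crawled page from inside an event handler, checks like these turn the spider's flat list of pages into the filtered list you actually care about.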

Unfortunately, I can't turn you loose on some poor, random webmaster's site to experiment with automated robots, and generic high volume sites like Google Directory aren't quite as interesting SEO-wise. Nevertheless, a little imagination makes the potential clear. With a few sprinkles of custom logic, you can extract just precisely what you need from a site to get any job done.

