ted serbinski – entrepreneur & web architect
Automatically Extracting Tags from Nodes

Automatically tagging content is becoming easier with services like OpenCalais and Yahoo Terms Extractor, which offer free APIs for semantic analysis of content. There’s even a great Drupal module, Auto Tagging (with a great writeup on usage), that ties these services together and makes it even easier.

However, there is still one common issue with these services: they need well-written, rich, keyword-dense articles to produce logical, semantic tags.

Try any of those services with user-generated content and you’ll see a common tag each time around: FAIL.

We experimented with over 20,000 pieces of content on MothersClick, and the results showed that these semantic services weren’t producing relevant, quality tags: we were getting few, if any, relevant tags for our user-generated content.

After a little more trial and error, I noticed a simple pattern: more often than not, the title of a user’s post had more applicable keywords for what the post was about than the body did.

So how do we extract just the keywords from the title of a node and turn them into tags?

Well, taking each word in the title could work, but that would also include a bunch of words like “a”, “an”, “the”, “with”, etc. The more technical term for those words is stop words, and luckily, with some Googling, you can find some nice stop word lists for filtering them out.

Attached below is a simple JavaScript function for removing stop words from a string. Once the stop words are removed, you’re left with an array or string of keyword candidates. Plug this into your tagging system and you’re off to a nice set of automatically generated tags. While not perfect, for user-generated content these do work fairly well.
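
To make the idea concrete, here is a minimal sketch of title-based keyword extraction. It is not the attached stopwords.js file: the tiny STOP_WORDS list, the isStopWord stand-in, and the extractKeywords helper are all assumptions made for illustration, and a real stop word list has hundreds of entries.

```javascript
// Hypothetical, heavily shortened stop word list (illustration only).
var STOP_WORDS = ["a", "an", "and", "for", "how", "in", "my", "of", "the", "to", "with"];

// Stand-in for the attached function: true if the lowercased word is a stop word.
function isStopWord(word) {
  return STOP_WORDS.indexOf(word) !== -1;
}

// Hypothetical helper: split a title into lowercase keyword candidates
// with punctuation and stop words removed.
function extractKeywords(title) {
  var keywords = [];
  var words = title.toLowerCase().split(/\s+/);
  for (var i = 0; i < words.length; i++) {
    // strip anything that is not a letter, digit, or underscore
    var word = words[i].replace(/\W+/g, "");
    // skip empty leftovers (e.g. a word that was only punctuation) and stop words
    if (word !== "" && !isStopWord(word)) {
      keywords.push(word);
    }
  }
  return keywords;
}
```

With this short list, extractKeywords("Tips for traveling with a toddler") returns ["tips", "traveling", "toddler"]. Duplicates are not removed in this sketch, so a tagging system that cannot handle repeated tags would need an extra check.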

If you’re using Drupal and the Active Tags module, you can use the following code to automatically insert these suggested tags for the user as they create a piece of content.

Note: there is some extra code that strips non-alphanumeric characters and makes everything lowercase as well; this could be removed or changed based on your site’s requirements.
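
As a rough illustration of how that cleanup step might be adjusted, the sketch below compares the strict stripping used in this post with a looser variant that keeps hyphens. Both function names are hypothetical. Note that JavaScript's \W treats everything outside A-Z, a-z, 0-9, and underscore as a non-word character, so accented letters are stripped too.

```javascript
// Same approach as the post's code: drop every non-word character, then lowercase.
function normalizeStrict(word) {
  return word.replace(/\W+/g, "").toLowerCase();
}

// Hypothetical looser variant: keep hyphens inside words, e.g. "co-sleeping".
function normalizeKeepHyphens(word) {
  return word
    .replace(/[^\w-]+/g, "")  // drop everything except word characters and hyphens
    .replace(/^-+|-+$/g, "")  // trim hyphens left at the start or end
    .toLowerCase();
}
```

For example, normalizeStrict("Co-Sleeping?") gives "cosleeping", while normalizeKeepHyphens("Co-Sleeping?") gives "co-sleeping". Which one fits depends on how your site's existing tags are written.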

/**
 * Automatically determine Drupal taxonomy tags based on the user entered form title.
 */
Drupal.behaviors.autoTag = function() {
  var tagged = false;
  $("#edit-title").blur(function() {
    var title = $.trim($(this).val());
    // only process this once to prevent tag oddness,
    // and don't use up that one pass on an empty title
    if (!tagged && title !== "") {
      // split on any run of whitespace so extra spaces don't create empty words
      var words = title.split(/\s+/);
      $.each(words, function(i, val) {
        // strip anything non-alphanumeric & make lowercase
        val = val.replace(/[\W]+/g, "").toLowerCase();

        // skip empty leftovers (e.g. a word that was only punctuation);
        // if this isn't a stop word, add it as a tag
        if (val !== "" && !isStopWord(val)) {
          // add this to the first Active Tags enabled field
          activeTagsAdd(Drupal.settings.active_tags[0], val);
        }
        activeTagsUpdate(Drupal.settings.active_tags[0]);
      });
      tagged = true;
    }
  });
};

We have found that this simple technique makes it noticeably easier for our users to tag their content, which, when you consider the busy mom lifestyle, is a feat in and of itself! :)
Attachment: stopwords.js.txt (8.21 KB)
posted 20 Nov 2009
  • drupal
  • jquery
  • tags

1 comment

#1
greggles wrote 40 weeks 5 days ago

I’m glad to see this review. I was about to embark on writing yet-another module and am glad to know that the Auto Tagging module exists so that I can just plug into that.

One other module to consider is http://drupal.org/project/inform


Code examples and downloadable zip files of code are licensed under a Creative Commons License.
All other content, unless where noted, ©2010 Theodore Serbinski. All Rights Reserved.