NielsenData

Mind Tricks for Business - Atomic Data Model makes Search Engine Dominance Possible...

March 30, 2010 at 7:01 PMJared Nielsen

Atomic Data makes search engine dominance possible

Online retail is not the same as brick and mortar retail.  When a brick and mortar store launches online they fall into this biggest trap.  Take an apparel shop… when you first walk in you find a men’s department and a ladies department.  The store is physically trying to demographically segment you.

If you create a data model that matches this, you will end up with the first <xml> node being <gender> which is a highly limiting path to follow for a search engine even though it may make the most sense for a human being.  You would then add data for teams, sports, colors, sizes, variants, materials of manufacture, and many other “parameters” for this data.  To avoid 3rd normal database limitation, you would start to peel this data out into separate tables… one for colors… one for teams…one for sports.  Then you would need to create many-to-many crosslink tables.  Over time, your table count just gets larger and larger as new needs arise.

The Root Object Classification

There is certain data that “hangs” off each sub-classification.  In this example the Item class stores who the manufacturer is (because most items have manufacturers).  The Apparel class contains the style information (because style is global to all apparel objects), whereas the Shirt class contains collar styles, sleeve variants, etc.

By localizing this information to class levels, once I define a “field” for the Apparel class, all future objects that inherit from that class will inherit that field.  Any objects that do not inherit from the Apparel class will not have the field at all.

Note how different this is from a traditional 3rd normal representation of data where we would have fields like “color1” and “color2” and “color3” simply to leave enough fields available just in case we might need them for a particular product application.

Maximum Flexibility for Customer Paths

Now that our data is structured with infinite flexibility while still retaining a core hierarchy (for default navigation purposes), when a customer walks into our store, we can simply ask Google “how they sent them” to us… and what keywords they used.  Now when the customer enters our “store” we can toss all of the inventory up into the air and literally rebuild our store to match the words they used in the order they used them.  Now they can enter as “ladies yellow tank top” and we structure our product data in terms of gender first, color next and product class third… but we also can welcome customers that ask for “white womens Nike shirt” which we do by scanning for aliases of class nodes, parent classes, and other permutations of the item for maximum comfort to the customer and higher conversion rates on sales.

Know a business that would benefit from our whitepaper on how Atomic Data Modeling can make search engine optimization possible?  Download it now:

02-Atomic-Data-Enables-Search-Engine-Dominance-by-FUZION.pdf (369.99 kb)

Jedi Mind Tricks for Business - "Luke... I am your Father" - SQL Recursion and Hierarchical Data Models

January 22, 2010 at 9:53 PMJared Nielsen

Covering topics from recursion in table valued functions, hierarchical data models, and identical node naming in XML hierarchies to fifth normal notation in data structures, CLR Stored Procedures, and many more topics specific to SQL Server 2008 and XML with C#.net programming, this lecture continues the popular Jedi Mind Tricks for Business series by Jared Nielsen at the South Florida Code Camp at the following location:

South Florida Code Camp 2010 - http://www.fladotnet.com/codecamp 

Devry University
2300 SW 145th Avenue
Miramar, FL 33027


View Larger Map

Jared Nielsen is an industry veteran with several decades of experience in sports marketing venues, business to business (B2B) commerce projects, and business to consumer (B2C) e-commerce and content management systems. He has been the lead software developer for the ATP Tour (Men’s Professional Tennis and Women’s Professional Tennis), Director of Business Intelligence for Football Fanatics (TeamFanShop), technical partner to Cook Marketing and Communications (for the Jaguars and Falcons contract), and now invests in online ventures such as Sports Mania (4 brick and mortar retail store locations), Team Sports Fan (http://www.teamsportsfan.com/), and other activities. His high profile projects include large projects for Yahoo! Sports, Interline Brands, and Big O Tires. He is a frequent lecturer and is always open to seminars and speaking engagements.

Call me today!
http://www.fuzion.org/
904-638-2455

Atomic Data Modeling and SEO Speech in Miramar Florida

June 24, 2009 at 7:53 PMJared Nielsen

I'm pleased to be speaking to the Miramar group of the Florida Dot Net group at www.FlaDotNet.com.  You can register for this event at the following website:  Click here to register.  I will be discussing how proper search engine capabilities start at the database level using atomic data modeling practices.  The samples of the atomic data model will include how to layer in object inheritance at the SQL Server level, utilizing some new features in SQL Server 2008 including the intrinsic Hierarcy data type and a nice overview of search engine techniques that can benefit from a highly optimized and atomic database.  I hope to see you there!

You can get a head start by reading my blog series on the topic at:

www.NielsenData.com - Atomic Data - Best Business Practices for Product Catalog Data

There are other resources that ascribe to the Atomic Data Modeling concept which you can find at:

Zimbio.com - The Atomic Data Warehouse

Wikipedia.org - Data Warehousing and the use of Atomic Data within the Data Mart

Other announcements of this event include:

Atomic Data Modeling and SEO Speech in Miramar Florida

June 24, 2009 at 7:53 PMJared Nielsen

I'm pleased to be speaking to the Miramar group of the Florida Dot Net group at www.FlaDotNet.com.  You can register for this event at the following website:  Click here to register.  I will be discussing how proper search engine capabilities start at the database level using atomic data modeling practices.  The samples of the atomic data model will include how to layer in object inheritance at the SQL Server level, utilizing some new features in SQL Server 2008 including the intrinsic Hierarcy data type and a nice overview of search engine techniques that can benefit from a highly optimized and atomic database.  I hope to see you there!

You can get a head start by reading my blog series on the topic at:

www.NielsenData.com - Atomic Data - Best Business Practices for Product Catalog Data

There are other resources that ascribe to the Atomic Data Modeling concept which you can find at:

Zimbio.com - The Atomic Data Warehouse

Wikipedia.org - Data Warehousing and the use of Atomic Data within the Data Mart

Other announcements of this event include:

Atomic Data - Best Business Practices for Product Catalog Data Structures - Part 1

October 29, 2008 at 9:42 AMJared Nielsen

This is the first installment in a series that blends website architecture, data structures, and SEO marketing into a collaborative design pattern.

Designing a product catalog is one of those "better get it right" projects that any e-commerce firm faces.  When you discuss lifespans of projects, this one has the longest lifespan of them all.  Since I've been through this a couple of times, I thought I would share my thoughts and designs as I delve into yet another one.

There are a lot of political and technical pressures put on a product catalog from many departments within an organization including IT, Marketing, Executive, Operations, and particularly the "Industry Expert" within any company.  It is important to not only recognize them, but to appreciate them.  At the end of the day, almost everyone is "right" in their desires to have the catalog data serve them in a certain way.  As you put yourself in their shoes by doing a proper discovery before you start designing you should try to not only understand what they want, but why they want it.

Atomic Data

Your marketing team will call this "flexibile product information", your IT team may call this "dynamic product data", but at the end of the day, it's product data that is smashed into all of its discrete component pieces.

This is one of the first pressures that will be placed on you and you need to be prepared to deal with it properly.  It is important to understand that there is a competing struggle in any database design... Flexible vs. Fast.  If you think of a product as a construction made from legos, then the properties of those products are the individual lego pieces.  The concept of "atomicity" means that you can assemble your lego construction with Red, Blue and Green legos to make a space ship... and then you can rearrange those same Red, Blue and Green legos and build a house.

Now you've all seen the non-atomic way of building a product.  It's a row in a product table and it tends to look like this:

 

You are limited however when you decide to stock a product that has a "Sub Sub Type", or a product that only has one color, or a product that has two vendor brands on it.

You also have a design flaw where you are "numbering instances" of properties.  In this case "Color1" and "Color2" are going to cause problems for you when you want to search by "Color".

There is also a failure to properly "atomize" the data with things like "SubDept" being equal to "Ladies Apparel".

Let's compare this model to one that is fully "fourth normal" or highly "atomic".

 

Lets analyze this model.  The product is statically registered in a much abbreviated product table.  It serves now primarily as a hook that you can hang things from.  We've decided to establish all of our atomic types as "Type", "Gender", "Vendor", "Brand", and "Color".  You can see how this can be reused.  For the "Live Strong Velocity Ladies Sport Top" it makes sense that Color (to this product) "means" White and Yellow... but to other products the same property of "Color" could "mean" other colors.

You can also see the intrinsic hierarchy here that establishes "Apparel" as a "top category" over "Top" and likewise, "Top" as a parent category over "Tank Top".  This enables you to still utilize hierarchies in your product data representations while granting you also the ability to search ad-hoc through your product data in a non hierarchical manner by using the raw properties.

 I have taken an apparel data model and created a good sample of how the property to product mappings for a decent catalog could be structured:

 

This model describes the relationship between products and properties but also illustrates some of the intrinsic relationships between the properties themselves.  For example, if you mapped a City to a product, you could "infer" what State and Country relationship existed by recursing through the Property-to-Property relationships.

So... which data model is right?  The answer could likely be ... Both!  It really depends on your requirements which we will discuss in Part 2 - Best Business Practices for Product Catalog Data Structures - Speed versus Flexibility.

  

Posted in: e-Commerce | Research Laboratory

Tags: , , , , , , , ,

Edge Caching Versus Dynamic Data - Best Practices for Product Catalog Data Structures - Part 2

October 29, 2008 at 9:34 AMJared Nielsen

This is the second installment in a series that blends website architecture, data structures, and SEO marketing into a collaborative design pattern continuing from Part 1 - Best Business Practices for Product Catalog Data Structures - Atomic Data 

We've discussed some ways you can create highly discrete or "atomic" data for a product in the first article.  This article will delve into how to evaluate the choices involved in speed versus flexibility.

Any database administrator that works on a high volume, high production website will simply start to quiver uncontrollably however, because there are severe implications for accessing this type of data scattered throughout several tables in a production environment.  Pass him a mug of decaf and let's walk together through how we can tackle the thorny issue of speed related to product catalog data.

We can start with our sample product that we have now mapped into its discrete elements.

 

This data is fairly granular (or atomic) and is highly reusable within its domain ("Color" categorically means a similar thing to every product that is bound to it).  There are many considerations when it comes to allowing Speed to dictate your design, but I'll list some of the top ones:

  • Static Edge Presentation vs. Dynamic Source Presentation
  • Precomputation or Data Summarization
  • Staged Caching or Static Publishing

Static Edge Presentation 

Static Edge Presentation refers to the concept that data that is requested through web pages goes through many stages.  One model that many people are familiar with is the following:

 

Generally when the first hit is generated for a distinct URL, such as http://www.domainname.com/?ID=5, the Data Server generates the data needed for the page, the Origin Web Server composes the data into a functional web page, and then the Edge Cache Server distributes that origin page into its "cache" where the unique page sits in "static" for all subsequent hits.  If the page is requested from hundreds of Client PCs after that, only the Edge Cache Server responds to the request (until its cache expires).  If a single Client PC hits refresh over and over again, depending on the Client PC settings, the page is instead served from the Client PC's Browser Cache, which is a local equivalent of server-based edge caching.  This is generally one of the more advanced methods of serving high volume pages in a fast manner (and in a way that the database is impacted the least).  This is the preferred shield which allows your data structures to be a bit more complex (read slow), because at the price of the initial render, the cost per page load is mitigated by the Edge Caching.

Take a page that requires 8 seconds to load.  This is generally considered "too heavy" of a page to be used in production environments.  However, this is only the Origin Page Render cost, meaning it only "costs" this much time for the very first load of that unique page.  If all subsequent page loads only take 0.5 seconds from the Edge Cache for all subsequent hits, then averaged over the numbers of hits, you can quickly see how the page load time continues to approximate the 0.5 seconds load time overall for the page.

Another model is the Dynamic Rendered Page which is far more common to most web developers and online businesses:

 

This model demonstrates the direct nature of the requests from the Client PC, straight to the Origin Web Server (which gets its data from the Data Server).  In this model, there is generally a one-to-one relationship between the "hit" and the "data request", so the load on the database server is relatively high.  There are tricks you can use to ameliorate this, including Origin Server Caching, SQL Dependency Caching, and other methods, but most implementations use this form of dynamic page delivery.  In this case, data structures that cause delays can severely impact the performance of the application.

 Take a page now, which due to its flatter data model, only costs a 3 second load time.  Because the Edge Cache has been removed from the architecture, your average page load time is going to remain 3 seconds (the page construction happens over and over again for each hit).  While you gain some flexibility by having constantly changing data available on the page, you pay in the overall load on your servers (up to six times more costly in time than an edge cached solution), and you also are forced into a far less flexible data model to compensate for the speed requirements of live rendered pages.

Precomputation

The concept of pre-computation is based on a similar concept as caching.  This means that pretty much anything your database is going to need to "think" about, can in many cases be "pre-thunk."  The art of pre-thinking things before they are needed involves storing what's been thought out and saving it somewhere.  You also have to factor in the speed of retrieving things... some methods of storage are faster than others.

The diagram below (Self Healing Data Retrieval) shows the "layers" that a data request goes through before a page can be rendered.  It's pretty clear that the fastest way to get data to the customer is when the customer asks for a webpage that has been "pre-thunk" already and is waiting in cache at the Edge Cache (Akamai for example).  Here's where the magic happens.  If the page is not available in cache, the Edge cache forwards the request to the Webserver.  The Webserver then can not only generate the page, but it "heals" the Edge Cache by delivering the new page so any subsequent hits to the same page are now "healed" and available on the Edge Cache again.

 

This type of failover I described above cascades all the way up to the top.  In the examples above, if the Edge Cache fails, the Webserver picks up the slack.  If the Webserver fails, then the Method Farm system checks to see if it has an XML representation of the data in memory (extremely fast).  If the Method Farm doesn't have it in memory, then the Edge Net Storage picks up the slack.  If the Edge Net Storage doesn't have the data, then the Method Farm checks to see if it has it saved in a file on the hard drive (pretty fast).  If the Method Farm doesn't have it written to disk, then the SQL Server attempts to pull a static, pre-generated copy from a static table.  If the static table doesn't have the data, then the SQL server regenerates the data.  In general the failover escalation follows this model:

  1. Edge Cache Static Copy
  2. Webserver XML
  3. Method Farm Memory XML
  4. Edge Net Storage XML
  5. Method Farm XML from file
  6. SQL Server Static Record
  7. SQL Server Dynamic Generation from data

In any of these cases, each step is design to "repair" the previous caller that failed.  This ensures that over time, the vast majority of requests are being serviced by the Edge Cache Server and approaches near 100% availability. 

Static Publishing

The last method of high volume, high speed retrieval of web pages that can help reduce load on database systems is the Static Publishing technique.  This means that without waiting for for a user to request a page, the system is designed to "spit out" every single possible page and page combination that could possibly be hit and this entire pile of page data is dumped onto an edge cache somewhere.  There is certainly some value to this, particularly for legacy media archives and other non-dynamic, and non-live page data, but it's use is extremely limited in the e-Commerce arena. 

This highlights to some degree the ways in which network and publishing architecture can drive decisions of data structures in general.  If you choose a more normalized method of data structure, then you need to compensate on the performance side with effective edge caching.  If you choose a more dynamic method of page delivery, then you need to look more toward a flatter, more static form of data model that can deliver the performance that you need.  Many database administrators will tell you that the atomic data model listed above (Sample Product to Property Association) may be too normalized for high volume use, but if the data being accessed is used to serve up pages for an edge cache architecture, the negative is eliminated.

It is important to factor in all of the requirements of your web project before making final data architecture decisions, but it is important to note that deficiencies in one decision (choosing a more normalized data structure) can easily be offset in other ways (choosing edge caching over dynamic page construction).  This may give you more freedom as you make your data structure and architecture choices.

Now that you have evaluated your choices of data models and a highly normalized method is a good architectural choice for your situation, it's prudent to examine the benefits of what the data model will enable you to do.  We will examine some of these benefits in Part 3 - Best Business Practices for Product Catalog Data Structures - Customer Paths.

  

Customer Paths - Best Business Practices for Product Catalog Data Structures - Part 3

October 29, 2008 at 9:31 AMJared Nielsen

This is the third installment in a series that blends website architecture, data structures, and SEO marketing into a collaborative design pattern continuing from Part 2 - Best Business Practices for Product Catalog Data Structures - Speed vs Flexibility 

Many e-Commerce projects begin with an existing brick and mortar store that has decided to go online.  This means that certain data models and business processes can be inherited from the legacy business processes of a non-online environment. 

If you were going to open a physical, brick and mortar store, you would generally design the store based on "Customer Paths", meaning you would examine the vector that a customer would take upon entering your store so you could direct them along the shortest path (in certain cases) to where they were trying to go to find the product that they wanted.  Many websites are designed along a similar path but the application of brick and mortar strategies to websites may not be the most effective.

Take for example the concept that an apparel store is designed along the Customer Path strategy of Departments, Aisles and Shelves.  An apparel store would generally have a Ladies department, with a Shirts Aisle and a Tank Top Shelf.  It would make sense from a Customer Path perspective to have (female) customers enter, segment them by Gender as they walk to the Ladies department, further segment them by Type as they walk to the Shirts Aisle, and further segment them by Type as they scan the Tank Tops Shelf.

This seems to work in practice, but only as long as you can only have a single store.  Take a customer now that is female but instead wants the Nike Shirts section.  Your demographic segmentation Customer Path does not cater to them properly and so the Customer is forced to scan through all shelves that have Shirts in order to find the Shirts that match the Nike Brand.  You can see how relying on a fixed hierarchy limits your store planogram and structure in a very singular manner.  To experiment with alternate Customer Paths, you would be forced to do a hard store reset, or you could experiment with alternate locations... perhaps a Nike Store which would provide a Brand-based alternative for the Brand-conscious customer.

Imagine now a website where instead of a fixed store with a rigid, hierarchical structure of Departments, Aisles and Shelves, you had a completely dynamic store that could be rebuilt in an instant and individually for each customer that entered for their own, private shopping experience.  Imagine also, those fixed Aisles and Shelves full of product, which instead of sitting in fixed placements, when a Customer entered the store the entire inventory was tossed into the air, only to fall back in the precise order that the Customer wanted to see them in upon entering.  This is no fantasy in an online e-Commerce website where this type of flexibility is possible.

Let's take a look a the Customer Path options open to an e-Commerce Apparel customer:

 

If you recall the Product to Property Mapping diagram shown in Part 2 - Best Business Practices for Product Catalog Data Structures - Speed vs Flexibility, you will see some of the same Property mappings in the above diagram.  These help to illustrate the product being mapped within the data model along the Customer Preference Paths instead of a fixed hierarchical model that a traditional brick and mortar store operator might follow.

For example, a customer that may be more interested in Tour de France could be immediately segmented in a store with inventory sorted by the Event Property first.  Then, if the customer was interested in the Brand Property next, the inventory would be tailored to suit by showing Nike merchandise.  Finally as the customer settled on a Tour Property related Product with UCI Pro Tour branding, the final product match is easily found because the inventory re-sorted itself to match the preconceived desires of the newly arrived customer.

Similarly, a customer that was more interested, at the time, in Lance Armstrong and then Tank Tops and then a color selection of White, could follow the Customer Path of Player / Type / Color.

You can see how the model continues.  Take some time to evaluate your own design process when you created your categorization model for your e-Commerce storefront.  Think about the process you went through as you decided on the model and see if you were trying to adapt a brick and mortar model to one that could have been conceived with an online presence in mind from the start.  If so, this may help guide you along a fresh look at the construction of a new categorization schema for your online e-Commerce catalog.

The series continues in Part 4 - Best Business Practices for Product Catalog Data Structures - SEO Path Aliasing

Posted in: e-Commerce | Research Laboratory

Tags: , , , , , , , , ,

SEO Path Aliasing - Best Business Practices for Product Catalog Data Structures - Part 4

October 29, 2008 at 9:28 AMJared Nielsen

This is the fourth installment in a series that blends website architecture, data structures, and SEO marketing into a collaborative design pattern continuing from Part 3 - Best Business Practices for Product Catalog Data Structures - Customer Paths.

It may seem counterintuitive to discuss search engine optimization (SEO) techniques in the midst of a conversation about data structures, architecture diagrams and in-store plan-o-grams, but it can directly relate to your choice of data models.  As we discussed in the previous article, it is important to structure your website to conform with the needs of entering customers in a way that segments them properly so they find the things that they were searching for.  Part of this is anticipating what a customer is going to want before they enter your store. 

When dealing with search engines, there are two customers to contend with... the "Natural" search engine... and the "Paid" search engine.  These two customers are very important to understand and to distinguish and need to be treated with a deference and distinction from the "real" customers that frequent your online store.  The complexity arises to some degree because these two "customers" happen to be "ghost shoppers".  You never know when they are going to arrive and they generally float through your store much like a customer would, but they are searching for every product on every shelf in every aisle and in every department... all at the same time.  The complications continue because you want to manage what the ghost shoppers can and cannot see so they don't memorize portions of the store that you don't want reported on the search engines.  This may come across as elemental theory to an SEO expert, but in the context of blending SEO concepts, architecture and data structure modeling, it illustrates one aspect of the equation.

Imagine now that you are a search engine, whose job is to find, identify and classify billions of e-commerce pages throughout the Internet with the primary objective of finding pages that are considered "relevant."  I quote the term "relevant" because what that precisely means changes with the breeze and the whim of arcane departments of voodoo at the various search engine optimization firms.  With that said, you want to look at a natural search engine as a stream of water pouring into your website.  This stream is going to remember whatever it touches, so you want to ensure that it finds the things that you want it to see.  You also need to consider the diffusion of the stream of water as well.  Don't let the natural search engine stumble across pages like "Privacy Policy" or "Terms & Conditions" as that won't deliver any tangible benefit for you.  In similar fashion, on your landing pages you should try to structure your site so the links that are the most compelling draws for the majority of natural searching customers should be setup to receive the largest stream of natural search "attention." 

You also need to anticipate every possible combination of keywords that would be used to "land" on any given destination.  Lets take a look at the SEO Path Aliasing diagram to illustrate that:

 

We have already covered Customer Paths but sometimes the proper "path name" doesn't match an actual English phrase.  This means that the combinations of words that make sense for categorizing a mix of products may not make linear sense for a keyword search.  Our diagram above illustrates this with the green path of "Ladies / Nike".  There may not be many customers that would enter that phrase in a search, but it may be a logical progression as they navigate through a website.  This is where Aliased Paths come in.  In our example, the Aliased Path for "Ladies / Nike" could be "Ladies Nike Apparel"... sure this one is a bit of a stretch...  I'm not sure how many actually type in the word "apparel" but you'll need to work with me on this one.

You will note that this path is identified as "overridden".  In smaller e-Commerce websites, it may be a simple matter to manually go through each Customer Path and identify the possible Aliases but in far larger catalogs this quickly becomes a daunting task.  It doesn't mean that overridden Path Aliases aren't an important part of configuring your catalog categorization scheme, but you can, for the most part, rely on the auto-generated Path Aliases for many of the Customer Paths in your catalog.  Take the path "UCI Pro Tour / Tank Tops" which easily converts to an English text keyword search of "UCI Pro Tour Tank Tops". 

Note also our attempt to focus the "stream" of the natural search flow throughout the various Customer Paths.  Many search engines respond to a setting within the hyperlinks of a "NOFOLLOW".  This mechanism gives you some measure of control over which links you allow the natural search "probing" to find.  You will note how the various Customer Paths are identified as NOFOLLOW for those paths that we want the search engines to pass on as they traipse through our pages.  This poses another logistical issue in a large-scale e-Commerce website which we will address in the next segment, Part 5 - Best Business Practices for Product Catalog Data Structures - SEO Weighted Auto Mapping

SEO Weighted Auto Mapping - Best Business Practices for Product Catalog Data Structures - Part 5

October 29, 2008 at 9:25 AMJared Nielsen

This is the fifth installment in a series that blends website architecture, data structures, and Search Engine Optimization (SEO) marketing into a collaborative design pattern continuing from Part 4 - Best Business Practices for Product Catalog Data Structures - SEO Path Aliasing.

We have discussed custom-tailoring a website's NOFOLLOW and Path Aliases to tightly tune the "stream" of natural search flow throughout your website.  By tuning what the search engines "see" you will be able to help your search engine rankings climb for the pages that you most care about.  In large scale e-Commerce platforms, it becomes an onerous task to keep up with all of these customizations.  Here is another case where your choice of an atomic data model will serve to automate some of these functions.

Let's examine the following model of SEO Weighted Auto-Mapping:

 

Here is a scenario where we assign "weights" to the various Property nodes that can be mapped to Products.  Once the weights are assigned, we can develop custom business rules that will help us "scale up" or "scale down" our Weighted Path's sensitivities to the search engines (through the use of NOFOLLOW tags).

We can roll back to the original case of a standard brick and mortar store that was the basis of our e-Business (for example).  In a traditional brick and mortar business, let's say that we determined that in general, segmenting our customers by Gender tended to be the most common and most popular means of diverting our customer traffic.  This could give us a clue on our e-Commerce website on what weight to assign to the Gender Property.  Since this Property holds primacy over the rest of the Properties in our categorization scheme, we could assign it with a high "weight" value.

Take our example above where we have decided that the Player Property is the highest ranking "Path" starting point in our categorization schema.  This is essentially because, in the cycling apparel business, Lance Armstrong (the keyword phrase) drives a significant portion of our prime traffic.  It also tends to be a highly competitive term that we would like a high search ranking for.  Additionally, it is a phrase that we would like to channel a lot of natural search traffic through, even to the exclusion of other lower performing phrases that have a significantly lower revenue opportunity.  For this, we assign the Player Property (regardless of the specific Player identified) a weight of 10.  This means that a customer that "lands" on the Lance Armstrong Player landing page who directly orders a product is defining the primary Customer Path that we are interested in promoting and that path gains a score of 10 / 1 (hop) which averages out to a 10 (no surprise).  Any links to this particular URL do NOT receive the NOFOLLOW parameter and the natural search engines will stream most of their energy through links like this.

We also have the option of defining our business logic for what rules we want to apply.  One example is how we set the threshold for NOFOLLOW parameter placements.  We have decided in the above example to set NOFOLLOW parameters on any Customer Paths that rank less than an average of 10.  Effectively we are deciding that we want the full "stream" of the natural search engines to flow through these highly weighted paths, which will tend to be very direct links through Products mapped to the Player Property.

We can layer in other business rules as well.  One business rule that we are using in the above model is the method of computing a multi-step Customer Path "weight".  In the example above, we simply decided to add the cumulative weights of all "hops" in each Customer Path and divide by the number of hops.  Take the Customer Path of "Tank Tops / Ladies / Cycling / Lance Armstrong".  Each "hop" as the customer steps through that path adds to our total and because there are four hops along the path, we divide the total (34) by the hops (4) and come up with an overall weight of 8.5.  This business rule may be subject to some review.  It seems that an alternative formula might be to reduce each hop's weight by the "distance" from the initial starting point.  This would then be 8 + (7-1) + (9-2) + (10-3) = 28 / 4 = 7.  However you decide to "compute" the average weight of any given Customer Path, it should make sense for your business while delivering some automation where possible for the NOFOLLOW mappings within your categorization scheme.

This demonstrates yet another possible use of blending the choice of data structure with your requirements for SEO initiatives.  We can explore more methods of integrating data models with Search Engine Optimization techniques in Part 6 - Best Business Practices for Product Catalog Data Structures - Search Optimization.

Using Captcha to Stop Verisign Spamming - Best Practices for e-Commerce Storefront Security - Part 1

October 29, 2008 at 8:56 AMJared Nielsen

In an e-commerce environment, there is a constant struggle between polar opposites that must be balanced and adjusted at all times.  Some of those include:

  • Customer Experience vs. Search Engine Experience
  • Marketing Flexibility vs. IT Standardization
  • Ease of Use vs. Security

When dealing with Ease of Use versus Security, you run into very interesting issues, one of which is the ability for a customer to come into your e-commerce storefront and validate their credit card.  This simple transaction sounds great from a customer service perspective (who wouldn't want a customer to be able to know if their credit card is good or not?) but it becomes complicated in an online environment.  A brick and mortar store has to contend with credit card fraud, but they at least have the privilege of looking the customer in the eye and seeing their credit card.  They also can only fit one person at the counter at a time, which means the rate at which credit validations occurs is inherently limited or throttled.

Online this becomes an entirely different matter.  It may cost $0.25 per "validation" to have VeriSign respond whether a credit card is valid or not and an adept hacker can slam through thousands of cards per second.  You also can never truly be certain if they are holding that worn plastic card or they simply bought the list of credit card numbers from an offshore swindler. 

The customer service benefit of saying, "Thank you but your credit card is not valid" now becomes an offensive weapon that a credit card hacker can use to sift through a list of made-up cards to determine which ones are valid or not.  Your obligation as a good internet commerce citizen is to stop this behavior if you can, or at a minimum, slow them down and deliver as little information as possible.

What does a hacker stand to benefit from Verisign Spamming?  There are many obvious benefits and some that aren't so clear.  Here is a diagram that demonstrates the many "openings" available to a Verisign spamming hacker:

 

In crowded neighborhoods where traffic is a problem, they use speed bumps... and the Internet equivalent of that is the "captcha".  The word CAPTCHA is actually an acronym derived from the following: "Completely Automated Public Turing test to tell Computers and Humans Apart".  A Turing Test is more aptly described in Wikipedia if you're curious...

Generally a CAPTCHA authentication can be very complex or relatively simple.  For the purposes of a "speed bump" however, (and assuming you aren't working for the Nuclear Regulatory Commission) you don't need to do anything over the top.  Here's a simple diagram of the standard payment process:

 

Naturally, a hacker can set a script to run through the above process with reckless abandon, either exploiting loose account creation rules, penetrating an existing registered user account, or abusing the anonymous checkout option for users.  Once their script has entered the payment process, they can automate rapid-fire capture authorizations through the VeriSign validation mechanism in an effort to see which ones branch off to the "invalid card" page or the "valid card" page.  Even a nuanced error message that attempts to kindly inform the customer of which field was entered incorrectly will deliver vital information to the hacker so they know what was (or wasn't) valid.  Not only do VeriSign validations cost a quarter or so per "hit", depending on the order processing rules, you can even end up with tentative amounts frozen on the credit cards even if the transaction is rejected (an "Auth Capture").

 This unfettered access needs a "traffic cop" to regulate the ease with which an automated script can burn through the payment and credit card validation procedure.  Note the below implementation of a Captcha validator as a "speed bump" in the checkout process:

 

At a minimum, this prevents the hacker from obtaining an efficient and rich dataset from their list of stolen (or generated) credit card numbers and slows down their automated processing enough that hopefully they will move along to another website victim.  Please note that this isn't the only mechanism to secure your e-commerce platform... just one of the pieces of the puzzle.

Generating the Captcha involves a few simple steps which I will outline below.  I've been accused of being far too technical in these articles, so the remainder may be something you'll care to avoid...

The purpose of a captcha is to obscure the borders of the letters and numbers so an automated OCR (Optical Character Recognition) routine can't read the text without a human being involved.  The image also must be a single, complete image and not text overlaid in HTML over an image background so the text is obfuscated in the binary image file rather than "scrapeable" by some bot.  Once the image is generated, its text must be matched on the other side with the "key" text that matches what was embedded into the image.  If the customer enters the proper Captcha code, then the website can continue.

The first step is to create the table that will house the valid pool of Captcha numbers in SQL Server 2005:

CREATE TABLE [Captcha](     [ID] [uniqueidentifier] ROWGUIDCOL  NOT NULL CONSTRAINT [DF_Captcha_ID]  DEFAULT (newid()),     [Status] [int] NULL,  CONSTRAINT [PK_Captcha] PRIMARY KEY CLUSTERED (     [ID] ASC )

The second step is to create the stored procedure that will generate the Captcha code.  I'm going to borrow a t-SQL script for SQL Server 2005 that I used to demonstrate the passing of XML parameter data in lieu of normal parameters to stored procedures as shown below:

CREATE PROCEDURE [Captcha]     @Parameters xml AS SET NOCOUNT ON; ------------------------------------------------------------------------------------------------------------------ -- Valid XML Parameter Definitions ------------------------------------------------------------------------------------------------------------------ --    <security type="1" version="1"> --        <request> --            <!-- Methods --> --            <parameter name="Action" class="method">Get</parameter> --            <!-- Constants  - Settings --> --            <parameter name="Precision" class="constant">7</parameter> --        </request> --    </security> ------------------------------------------------------------------------------------------------------------------ --  Valid Security Values           type           version ------------------------------------------------------------------------------------------------------------------ --                                  1-captcha      1-current ------------------------------------------------------------------------------------------------------------------ --  Valid Parameter Values          class          match          id               name ------------------------------------------------------------------------------------------------------------------ --                                  table          starts-with    GUID identifier  (listed above) --                                  constant       ends-with --                                  property       equals --                                  method         contains ------------------------------------------------------------------------------------------------------------------ --  Valid Method Values             Actions ------------------------------------------------------------------------------------------------------------------ --                                  Get --                                  Validate --                                  Regenerate ------------------------------------------------------------------------------------------------------------------ -- TEST CASES ------------------------------------------------------------------------------------------------------------------ -- Standard Captcha Request ------------------------------------------------------------------------------------------------------------------ /* exec Captcha '     <security type="1" version="1">         <request>             <parameter name="Action" class="method">Get</parameter>             <parameter name="Precision" class="constant" match="equal-to">7</parameter>         </request>     </security>' */ ------------------------------------------------------------------------------------------------------------------ -- Standard Captcha Validate ------------------------------------------------------------------------------------------------------------------ /* exec Captcha '     <security type="1" version="1">         <request>             <parameter name="Action" class="method">Validate</parameter>             <parameter name="Token" class="constant" match="equal-to">47842F8</parameter>         </request>     </security>' */ ------------------------------------------------------------------------------------------------------------------ -- Standard Captcha Regenerate ------------------------------------------------------------------------------------------------------------------ /* exec Captcha '     <security type="1" version="1">         <request>             <parameter name="Action" class="method">Regenerate</parameter>         </request>     </security>' */ ------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------- -- Temp Table Variable to Store incoming Parameters --------------------------------------------------------------------------------     DECLARE @tParameters TABLE         (             [Name] [nvarchar](64) NULL,             [Class] [nvarchar](32) NULL,             [Match] [nvarchar](32) DEFAULT 'contains' NULL,             [Value] [nvarchar](max) NULL,             [ID] [uniqueidentifier] NULL         ) -------------------------------------------------------------------------------- -- Deposit Incoming Parameters into temp Table Variable --------------------------------------------------------------------------------     INSERT INTO @tParameters ([Name], [Class], [Match], [Value], [ID])     SELECT           P.parameter.value('@name', 'nvarchar(64)')         , P.parameter.value('@class', 'nvarchar(32)')         , P.parameter.value('@match', 'nvarchar(32)')         , P.parameter.value('.', 'nvarchar(max)')         , P.parameter.value('@ID', 'uniqueidentifier')     FROM @Parameters.nodes('/captcha/request/parameter') P(parameter) --------------------------------------------------------------------------------     IF 'get' = Lower((SELECT Value FROM @tParameters WHERE Name='Action' and class='method'))     BEGIN         SELECT TOP 1 LEFT(REPLACE(C.ID,'-',''),P.value)         FROM Captcha C             , @tParameters P         WHERE P.Name='Precision'         ORDER BY NEWID()         RETURN     END --------------------------------------------------------------------------------     IF 'validate' = Lower((SELECT Value FROM @tParameters WHERE Name='Action' and class='method'))     BEGIN         SELECT COUNT(C.ID)         FROM Captcha C         WHERE C.ID LIKE (SELECT Value FROM @tParameters WHERE Name='Token') + '%'         RETURN     END --------------------------------------------------------------------------------     IF 'regenerate' = Lower((SELECT Value FROM @tParameters WHERE Name='Action' and class='method'))     BEGIN         DELETE Captcha         INSERT INTO Captcha (status)         SELECT TOP 100 1 FROM product  -- or other table containing at least 100 rowsets         SELECT COUNT(ID) TokenCount FROM Captcha         RETURN     END --------------------------------------------------------------------------------

This stored procedure encapsulates the three valid methods that can be used for Captcha validation:

  1. Get a random Captcha token from the authorized pool of valid Captcha numbers.
  2. Test an incoming token against the authorized pool of valid Captcha numbers
  3. Regenerate the pool of valid Captcha numbers (to foil any scripts that happen to figure out some of the codes)

Now that we have deployed the table, the stored procedure is ready to "take our orders", and we have the Captcha background in place ()... then we are ready to deploy the code necessary to generate the image, merge it with the valid Captcha token, and display it.

Note we are using the System.Drawing namespace here to read in the background, overlay the image with a "watermark" containing the Captcha token that is retrieved from the database, and then delivering that in a binary stream.  If this .aspx file were called "Captcha.aspx" then the resulting image that referenced this in HTML would be <img src=captcha.aspx"/>.

using System;using System.Configuration;using System.Data;using System.Data.SqlClient;using System.Drawing;using System.Drawing.Drawing2D;using System.Drawing.Imaging;using System.IO;public partial class captcha : System.Web.UI.Page{ //Default DataSource private static string connectionString = ConfigurationManager.ConnectionStrings["database"].ConnectionString; protected void Page_Load(object sender, EventArgs e) { const string NOT_AVAILABLE_IMG_NAME = "NotAvailable.jpg"; try { Image image = Image.FromFile(Server.MapPath("captchaBackground.jpg")); image = image.GetThumbnailImage(image.Width, image.Height, new Image.GetThumbnailImageAbort(ThumbnailCallback), IntPtr.Zero); image = AddWatermark(image, 0, 10); Response.ContentType = "image/jpeg"; image.Save(Response.OutputStream, ImageFormat.Jpeg); } catch { Image errorImage = Image.FromFile(Server.MapPath(NOT_AVAILABLE_IMG_NAME)); Response.ContentType = "image/jpeg"; errorImage.Save(Response.OutputStream, ImageFormat.Jpeg); } } private static Image AddWatermark(Image currImage, Int32 targetWidth, Int32 targetHeight) { Graphics g = Graphics.FromImage(currImage); Font watermarkFont; String watermarkText = GetData(); watermarkFont = new Font("Verdana", 14, FontStyle.Bold); g.InterpolationMode = InterpolationMode.HighQualityBicubic; g.DrawImage(currImage, 0, 0, targetWidth, targetHeight); g.SmoothingMode = SmoothingMode.HighQuality; SolidBrush foregroundText = new SolidBrush(Color.FromArgb(100, 0, 0, 0)); SolidBrush backgroundText = new SolidBrush(Color.FromArgb(50, 255, 255, 255)); g.DrawString(watermarkText, watermarkFont, backgroundText, Convert.ToSingle(targetWidth / 2), Convert.ToSingle(targetHeight / 2)); g.DrawString(watermarkText, watermarkFont, foregroundText, Convert.ToSingle(targetWidth / 2) + 2, Convert.ToSingle(targetHeight / 2) - 2); return currImage; } static Boolean ThumbnailCallback() { return true; } private static String GetData() { SqlConnection conn = null; try { conn = new SqlConnection(connectionString); conn.Open(); SqlCommand cmd = new SqlCommand("Captcha", conn); cmd.CommandType = CommandType.StoredProcedure; String parameters = "<captcha><request><parameter name=\"Action\" class=\"method\">Get</parameter><parameter name=\"Precision\" class=\"constant\" match=\"equal-to\">7</parameter></request></captcha>"; cmd.Parameters.Add(new SqlParameter("@Parameters", parameters)); return (String)cmd.ExecuteScalar(); } finally { if (conn != null) { conn.Close(); } } }}

There is more work to be done to integrate this Captcha image into the page and check against the stored procedure shown above to validate it... and apply some business logic, but I hope this article has given you a sense of how to generate the Captcha, and what it's purposes are. 

There are many other resources for professionally designed Captcha integrations including some fun ones I liked (and some "gotchas" I thought were funny...):

   

Posted in: e-Commerce | Security

Tags: , , , , , , , , , , ,

I'm Human... You can talk to me personally...

I am available for speaking engagements and book signings. If you are interested in having me attend your tradeshow, corporate training event, or media event please contact me directly using the below form.