|
purebill.com Stephen Jones writing on billing and application migration |
![]() |
| . | Home | . | About | . | Archive | . | Links | . | Billing | . | Reference | . | Subscribe | . | Search | . | . |
Column - 20 July 2009 Managing the 'case' of alphanumeric keys and identifiersSummaryKey alphanumeric identifiers such as email addresses, user names and hardware references (e.g. MAC addresses) are often used as indexes in database tables, and can have problems when their case, that is uppercase, lowercase or mixed case, impacts the application's retrieval of or access to the identifiers. For example, without prior processing, identifiers loaded as their uppercase equivalent values will not be found when searched for using an exact match against a lowercase equivalent. This complication is compounded when identifiers are passed to external/downstream systems, not just retained and used internally. Three approaches to addressing this situation are:
SituationIdentifiers such as a user name can be entered in one case, for example mixed case, in the form of 'StephenJones', but later processing might retrieve the same user name using a different case, for example using a user entered lower case form of 'stephenjones'. Users performing searches and lookups are likely to type the full range of upper, lower and mixed case entries, will varying what they type over time, and will expect their data is found reliably. In addition, identifiers and keys passing through upstream applications may have their case modified without consultation, and these unasked for modifications should not cause the local application's processing to fail. ProblemsWhen identifiers are passed from upstream into your application's local processing and must be passed downstream or returned to the originating system with their case preserved, your local application must not modify the identifiers' case or the identifiers will not match the originating system's stored values. Of course, not all up and downstream systems will be case sensitive, but many are and the defensive design presumption should be that future interfaces will be. The problems that appear with case occur when not all systems in an domain ecosystem handle carefully the case of their identifiers. An identifier's entire ecosystem must preserve and take the same approach to case preservation, or all systems must treat the case as nothing special. If any one system stops preserving case, it is likely to pass on the incorrect case, for example an upper cased identifier to other (downstream) systems, which may then have a problem matching against their stored identifiers. This is a Yoda situation - do or do not, there is no try. Preserve case completely, or preserve it not at all. TimingCase preservation must start from a database's initial population, and generally cannot be retrofitted once field population has commenced. There may be some such circumstances where it is known that all existing identifiers are lowercase or of a certain construction by identifier type, but it is unwise to rely on this being the case across all fields across time. If database population has commenced, a refresh of the database from an external point of truth may be possible, allowing fields that were initially stored with an incorrect case to be stored correctly. This approach presumes that a consistent and canonical version of the identifier is available elsewhere, is easily passed again to the downstream application, and that the act of receiving such a refresh will not adversely impact the receiving application (e.g. performance degradation, increased storage requirements, network congestion as data is retransmitted, duplicate records presented to end customers or staff). SearchingAn example is a username key stored away as 'StephenJones', but when a user later searches for it simply as 'stephenjones' a simple string search or lookup will fail as the two strings do not match, and the search will therefore be seen as unsuccessful. This will return the 'false' result to the user's search that 'StephenJones' / 'stephenjones' is not present. An extension of this is to retain both the identifier's original form ('as passed'), such as 'StephenJones', and also store it in a separate field/columns in (say) an uppercased form for use in searching. When retrieved, the original form can used in processing, preserving the original case, and ensuring that identifiers match across different applications storing the same identifier. By pre-standardising the search identifier, a search will find a string even when entered as 'StephenJones', 'stephenjones', or 'STEPHENJONES'. Where different cased strings for the same identifier type are passed legitimately to an application and must be stored separately, the list of possible matches can be returned and the user can select from them (or an additional field is required to distinguish between them). Performance ImpactsWhilst command options and special functions can perform case insensitive matching, this functionality can generate substantial and unnecessarily repeated workloads, which will become a bigger problem as database sizes or transaction loads increase. The approach outlined above incurs a small one-time performance hit (case conversion) at the point of insertion, and a modest additional data storage cost, but the long-term processing cost results in an equivalent retrieval cost to the regular lookup and case conversions are avoided. This is a trade-off of space (storage of additional columns and their indexes), for CPU/time (required for case conversion). The three alternatives outlined at the beginning of this text all perform direct string matches with no additional processing per access, but they have different application / functional trade-offs such as missed matches, or loss to downstream applications of an identifier's original case. Tags: Case, Identifier, Key, Index, Search [ Share with others ] Post this page to a social bookmarking site:
Other 'purebill' columnsPrevious column: Stretch Key Dimensions to See What Breaks Next column: Defend revenue in depth - Credit Assessment, Fraud and Credit Management All previous purebill columns can be found in the archive section. Recent Updates
Sign up to receive a brief text email when a new purebill column is published. JUMP TO TOP
|
. |
| Comments welcome: stephenjones(at)purebill.com | Stephen Jones © 2004-2010 - Copyright and reprint rules | Sitemap | . |