Quality Stage in Data Stage
Data Quality Challenges
- Different or inconsistent standards in structure, format or values
- Missing data, default values
- Spelling errors, data in wrong fields
- Buried information
- Data myopia
- Data anomalies
Different or Inconsistent
Standards
Missing Data & Default Values

Buried Information

The Anomalies Nightmare

Quality Stage
Quality Stage is a tool intended to deliver the high quality data required for success in a range of enterprise initiatives including business intelligence, legacy consolidation and master data management. It does this primarily by identifying components of data that may be in columns or free format, standardizing the values and formats of those data, using the standardized results and other generated values to determine likely duplicate records, and building a “best of breed” record out of these sets of potential duplicates.

Through its intuitive user interface Quality Stage substantially reduces the time and cost to implement Customer Relationship Management (CRM), data warehouse/business intelligence (BI), data governance, and other strategic IT initiatives, and maximizes their return on investment by ensuring their data quality.

With Quality Stage it is possible, for example, to construct consolidated customer and household views, enabling more effective cross-selling, up-selling, and customer retention, and to help improve customer support and service, for example by identifying a company's most profitable customers. The cleansed data provided by Quality Stage allows creation of business intelligence on individuals and organizations for research, fraud detection, and planning.
Out of the box Quality Stage provides for cleansing of name and address data and some related types of data such as email addresses, tax IDs and so on. However, Quality Stage is fully customizable and can cleanse any kind of classifiable data, such as infrastructure, inventory, health data, and so on.
Quality Stage Heritage
The product now called Quality Stage has its origins in a product called INTEGRITY from a company called Vality. Vality was acquired by Ascential Software in 2003 and the product was renamed Quality Stage. This first version of Quality Stage reflected its heritage (for example it only had batch mode operation) and, indeed, its mainframe antecedents (for example file name components limited to eight characters).

Ascential did not do much with the inner workings of Quality Stage, which was, after all, already a mature product. Ascential's emphasis was to provide two new modes of operation for Quality Stage. One was a “plug-in” for Data Stage that allowed data cleansing/standardization to be performed (by Quality Stage jobs) as part of an ETL data flow. The other was to provide for Quality Stage to use the parallel execution technology (Orchestrate) that Ascential had as a result of its acquisition of Torrent Systems in 2001.

IBM acquired Ascential Software at the end of 2005. Since then the main direction has been to put together a suite of products that share metadata transparently and share a common set of services for such things as security, metadata delivery, reporting, and so on. In the particular case of Quality Stage, it now shares a common Designer client with Data Stage: from version 8.0 onwards Quality Stage jobs run as, or as part of, Data Stage jobs, at least in the parallel execution environment.
QualityStage Functionality
Four tasks are performed by QualityStage: investigation, standardization, matching and survivorship. We need to look at each of these in turn. Under the covers QualityStage incorporates a set of probabilistic matching algorithms that can find potential duplicates in data despite variations in spelling, numeric or date values, use of non-standard forms, and various other obstacles to performing the same tasks using deterministic methods. For example, if you have what appears to be the same employee record where the name is the same but the date of hire differs by a day or two, a deterministic algorithm would show two different employees whereas a probabilistic algorithm would show the potential duplicate.

(Deterministic means “absolute” in this sense: either something is equal or it is not. Probabilistic leaves room for some degree of uncertainty; a value is close enough to be considered equal. Needless to say, the degree of uncertainty used within QualityStage is configurable by the designer.)
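To make the distinction concrete, here is a minimal Python sketch (an illustration only, not QualityStage's algorithm) comparing two employee records deterministically and with a simple tolerance-based rule; the field names, values and two-day tolerance are assumptions for the example.

from datetime import date

# Two records that appear to describe the same employee, but with hire dates
# that differ by one day (for example, a data-entry error in one source).
rec_a = {"name": "JOHN SMITH", "hire_date": date(2001, 3, 12)}
rec_b = {"name": "JOHN SMITH", "hire_date": date(2001, 3, 13)}

def deterministic_match(a, b):
    # Absolute comparison: every field must be exactly equal.
    return a["name"] == b["name"] and a["hire_date"] == b["hire_date"]

def tolerant_match(a, b, max_days_apart=2):
    # Leaves room for uncertainty: dates "close enough" still count as agreeing.
    names_agree = a["name"] == b["name"]
    dates_agree = abs((a["hire_date"] - b["hire_date"]).days) <= max_days_apart
    return names_agree and dates_agree

print(deterministic_match(rec_a, rec_b))  # False -> treated as two employees
print(tolerant_match(rec_a, rec_b))       # True  -> flagged as a potential duplicate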
Investigation
By investigation we mean inspection of the data to reveal certain types of information about those data. There is some overlap between Quality Stage investigation and the kinds of profiling results that are available using Information Analyzer, but not so much overlap as to suggest removing functionality from either tool. Quality Stage can undertake three different kinds of investigation.
Features
- Data investigation is done using the Investigate stage.
- This stage analyzes each record, field by field, for its content and structure.
- Free-form fields are broken up into individual tokens and then analyzed.
- Provides frequency distributions of distinct values and patterns.
- Each investigation phase produces pattern reports, word frequency reports and word classification reports. The reports are located in the data directory of the server.
Investigate methods


Character Investigation
Single-domain fields
- Entity identifiers, e.g. ZIP codes, SSNs, Canadian postal codes
- Entity clarifiers, e.g. name prefix, gender, and marital status
Multiple-domain fields
- Large free-form fields such as multiple address fields

Character discrete investigation looks at the characters in a single field (domain) to report what values or patterns exist in that field. For example, a field might be expected to contain only codes A through E. A character discrete investigation looking at the values in that field will report the number of occurrences of every value in the field (and therefore any out of range values, empty or null, etc.). “Pattern” in this context means whether each character is alphabetic, numeric, blank or something else. This is useful in planning cleansing rules; for example a telephone number may be represented with or without delimiters and with or without parentheses surrounding the area code, all in the same field. To come up with a standard format, you need to be aware of what formats actually exist in the data. The result of a character discrete investigation (which can also examine just part of a field, for example the first three characters) is a frequency distribution of values or patterns – the developer determines which.
Character concatenate investigation is exactly the same as character discrete investigation except that the contents of more than one field can be examined as if they were in a single field – the fields are, in some sense, concatenated prior to the investigation taking place. The results of a character concatenate investigation can be useful in revealing whether particular sets of patterns or values occur together.
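The following minimal Python sketch illustrates the idea behind a character investigation pattern report; the symbol convention used here (alphabetic as 'a', numeric as 'n', blank as 'b', other characters kept as themselves) and the sample telephone values are assumptions for illustration, not the Investigate stage's implementation.

from collections import Counter

def char_pattern(value: str) -> str:
    # a = alphabetic, n = numeric, b = blank, otherwise keep the character itself.
    out = []
    for ch in value:
        if ch.isalpha():
            out.append("a")
        elif ch.isdigit():
            out.append("n")
        elif ch == " ":
            out.append("b")
        else:
            out.append(ch)
    return "".join(out)

phones = ["(555) 123-4567", "555-123-4567", "5551234567", "(555)123 4567"]
print(Counter(char_pattern(p) for p in phones))
# Each distinct layout of the telephone number shows up as its own pattern,
# e.g. '(nnn)bnnn-nnnn', 'nnn-nnn-nnnn', 'nnnnnnnnnn', '(nnn)nnnbnnnn'.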
Word investigation is probably the most important of the three for the entire QualityStage suite, performing a free-format analysis of the data records. It performs two different kinds of task: one is to report which words/tokens are already known, in terms of the currently selected “rule set”; the other is to report how those words are to be classified, again in terms of the currently selected “rule set”. There is no overlap between word investigation and Information Analyzer (the data profiling tool).
Rule Set:
A rule set includes a set of tables that list the “known” words or tokens. For example, the GBNAME rule set contains a list of names that are known to be first names in Great Britain, such as Margaret, Charles, John, Elizabeth, and so on. Another table in the GBNAME rule set contains a list of name prefixes, such as Mr, Ms, Mrs and so on, that can not only be recognized as name prefixes (titles, if you prefer) but can in some cases reveal additional information, such as gender.
When a word investigation reports about classification, it does so by producing a pattern. This shows how each known word in the data record is classified, and the order in which each occurs. For example, under the USNAME rule set the name WILLIAM F. GAINES III would report the pattern FI?G – the F indicates that “William” is a known first name, the I indicates that “F” is an initial, the ? indicates that “Gaines” is not a known word in context, and the G indicates that “III” is a “generation” – as would be “Senior”, “IV” and “fils”. Punctuation may be included or ignored.
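A minimal Python sketch of how such a pattern can be derived from a classification table; the table entries and the single-letter-equals-initial rule are invented for illustration and are not taken from the actual USNAME rule set.

# Illustrative classification entries: token -> (standard form, class).
classification = {
    "WILLIAM": ("WILLIAM", "F"),   # known first name
    "BILL":    ("WILLIAM", "F"),
    "III":     ("III",     "G"),   # generation
    "SR":      ("SENIOR",  "G"),
}

def classify(token: str) -> str:
    if token in classification:
        return classification[token][1]
    if len(token) == 1 and token.isalpha():
        return "I"                 # single letter treated as an initial
    return "?"                     # unknown word in this context

tokens = ["WILLIAM", "F", "GAINES", "III"]
print("".join(classify(t) for t in tokens))   # FI?G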
Rule sets also come into play when performing standardization (discussed below). Classification tables contain not only the words/tokens that are known and classified, but also the standard form of each (for example “William” might be recorded as the standard form for “Bill”) and may contain an uncertainty threshold (for example “Felliciity” might still be recognizable as “Felicity” even though it is misspelled in the original data record). Probabilistic matching is one of the significant strengths of QualityStage.
Investigation might also be performed to review the results of standardization, particularly to see whether there are any unhandled patterns or text that could be better handled if the rule set itself were tweaked, either with improved classification tables or through a mechanism called rule set overrides.
Standardization
Standardization, as the name suggests, is the process of generating standard forms of data that might more reliably be matched. For example, by generating the standard form “William” from “Bill”, there is an increased likelihood of finding the match between “William Gates” and “Bill Gates”. Other standard forms that can be generated include phonetic equivalents (using NYSIIS and/or Soundex), and something like “initials” – maybe the first two characters from each of five fields.
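A minimal Python sketch of the core idea: tokens are replaced by their standard forms, and a simple generated key is built from them; the table entries and the two-character key are assumptions for illustration, not a delivered rule set.

# Illustrative standard forms (classification-table style): token -> standard form.
standard_form = {"BILL": "WILLIAM", "WM": "WILLIAM", "RD": "ROAD", "ST": "STREET"}

def standardize(tokens):
    return [standard_form.get(t.upper(), t.upper()) for t in tokens]

name_a = standardize("Bill Gates".split())
name_b = standardize("William Gates".split())
print(name_a, name_b, name_a == name_b)   # both become ['WILLIAM', 'GATES'] -> True

# A simple generated "initials"-style match key: first two characters of each token.
print("".join(t[:2] for t in name_a))     # WIGA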
Each standardization specifies a particular rule set. As well as word/token classification tables, a rule set includes a specification of the format of an output record structure, into which original and standardized forms of the data, generated fields (such as gender) and reporting fields (for example whether a user override was used and, if so, what kind of override) may be written.
It may be that standardization is the desired end result of using Quality Stage. For example, street address components such as “Street” or “Avenue” or “Road” are often represented differently in data, perhaps differently abbreviated in different records. Standardization can convert all the non-standard forms into whatever standard format the organization has decided that it will use.

This kind of Quality Stage job can be set up as a web service. For example, a data entry application might send in an address to be standardized. The web service would return the standardized address to the caller.

More commonly, standardization is a preliminary step towards performing matching. More accurate matching can be performed if standard forms of words/tokens are compared than if the original forms of these data are compared.
Standardization Process Flow

Delivered Rule Sets Methodology in Standardization

Example: Country Identifier Rule Set

Example: Domain Pre-processor Rule Set

Example: Domain Specific Rule Set

Logic for NAME Rule Set
- Set variables for process option delimiters
- Process the most common patterns first
- Simplify the patterns
- Check for common patterns again
- Check for multiple names
- Process organization names
- Process individual names
- Default processing (based on process options)
- Post process subroutine to populate matching fields
Logic of ADDR Rule Sets
- Process the most common patterns first
- Simplify the patterns
- Check for common patterns again
- Call subroutines for each secondary address element
- Check for street address patterns
- Post process subroutine to populate matching fields
Logic of AREA Rule Sets
- Process input from right to left
- Call subroutines for each sub-domain (i.e. country name, post code, province, city)
- Post process subroutine to populate matching fields

Rule Sets
- Rule Sets are standardization processes used by the Standardize Stage and have three required components:
1. Classification Table – contains the key words that provide special context, their standard value, and their user-defined class
2. Dictionary File – defines the output columns that will be created by the standardization process
3. Pattern-Action File – drives the logic of the standardization process and decides how to populate the output columns
- Optional rule set components:
  - User Overrides
  - Reference Tables
Standardization Example

Parsing (the Standardization Adventure Begins…)
- The standardization process begins by parsing the input data into individual data elements called tokens
- Parsing parameters are provided by the pattern-action file
- Parsing parameters are two lists of individual characters:
  - SEPLIST – any character in this list will be used to separate tokens
  - STRIPLIST – any character in this list will be removed
- The SEPLIST is always applied first
- Any character that is in the SEPLIST and not in the STRIPLIST will be used to separate tokens and will also become a token itself
- The space character should be included in both lists
- Any character that is in both lists will be used to separate tokens but will not become a token itself
  - The best example of this is the space character: one or more spaces are stripped, but the space indicates where one token ends and another begins (see the sketch below)
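The parsing rules above can be sketched in a few lines of Python; this is an illustration of the described SEPLIST/STRIPLIST behaviour, not QualityStage's parser, and the sample inputs are invented.

def tokenize(text, seplist, striplist):
    tokens, current = [], ""
    for ch in text:
        if ch in seplist:                  # the SEPLIST is applied first
            if current:
                tokens.append(current)
                current = ""
            if ch not in striplist:        # a separator that is not stripped becomes a token itself
                tokens.append(ch)
        elif ch in striplist:              # stripped without separating tokens
            continue
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens

print(tokenize("123 MAIN ST.", seplist=" ", striplist=" ."))   # ['123', 'MAIN', 'ST']
print(tokenize("C/O SMITH",    seplist=" /", striplist=" "))   # ['C', '/', 'O', 'SMITH']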

Parsing (Chinese, Japanese, Korean)
- The parser behaves differently if the locale setting is Chinese, Japanese, or Korean
- Spaces are not used to divide tokens, so each character, including a space, is considered a token
- Spaces are represented by underscores (_) in the pattern
- The classification file allows multiple characters to be classified together
- Latin characters are transformed to double-byte representations
Classification
- Parsing separated the input data into individual tokens
- Each token is basically either an alphabetic word, a number, a special character, or some mixture
- Classification assigns a one-character tag (called a class) to each and every individual parsed token to provide context
- First, key words that can provide special context are classified
  - Provided by the standardization rule set classification table
  - Since these classes are context specific, they vary across rule sets
- Next, default classes are assigned to the remaining tokens
  - These default classes are always the same regardless of the rule set used
- Lexical patterns are assembled from the classification results
  - Concatenated string of the classes assigned to the parsed tokens
Classification Example

Default Classes

Class | Description
^     | A single numeric token
+     | A single unclassified alpha token
?     | One or more consecutive unclassified alpha tokens
>     | Leading numeric mixed token (i.e. 2B, 88WR)
<     | Trailing numeric mixed token (i.e. B2, WR88)
@     | Complex mixed token (i.e. NOT2B, C3PO, R2D2)
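A minimal Python sketch of assigning these default classes to individual tokens (the '?' class, which aggregates consecutive unclassified alpha tokens at the pattern level, is omitted for brevity); this is an illustration of the table above, not QualityStage's classifier.

def default_class(token: str) -> str:
    if token.isdigit():
        return "^"                       # single numeric token
    if token.isalpha():
        return "+"                       # single unclassified alpha token
    if token.isalnum() and token[0].isdigit() and token[-1].isalpha():
        return ">"                       # leading numeric mixed token, e.g. 2B, 88WR
    if token.isalnum() and token[0].isalpha() and token[-1].isdigit():
        return "<"                       # trailing numeric mixed token, e.g. B2, WR88
    return "@"                           # complex mixed token, e.g. NOT2B, C3PO

for t in ["88", "MAIN", "2B", "WR88", "C3PO"]:
    print(t, default_class(t))
# 88 ^ / MAIN + / 2B > / WR88 < / C3PO @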
Default Classes (Special Characters)
- Some special characters are “reserved” for use as default classes that describe token values that are not actual special character values
  - For example: ^ + ? > < @ (as described in the table above)
- However, if a special character is included in the SEPLIST but omitted from the STRIPLIST, then the default class for that special character becomes the special character itself, and in this case the default class does describe an actual special character value
  - For example: periods (.), commas (,), hyphens (-)
  - It is important to note this can also happen to the “reserved” default classes (for example: ^ = ^ if ^ is in the SEPLIST but omitted from the STRIPLIST)
- Also, if a special character is omitted from both the SEPLIST and STRIPLIST (and it is surrounded by spaces in the input data), then the “special” default class of ~ (tilde) is assigned
  - If not surrounded by spaces, then the appropriate mixed token default class would be assigned (for example: P.O. = @ if . is omitted from both lists)
Default Class (NULL Class)
- Has nothing to do with NULL values
- The NULL class is a special class
  - Represented by a numeric zero (0)
  - The only time that a number is used as a class
- Tokens classified as NULL are unconditionally removed
- Essentially, the NULL class does to complete tokens what the STRIPLIST does to individual characters
- Therefore, you will never see the NULL class represented in the assembled lexical patterns
Classification Table
Classification tables contain three required space-delimited columns:
- Key word that can provide special context
- Standard value for the key word
  - The standard value can be either an abbreviation or an expansion
  - The pattern-action file will determine if the standard value is used
- Data class (one-character tag) assigned to each key word
Classification Table Example
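Since the original example image is not reproduced here, the following invented entries illustrate the three space-delimited columns described above (key word, standard value, one-character class); they are not taken from a delivered rule set.

BILL       WILLIAM    F
WILLIAM    WILLIAM    F
MISTER     MR         P
MR         MR         P
JUNIOR     JR         G
III        III        G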

Tokens in the Classification Table
- A common misconception by new users is assuming that every input alpha token should be classified by the classification table
  - Unclassified != unhandled (i.e. unclassified tokens can still be processed)
- The classification table is intended for key words that provide special context, which means context essential to the proper processing of the data
- General requirements for tokens in the classification table:
  - Tokens with standard values that need to be applied (within proper context)
    - Tokens that require standard values, especially standard abbreviations, will often map directly into their own dictionary columns
    - This does not mean that every dictionary column requires a user-defined class
  - Tokens with both a high individual frequency and a low set cardinality
    - Low set cardinality means that the token belongs to a group of related tokens that have a relatively small number of possible values, and therefore the complete token group can be easily maintained in the classification table
    - If set cardinality is high, adjacent tokens can often provide the necessary context

What is a Dictionary File?
- Defines the output columns created by the standardization rule set
- When data is moved to these output columns, it is called “bucketing”
- The order in which the columns are listed in the dictionary file defines the order in which the columns appear in the standardization rule set output
- Dictionary file entries are used to automatically generate the column metadata available for mapping on the Standardize Stage output link
Dictionary File Example

Dictionary File Fields (Output Columns)
- Standardization can prepare data for all of its uses, and therefore most dictionary files contain three types of output columns:
- Business Intelligence
  - Usually comprised of the parsed and standardized input tokens
- Matching
  - Columns specifically intended to facilitate more effective matching
  - Commonly includes phonetic coding fields (NYSIIS and SOUNDEX)
- Reporting
  - Columns specifically intended to assist with the evaluation of the standardization results
Standard Reporting Fields in the Dictionary File

- Unhandled Pattern – the lexical pattern representing the unhandled data
- Unhandled Data – the tokens left unhandled (i.e. unprocessed) by the rule set
- Input Pattern – the lexical pattern representing the parsed and classified input tokens
- Exception Data – placeholder column for storing invalid input data (an alternative to deletion)
- User Override Flag – indicates whether or not a user override was applied (default = NO)
What is a Pattern-Action File?
- Drives the logic of the standardization process
- Configures the parsing parameters (SEPLIST/STRIPLIST)
- Configures the phonetic coding (NYSIIS and SOUNDEX)
- Populates the standardization output structures
- Written in Pattern-Action Language, which consists of a series of patterns and associated actions structured into logical processing units called Pattern-Action Sets
- Each Pattern-Action Set consists of:
  - One line containing a pattern, which is tested against the current data
  - One or more lines of actions, which are executed if the pattern tested true
- Pattern-Action Set Example
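The original example image is not reproduced here. As a substitute, the following Python sketch illustrates the concept only (it is not written in Pattern-Action Language): a pattern is tested against the current lexical pattern and, when it tests true, the associated actions move tokens into output columns. The pattern, column names and token indexes are assumptions for illustration.

# Tokens and their classes, e.g. "123 MAIN ST" -> lexical pattern "^?T".
tokens  = ["123", "MAIN", "ST"]
pattern = "^?T"

# One conceptual "pattern-action set": a pattern to test, and actions to run if it matches.
pattern_action_sets = [
    ("^?T", [("HouseNumber", 0), ("StreetName", 1), ("StreetSuffixType", 2)]),
]

output = {}
for test_pattern, actions in pattern_action_sets:
    if pattern == test_pattern:                 # the pattern is tested against the current data
        for column, token_index in actions:     # the actions execute when it tests true
            output[column] = tokens[token_index]
print(output)
# {'HouseNumber': '123', 'StreetName': 'MAIN', 'StreetSuffixType': 'ST'}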

Pattern-Action File Structure


Standardization vs. Validation
- In QualityStage, standardization and validation describe different, although related, types of processing
- Validation extends the functionality of standardization
- For example: 50 Washington Street, Westboro, Mass. 01581
  - Standardization can parse, identify, and re-structure the data as follows:
    - House Number = 50
    - Street Name = WASHINGTON
    - Street Suffix Type = ST
    - City Name = WESTBORO
    - State Abbreviation = MA
    - Zip Code = 01581
  - Validation can verify that the data describes an actual address and can also:
    - Correct City Name = WESTBOROUGH
    - Append Zip + 4 Code = 1013
  - Validation provides this functionality by matching against a database
How to Deal with Unhandled Data?
- There are two reporting fields in all delivered rule sets:
  - Unhandled Data
  - Unhandled Pattern
- To identify and review unhandled data:
  - Use an Investigate stage on the Unhandled Data and Unhandled Pattern columns
  - Use an SQA stage on the output of the Standardize stage
- Unhandled data may represent the entire input or a subset of the input
- If there is no unhandled data, it does not necessarily mean the data is processed correctly
- Some unhandled data does not need to be processed, if it does not belong to that domain
- The processing of a rule set may be modified through overrides or pattern-action language
User Overrides
- Most standardization rule sets are enabled with user overrides
- User overrides provide the user with the ability to make modifications without directly editing the classification table or the pattern-action file
- User overrides are:
  - Entered via simple GUI screens
  - Stored in specific objects within the rule set
- Classification overrides can be used to add classifications for tokens not in the classification table or to replace existing classifications already in the classification table
- The following pattern/text override objects are called based on logic in the pattern-action file:
  - Input pattern
  - Input text
  - Unhandled pattern
  - Unhandled text
Domain Specific Override Example

Classification Override

Input Text Override

Input Pattern Override

User Modification Subroutines
- There are two subroutines in each delivered rule set that are specifically for users to add pattern-action language
- User modifications within the pattern-action file:
  - Input Modifications
    - This subroutine is called after the Input User Overrides are applied but before any of the rule set pattern actions are checked
  - Unhandled Modifications
    - This subroutine is called after all the pattern actions are checked and the Unhandled User Overrides are applied
Pattern Action Language
What is Matching?
Matching is the real heart of Quality Stage. Different probabilistic algorithms are available for different types of data. Using the frequencies developed during investigation (or subsequently), the information content (or “rarity value”) of each value in each field can be estimated. The less common a value, the more information it contributes to the decision. A separate agreement weight or disagreement weight is calculated for each field in each data record, incorporating both its information content (the likelihood that a match actually has been found) and the probability that a match has been found purely at random. These weights are summed for each field in the record to come up with an aggregate weight that can be used as the basis for reporting that a particular pair of records probably are, or probably are not, duplicates of each other. There is a third possibility, a “grey area” in the middle, which Quality Stage refers to as the “clerical review” area – record pairs in this category need to be referred to a human to make the decision because there is not enough certainty either way. Over time the algorithms can be tuned with things like improved rule sets, weight overrides, different settings of probability levels and so on, so that fewer and fewer “clericals” are found.
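The following Python sketch illustrates how field-level agreement and disagreement weights of this kind are commonly computed in probabilistic (Fellegi-Sunter style) record linkage and summed into an aggregate weight with match/clerical cutoffs; the m and u probabilities and the cutoff values are invented for illustration and are not QualityStage defaults.

import math

# m = probability a field agrees given the records truly match
# u = probability a field agrees purely at random (related to how common its values are)
fields = {
    "surname":   {"m": 0.95, "u": 0.01},   # rare values carry high information content
    "city":      {"m": 0.90, "u": 0.20},   # common values contribute less
    "birthdate": {"m": 0.97, "u": 0.05},
}

def field_weight(field, agrees):
    m, u = fields[field]["m"], fields[field]["u"]
    if agrees:
        return math.log2(m / u)              # agreement weight (positive)
    return math.log2((1 - m) / (1 - u))      # disagreement weight (negative)

def aggregate_weight(agreements):
    return sum(field_weight(f, a) for f, a in agreements.items())

w = aggregate_weight({"surname": True, "city": True, "birthdate": False})
MATCH_CUTOFF, CLERICAL_CUTOFF = 6.0, 2.0     # invented thresholds
if w >= MATCH_CUTOFF:
    decision = "match"
elif w >= CLERICAL_CUTOFF:
    decision = "clerical review"             # the grey area in the middle
else:
    decision = "non-match"
print(round(w, 2), decision)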
Matching makes use of a concept called “blocking”, which is an unfortunately-chosen term that means that potential sets of duplicates form blocks (or groups, or sets) which can be treated as separate sets of potentially duplicated values. Each block of potential duplicates is given a unique ID, which can be used by the next phase (survivorship) and can also be used to set up a table of linkages between the blocks of potential duplicates and the keys to the original data records that are in those blocks. This is often a requirement when de-duplication is being performed, for example when combining records from multiple sources, or generating a list of unique addresses from a customer file, et cetera.

More than one pass through the data may be required to identify all the potential duplicates. For example, one customer record may refer to a customer with a street address but another record for the same customer may include the customer's post office box address. Searching for duplicate addresses would not find this customer; an additional pass based on some other criteria would also be required. Quality Stage does provide for multiple passes, either fully passing through the data for each pass, or only examining the unmatched records on subsequent passes (which is usually faster).
Matching vs. Lookups, Joins, and Merges
- Within Information Server, multiple stages offer capability that can be considered matching, for example:
  - Lookup
  - Join
  - Merge
  - Unduplicate Match
  - Reference Match
- Lookups, Joins, and Merges typically use key attributes, exact match criteria, or matches to a range of values or simple formats
- The Unduplicate Match stage and Reference Match stage offer probabilistic matching capability
There are two types of match stage:
- Unduplicate Match locates and groups all similar records within a single input data source. This process identifies potential duplicate records, which might then be removed.
- Reference Match identifies relationships among records in two data sources. An example of many-to-one matching is matching the ZIP codes in a customer file with the list of valid ZIP codes. More than one record in the customer file can have the same ZIP code in it.
Blocking step
- Blocking provides a method of limiting the number of pairs to examine. When you partition data sources into mutually exclusive and exhaustive subsets and only search for matches within a subset, the process of matching becomes manageable (see the sketch below).
- Basic blocking concepts include:
  - Blocking partitions the sources into subsets that make computation feasible. Block size is the single most important factor in match performance. Blocks should be as small as possible without causing block overflows. Smaller blocks are more efficient than larger blocks during matching.
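A minimal Python sketch of the blocking idea: records are grouped by a blocking key and candidate pairs are formed only within each block, which is what keeps the number of comparisons manageable; the records and the choice of postcode as the blocking key are invented for illustration.

from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "surname": "SMITH", "postcode": "01581"},
    {"id": 2, "surname": "SMYTH", "postcode": "01581"},
    {"id": 3, "surname": "JONES", "postcode": "90210"},
    {"id": 4, "surname": "JONES", "postcode": "90210"},
]

# Block on postcode: only records sharing a postcode are compared to each other.
blocks = defaultdict(list)
for rec in records:
    blocks[rec["postcode"]].append(rec)

candidate_pairs = [
    (a["id"], b["id"])
    for block in blocks.values()
    for a, b in combinations(block, 2)
]
print(candidate_pairs)   # [(1, 2), (3, 4)] instead of all 6 possible pairs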
Reference Match Stage
- The Reference Match stage identifies relationships among records. This match can group records that are being compared in different ways, as follows:
  - One-to-many matching
  - Many-to-one matching
One-to-many matching
- Identifies all records in one data source that correspond to a record for the same individual, event, household, or street address in a second data source.
- Only one record in the reference source can match one record in the data source because the matching applies to individual events. E.g. finding the same individual by comparing SSNs in a voter registration list and a department of motor vehicles list.
Many-to-one matching
- Multiple records in the data file can match a single record in the reference file. E.g. matching a transaction data source to a master data source allows many transactions for one person in the master data source.
The Reference Match stage delivers up to six outputs, as follows:
- Match – contains matched records for both inputs
- Clerical – has records that fall in the clerical range for both inputs
- Data Duplicate – contains duplicates in the data source
- Reference Duplicate – contains duplicates in the reference source
- Data Residual – contains records that are non-matches from the data input
- Reference Residual – contains records that are non-matches from the reference input




Survivorship
As the name suggests, survivorship is about what becomes of the data in these blocks of potential duplicates. The idea is to get the “best of breed” data out of each block, based on built-in or custom rules such as “most frequently occurring non-missing value”, “longest string”, “most recently updated” and so on.
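A minimal Python sketch of applying survivorship rules such as “most frequently occurring non-missing value” and “longest string” within one block of potential duplicates; the records and the per-column rule choices are invented for illustration.

from collections import Counter

duplicates = [
    {"name": "BILL GATES",    "dob": "1955-10-28", "city": "SEATTLE"},
    {"name": "WILLIAM GATES", "dob": "",           "city": "SEATTLE"},
    {"name": "W GATES",       "dob": "1955-10-28", "city": "REDMOND"},
]

def most_frequent_non_missing(values):
    values = [v for v in values if v]
    return Counter(values).most_common(1)[0][0] if values else ""

def longest_string(values):
    return max(values, key=len) if values else ""

# Rule per column: which value "survives" into the master record.
rules = {"name": longest_string,
         "dob": most_frequent_non_missing,
         "city": most_frequent_non_missing}

master = {col: rule([rec[col] for rec in duplicates]) for col, rule in rules.items()}
print(master)
# {'name': 'WILLIAM GATES', 'dob': '1955-10-28', 'city': 'SEATTLE'}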
The data that fulfill the requirements of these rules can then be handled in a couple of ways. One technique is to come up with a “master record” – a “single version of the truth” – that will become the standard for the organization. Another possibility is that the improved data could be populated back into the source systems whence they were derived; for example, if one source were missing date of birth, this could be populated because the date of birth was obtained from another source (or more than one). If this is not the requirement (perhaps for legal reasons), then a table containing the linkage between the source records and the “master record” keys can be created, so that the original source systems also have the ability to refer to the “single source of truth” and vice versa.
Address Verification and Certification
Quality Stage can do more than simple matching. Address verification can be performed; that is, whether or not the address is in a valid format can be reported. Out of the box, address verification can be performed down to city level for most countries. For an extra charge, an additional module for worldwide address verification (WAVES) can be purchased, which will give address verification down to street level for most countries.

For some countries, where the postal systems provide appropriate data (for example CASS in the USA, SERP in Canada, DPID in Australia), address certification can be performed: in this case, an address is given to Quality Stage and looked up against a database to report whether or not that particular address actually exists. These modules carry an additional price, but that includes IBM obtaining regular updates to the data from the postal authorities and providing them to the Quality Stage licensee.
New Address Verification Module

Summary
- IBM is planning to release the next version of its InfoSphere Quality Stage Worldwide Address Verification module (v10)
  - Release time frame is Q4 2012
  - AVI v10 will have superior functionality and coverage over the current AVI v8.x module (see slide 4)
  - AVI v10 will leverage new address/decoding reference data
  - AVI v10 will have broad support for various Information Server versions (see slide 5)
- For current AVI v8.x customers only:
  - AVI v8.x will have continued support until the end of Dec. 2013
    - Address reference data for AVI v8.x has been discontinued by the vendor, ending in Dec. 2013
  - AVI v10 will include a migration utility for automated migration from AVI v8.x to AVI v10
  - For comparison, AVI v10 and AVI v8 can run side-by-side (for development)

Information Server / Operating System support matrix for AVI v10




Stage Icon and Location

Quality Stage Benefits
Quality Stage provides the most powerful, accurate matching available, based on probabilistic matching technology, easy to set up and maintain, and providing the highest match rates available in the market.

An easy-to-use graphical user interface (GUI) with an intuitive, point-and-click interface for specifying automated data quality processes – data investigation, standardization, matching, and survivorship – reduces the time needed to deploy data cleansing applications.

Quality Stage offers a thorough data investigation and analysis process for any kind of free-formatted data. Through its tight integration with Data Stage and other Information Server products it also offers fully integrated management of the metadata associated with those data.

There exists rigorous scientific justification for the probabilistic algorithms used in Quality Stage; results are easy to audit and validate.

Worldwide address standardization, verification and enrichment capabilities – including certification modules for the United States, Canada, and Australia – add to the value of cleansed address data.

Domain-agnostic data cleansing capabilities, including product data, phone numbers, email addresses, birth dates, events, and other comment and descriptive fields, are all handled. Common data quality anomalies, such as data in the wrong field or data spilling over into the next field, can be identified and addressed.

Extensive reporting providing metrics yields business intelligence about the data and helps tune the application for quality assurance.

Service oriented architecture (SOA) enablement with InfoSphere Information Services Director allows you to leverage data quality logic built using IBM InfoSphere Information Server and publish it as an "always on, available everywhere" service in a SOA – in minutes.

The bottom line is that Quality Stage helps to ensure that systems deliver accurate, complete, trusted information to business users both within and outside the enterprise.