Data Warehousing: Strategies, Technologies And Techniques
Statistical Analysis
Source: www.spss.com
Copyright SPSS, Inc. 2004
Data mining is uncovering the hidden meaning and relationships in the massive amounts of
data stored in the data warehouse. In short, the value of the data warehouse lies in the
information that can be derived from its data through the mining process. Successful
mining of data relies on refining tools and techniques capable of rendering large
quantities of data understandable and meaningful. Since its emergence in the 18th century,
statistics has served this purpose, providing the mathematical tools and analytic
techniques for dealing with large amounts of data. Today, as we are confronted with
increasingly large volumes of data, statistics is, more than ever, a critical component
of the data mining and refining toolkit that facilitates making effective business
decisions.
What are statistics and why use them?
Way of thinking
Statistics is a general method of reasoning from data. It is a basic approach that people
use to draw conclusions and make decisions in business and in life. It lets us
communicate effectively about a wide range of topics, from sales performance to product
quality to operational efficiency. Statistics is the way that we reason effectively about
data and chance in everyday life. The goal of statistical analysis is to gain insight
through numbers. We will consider four important aspects of statistics: producing good
data, exploring data, drawing conclusions from the data, and presenting your results.
Producing data
You will have a wealth of data in the warehouse and available from outside sources.
Three concepts matter when selecting the data you actually use in your analysis:
sampling, experimentation and measurement. They are important because the efficiency and
accuracy of your analysis, and therefore your ability to draw useful conclusions in a
timely manner, depend on how well the data reflect the business situation.
Exploring data
Exploring data is important for understanding the quality of the data in the warehouse
and for beginning to look for areas to mine for information. Exploring data will tell you if
most of the observations are missing or will indicate if the measurements are suspect
because of extreme variability. In effect, exploratory data analysis gives you a
feel for the data and helps uncover possible directions the analysis can
take. Just as a mining company explores the terrain looking for the place to put a mine
with the highest likelihood of success, so too does the data miner need to gain a sense of
where the key relationships are in the data. Equally important, exploring data
will serve to highlight any problems inherent in the database in terms of inaccurate or
missing data.
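As a rough illustration of these checks, the Python sketch below reports the share of missing observations in one column and its coefficient of variation (standard deviation relative to the mean) as a simple signal of extreme variability. The sample values, the use of None to mark a missing observation, and the function name profile_column are assumptions made for this example, not part of any particular tool.

```python
import statistics

def profile_column(values):
    """Return (share of missing values, coefficient of variation or None)."""
    if not values:
        return 0.0, None
    present = [v for v in values if v is not None]
    missing_share = 1 - len(present) / len(values)
    # Coefficient of variation: sample standard deviation divided by the mean.
    # It needs at least two observed values and a non-zero mean.
    cv = None
    if len(present) >= 2:
        mean = statistics.mean(present)
        if mean != 0:
            cv = statistics.stdev(present) / abs(mean)
    return missing_share, cv

# Hypothetical order values; None marks a missing observation.
order_values = [120.0, None, 95.0, 110.0, None, 4000.0, 105.0, None]
missing_share, cv = profile_column(order_values)
print(f"missing share: {missing_share:.3f}")
print(f"coefficient of variation: {cv:.2f}")
```

A coefficient of variation well above 1, as the single very large value produces here, is one hint that a measurement deserves a closer look before it is used in analysis.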
The first step in data analysis must be to explore the data for overall patterns and extreme
exceptions to those patterns. This is best done by graphing the data and visually
identifying the patterns and the number of exceptions. In exploring data we typically look
at each variable separately, starting with basic counts and percentages, which tell us the
number and proportion of observations at each level. Then we look at the distributions of the
data using charts such as histograms, dot plots, boxplots and line charts. We also
look at summary measures that describe the data in terms of average, variability and
distribution.
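The basic counts and percentages mentioned above can be sketched in a few lines of Python. The lead-source values and the function name frequency_table are hypothetical, and the row of '#' characters is only a crude stand-in for a proper chart.

```python
from collections import Counter

def frequency_table(values):
    """Return (level, count, proportion) rows, most frequent level first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(level, n, n / total) for level, n in counts.most_common()]

# Hypothetical lead sources for ten orders.
lead_sources = ["web", "phone", "web", "trade show", "web",
                "phone", "web", "trade show", "web", "phone"]
table = frequency_table(lead_sources)
for level, n, share in table:
    # One '#' per observation gives a quick text bar chart.
    print(f"{level:<12}{n:>3}  {share:6.1%}  {'#' * n}")
```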
Descriptive statistics include the following measures:
• Mean: the arithmetic average of the values
• Median: the middle value when the values are sorted
• Mode: the most frequent value
• Percentiles: cut points that divide the values by the percentage falling above and below
• Variance: the average squared deviation of observations from the mean
• Standard deviation: the square root of the variance, a measure of the spread of values around the mean
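A minimal sketch of these measures using Python's standard statistics module follows. The order values are made up for illustration. The population forms (pvariance, pstdev) are used here, which divide by the number of observations; the sample forms (variance, stdev) divide by one less and are the usual choice when the data are a sample from a larger population.

```python
import statistics

# Hypothetical order values.
values = [95, 105, 105, 110, 120, 130, 175]

mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)
# quantiles(n=4) returns the three cut points that split the data into quarters
# (the 25th, 50th and 75th percentiles).
quartiles = statistics.quantiles(values, n=4)
# Population variance: the average squared deviation from the mean.
variance = statistics.pvariance(values)
# Population standard deviation: the square root of the variance.
std_dev = statistics.pstdev(values)

print("mean:", mean)
print("median:", median)
print("mode:", mode)
print("quartiles:", quartiles)
print("variance:", round(variance, 2))
print("standard deviation:", round(std_dev, 2))
```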
Drawing conclusions from data
Statistics is concerned with finding relationships between variables. Once you have
mined down to an area with an interesting relationship, statistics provides the additional tools
to refine the data into an understanding of the strength of the relationship
and the factors that cause it. For example, order value and sales lead
source are each interesting characteristics to measure and summarize. But order value and
sales lead source for the same order give us significantly more information than either
measure alone. When we have the source and the value of orders linked, we can look for
associations between source and value. That can lead us to evaluate higher
promotion spending on the sources that bring the most high-value orders, or possibly on
the sources that bring the highest total revenue, even if that revenue is booked as smaller
transactions in higher volume.
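The order example can be made concrete with a small Python sketch that links each order's lead source to its value and then summarizes by source. The orders, the 4000 cutoff for a "high-value" order, and the function name summarize_by_source are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical (lead source, order value) pairs; each pair is one order.
orders = [
    ("trade show", 5000.0), ("trade show", 4200.0),
    ("web", 300.0), ("web", 450.0), ("web", 250.0), ("web", 5200.0),
    ("web", 400.0), ("web", 350.0), ("web", 3000.0), ("web", 500.0),
    ("phone", 900.0), ("phone", 1100.0),
]
HIGH_VALUE = 4000.0  # assumed cutoff for a "high-value" order

def summarize_by_source(orders, threshold):
    """Return per-source order count, total revenue and high-value order count."""
    summary = defaultdict(lambda: {"orders": 0, "revenue": 0.0, "high_value": 0})
    for source, value in orders:
        row = summary[source]
        row["orders"] += 1
        row["revenue"] += value
        if value >= threshold:
            row["high_value"] += 1
    return dict(summary)

summary = summarize_by_source(orders, HIGH_VALUE)
# List sources from highest to lowest total revenue.
for source, row in sorted(summary.items(), key=lambda kv: -kv[1]["revenue"]):
    print(f"{source:<12}orders={row['orders']:<3} "
          f"revenue={row['revenue']:>9.2f} high-value={row['high_value']}")
```

In this made-up data the trade show brings the most high-value orders, while the web brings the highest total revenue through many smaller transactions, which is the kind of trade-off the paragraph above describes.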
Applications of statistics in data mining
Statistical analysis is the secret weapon of many successful businesses today. It is
the essential tool for mining the data you have, refining that data into useful information
and leading you to other data you might want to acquire. Businesses that effectively
employ statistical analysis can increase revenues, cut costs, improve operating efficiency
and improve customer satisfaction. They can more accurately identify problems and
opportunities and understand their causes so that they can more quickly act to eliminate
threats or capitalize on opportunities.