09 June 2009

Essentials of Data Quality for Predictive Modeling

In my last blog, I mentioned a change taking place in actuarial analysis that leads us to look at the data in a whole new way. More actuaries are using Predictive Modeling in place of, or to supplement, traditional insurance pricing. Many different models can be used for predictive modeling, but the most common is Generalized Linear Modeling (GLM). Before I discuss the data quality considerations, I first need to talk about why there has been a movement towards GLM.

In traditional pricing, an overall rate level is determined. One of the techniques to do this was described in my last blog. Once this is done, the actuary must then determine how rates should differ based on characteristics of the policyholder. This is called class rating. If the risk underlying the policy is large enough, the actual claims experience for that risk will be used to determine the rate. This discussion assumes that is not the case.

One-Way Analysis
When class rating first started, each rating factor was analyzed separately. The actuary would aggregate the experience by class and calculate a loss ratio for each class, adjusted to current rates so that all classes are on the same basis. Each class's adjusted loss ratio is then compared to that of a base class, and the ratio of the two determines the rate relativity applied to that class.
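
To make this concrete, here is a minimal sketch of a one-way analysis in Python. The DataFrame, its column names, and the choice of base class are all illustrative assumptions, not part of any particular rating plan.

    import pandas as pd

    # Hypothetical policy-level data (names and values are illustrative only).
    policies = pd.DataFrame({
        "class": ["A", "A", "B", "B", "C"],
        "premium_at_current_rates": [1000, 1200, 800, 900, 1100],
        "incurred_losses": [600, 700, 560, 640, 770],
    })

    # Aggregate the experience by class.
    by_class = policies.groupby("class")[["premium_at_current_rates", "incurred_losses"]].sum()

    # Adjusted loss ratio: losses divided by premium restated at current rates.
    by_class["loss_ratio"] = by_class["incurred_losses"] / by_class["premium_at_current_rates"]

    # Indicated relativity: each class's loss ratio relative to the base class.
    base_class = "A"  # assumed base class
    by_class["indicated_relativity"] = by_class["loss_ratio"] / by_class.loc[base_class, "loss_ratio"]

    print(by_class)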

Two-Way Analysis
Actuaries found that it was not adequate to perform a one-way analysis for every factor, because there can be correlations between different rating factors. The classic example is in Personal Auto insurance. Consider two rating factors: the first is Age and the second is Gender. Experience has shown that young males have worse claims experience than young females, and that younger drivers have worse claims experience than older drivers (up to a certain age). However, as drivers age, the difference between males' and females' claims experience almost disappears.

If we look at these factors separately, an older male driver would have a higher rate than a female driver of the same age. This is because we would compare the total of all males to the total of all females; since the younger males and females are included in that data, the overall experience is worse for males, who would therefore receive a higher rate. Two-way analysis solves this problem when two factors are correlated.

Generalized Linear Modeling
GLM expands the analysis to multiple variables (a minimal fitting sketch follows the list below). There are several disadvantages to two-way analysis:
  • There may be three or more factors that are correlated. Age of vehicle, type of vehicle, etc. may also be correlated with age and gender.
  • You have to know the correlations in advance. It is common knowledge that gender and age are correlated, but there may be less obvious correlations that are not apparent.
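
Here is a minimal sketch of fitting a GLM to policy-level data with Python's statsmodels. The simulated DataFrame, its columns (claim_count, age_band, gender, exposure), and the choice of a Poisson frequency model with a log link are all assumptions for illustration; other families and variables are common in practice.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Simulated, hypothetical policy-level data.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "claim_count": rng.poisson(0.1, size=1000),
        "age_band": rng.choice(["16-25", "26-60", "60+"], size=1000),
        "gender": rng.choice(["M", "F"], size=1000),
        "exposure": np.ones(1000),
    })

    # Poisson frequency GLM; exposure enters as an offset on the log scale.
    model = smf.glm(
        "claim_count ~ C(age_band) + C(gender)",
        data=df,
        family=sm.families.Poisson(),
        offset=np.log(df["exposure"]),
    )
    result = model.fit()
    print(result.summary())

    # With a log link, the exponentiated coefficients act as multiplicative relativities.
    print(np.exp(result.params))

The point of the multivariate fit is that each factor's relativity is estimated while accounting for all the other factors at once, which is what the one-way and two-way approaches cannot do.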

Goals of Data Quality for Predictive Modeling
Predictive modeling is a statistical exercise; therefore, to be able to create a model, data is essential. Unlike traditional pricing, where a few key fields are aggregated, predictive modeling requires detailed data. This means the data needs to be at the policy and claims level. In addition, many more fields are analyzed in predictive modeling than in traditional pricing.

For a model to converge, the data must be complete. If there are missing values, the records that contain them cannot be used in the modeling. One technique for resolving this is imputation, but the quality of imputations is subject to wide variation.

The quality of the data impacts the predictive accuracy. Therefore, if the data is inaccurate, incomplete and inconsistent, then the prediction will also be inaccurate, incomplete and inconsistent.

Data Quality Tests
I group the Data Quality tests into four main categories. These are the essentials; many times the results of these tests and the subsequent drill-downs will lead to more tests.

Integration Tests
My experience is that the data required for predictive modeling will come from many different data sources. Therefore, the data must be integrated. In performing this integration, there is a chance that records could be duplicated or deleted unintentionally. The integration tests compare the destination data set with the source data. Ways to test this include comparing record counts and the totals of key numerical fields.
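
A minimal sketch of such a test in Python, assuming hypothetical source and destination DataFrames and illustrative field names:

    import pandas as pd

    def compare_datasets(source: pd.DataFrame, destination: pd.DataFrame, key_fields: list) -> pd.DataFrame:
        """Compare record counts and totals of key numerical fields between two data sets."""
        report = pd.DataFrame({
            "source": [len(source)] + [source[f].sum() for f in key_fields],
            "destination": [len(destination)] + [destination[f].sum() for f in key_fields],
        }, index=["record_count"] + key_fields)
        report["difference"] = report["destination"] - report["source"]
        return report

    # Illustrative data: the destination contains an unintended duplicate record.
    source = pd.DataFrame({"premium": [100, 200, 300], "loss": [50, 0, 75]})
    destination = pd.DataFrame({"premium": [100, 200, 300, 300], "loss": [50, 0, 75, 75]})
    print(compare_datasets(source, destination, ["premium", "loss"]))
    # Any non-zero difference flags something to investigate before modeling.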

Reconciliation
Reconciliation is comparing the results of key fields with reports that are familiar to management, finance, actuarial and others. Any difference between the two should be explainable. Many times these differences lead to improvements in the quality of the data; other times they lead the modeler to filter certain data out of the data set before analysis.
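
A minimal sketch of a reconciliation check, where both the modeling total and the management report figure are made-up numbers for illustration:

    # Total written premium summed from the modeling data set (illustrative).
    modeling_written_premium = 12_450_300.00
    # The corresponding figure from a report that management is familiar with (illustrative).
    report_written_premium = 12_500_000.00

    difference = modeling_written_premium - report_written_premium
    pct_difference = difference / report_written_premium
    print(f"Difference: {difference:,.2f} ({pct_difference:.2%})")

    # The goal is not to force a match, but to be able to explain the difference
    # (for example, excluded lines of business or timing of the extract).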

Data Profiling
The main tests we do for data profiling are looking at fill rates, frequency distributions and min/max values. The data quality analyst should look at the results of these tests and determine which areas to analyze next. Arkady Maydanchik (co-founder of Data Quality Group) says, “Data Profiling does not answer any questions - it helps us ask meaningful questions”.
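
A minimal profiling sketch in pandas, using a hypothetical DataFrame with made-up values:

    import pandas as pd

    df = pd.DataFrame({
        "state": ["MO", "MO", "AL", None, "AL"],
        "vehicle_age": [2, 15, None, 7, 120],   # 120 looks suspicious
        "premium": [500, 750, 620, 480, 510],
    })

    # Fill rates: share of non-missing values in each column.
    print(df.notna().mean())

    # Frequency distribution of a categorical field, including missing values.
    print(df["state"].value_counts(dropna=False))

    # Min/max and other summary statistics for numeric fields.
    print(df[["vehicle_age", "premium"]].describe())

As the quote above suggests, these outputs do not answer questions by themselves; they point to the questions worth asking next.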

Business Rules
There are certain business rules that should hold for insurance data. The policy effective date should be before the date that the claim occurs (the loss date). Every claim should have a corresponding policy. Premiums and Claims should not be negative.

In addition, some tests are done on values that are possible but not probable. For example, many Commercial Insurance policies have a minimum premium. However, exceptions may be made under certain circumstances, so a policy violating the minimum premium business rule is not necessarily wrong. It just signals an area for further analysis.
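
A minimal sketch of both kinds of checks in Python. The DataFrames, field names, dates and the minimum premium are all illustrative assumptions.

    import pandas as pd

    policies = pd.DataFrame({
        "policy_id": [1, 2, 3],
        "effective_date": pd.to_datetime(["2008-01-01", "2008-06-01", "2008-09-01"]),
        "written_premium": [1200, -50, 150],
    })
    claims = pd.DataFrame({
        "policy_id": [1, 2, 4],   # policy 4 does not exist in the policy data
        "loss_date": pd.to_datetime(["2008-03-15", "2008-05-20", "2008-07-01"]),
    })

    # Hard rules: violations are treated as errors.
    orphan_claims = claims[~claims["policy_id"].isin(policies["policy_id"])]
    negative_premium = policies[policies["written_premium"] < 0]
    merged = claims.merge(policies, on="policy_id")
    loss_before_effective = merged[merged["loss_date"] < merged["effective_date"]]

    # Soft rule: possible but not probable, so these are only flagged for review.
    MIN_PREMIUM = 500  # assumed minimum premium
    below_minimum = policies[policies["written_premium"].between(0, MIN_PREMIUM, inclusive="left")]

    for name, frame in [("orphan claims", orphan_claims),
                        ("negative premium", negative_premium),
                        ("loss before effective date", loss_before_effective),
                        ("below minimum premium (review only)", below_minimum)]:
        print(name, len(frame))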

Data Scrubbing
If there are data quality issues, the data will need to be cleaned to prepare it for modeling. There are several techniques that we have used to clean the data.

Normalization – Adjusting a group of fields so that they add to a given value. This is usually done when the fields should add to 100% but they add to some other value. The data is normalized by dividing each value by the ratio of the total of all fields to 100% (equivalently, multiplying each value by 100% divided by the total).
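
A minimal sketch with made-up percentages that should sum to 100% but total 110%:

    # Hypothetical split of insured value across coverages (illustrative only).
    splits = {"building": 45.0, "contents": 40.0, "business_interruption": 25.0}

    total = sum(splits.values())          # 110, not 100
    normalized = {k: v * 100.0 / total for k, v in splits.items()}

    print(normalized)
    print(sum(normalized.values()))       # sums back to 100 (up to floating-point rounding)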

Imputations – This is the filling in of missing data. Different techniques that we have used are carry forward, carry backward, default value and expert opinion. Carry forward and carry backward involve using values from previous or subsequent records, with the knowledge that certain values are unlikely to change if you carry them forward into the future or back into the past. A default value is used to fill in data when it is assumed that a missing value implies a given answer. For example, if an application question gives a discount for answering “Yes”, then you could infer that anyone who didn’t answer would have chosen “No”. Expert opinion relies on the knowledge of subject-matter experts to infer missing values.
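
A minimal sketch of carry forward, carry backward and default-value imputation in pandas. The DataFrame, field names and the discount rule are illustrative assumptions, and expert opinion would be applied outside the code.

    import pandas as pd

    # Hypothetical policy terms, sorted by policy and term year.
    terms = pd.DataFrame({
        "policy_id": [1, 1, 1, 2, 2],
        "term": [2006, 2007, 2008, 2007, 2008],
        "construction_type": ["frame", None, None, None, "masonry"],
        "sprinkler_discount": ["Y", None, "Y", None, None],
    })

    # Carry forward / carry backward within each policy: construction type is
    # unlikely to change from one term to the next.
    terms["construction_type"] = (
        terms.groupby("policy_id")["construction_type"]
             .transform(lambda s: s.ffill().bfill())
    )

    # Default value: an unanswered discount question is treated as "N".
    terms["sprinkler_discount"] = terms["sprinkler_discount"].fillna("N")

    print(terms)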

Translations – This involves changing the value of the data to be consistent with other values. For example, you could have some records with a 0 and 1 and others with a Y and N. The 0 and 1 are likely to be equivalent to N and Y respectively.
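
A minimal sketch of a translation, assuming one source codes a flag as 0/1 and another as Y/N:

    import pandas as pd

    # Illustrative flag values pulled from two different sources.
    flags = pd.Series([0, 1, "Y", "N", 1])

    # Translate the numeric codes so all records use the same convention.
    translated = flags.replace({0: "N", 1: "Y"})
    print(translated.tolist())   # ['N', 'Y', 'Y', 'N', 'Y']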

Cleaning – This is the correction of erroneous data. There are times when we find data that is simply wrong. If an authoritative source is available, you can replace the erroneous value with the correct one.

Mapping – Mapping is similar to translation, but usually involves creating a new field that groups the values together. For example, one data source could have a code of AA while another data set has a code of 123AL. These codes could both mean “Alabama Program”. Without the mapping, there is no way to match these records.
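
A minimal sketch of a mapping, reusing the AA and 123AL codes from the paragraph above plus an assumed second program for contrast:

    import pandas as pd

    records = pd.DataFrame({
        "source": ["system_1", "system_1", "system_2"],
        "program_code": ["AA", "BB", "123AL"],
    })

    # Map the source-specific codes to a common program name (assumed meanings).
    program_map = {"AA": "Alabama Program", "123AL": "Alabama Program", "BB": "Georgia Program"}
    records["program"] = records["program_code"].map(program_map)

    print(records)
    # Records from both systems can now be matched on the new "program" field.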

Other Considerations
Although the technical aspects of data quality described above are important to any predictive modeling project, the project will fail (or be delayed) if there is not clear documentation of the processes and clear communication between all parties.

01 June 2009

Traditional Insurance Pricing - A Data Quality Perspective

As I said in my first blog, my formal training is as a pricing actuary. In traditional pricing, the emphasis is on projecting the overall price for insurance. This may be either at a portfolio level (e.g. all personal auto policies in the state of Missouri) or at an individual level (e.g. Liability insurance for a Fortune 500 company).

For property and casualty insurance, one common technique is to use past premium and claims data to project future claim costs. Once these future costs are determined, additional amounts for expenses, capital costs, profits and contingencies are added to the costs.

The additional components have their own data challenges, but today I am going to talk about the data quality considerations for projecting future claim costs. To begin, there are certain actuarial and/or statistical assumptions that I will start with. Sometimes these hold in the real world and sometimes they don't, but they are not central to the data quality discussion.

Assumptions:
  • There is a sufficient amount of data to be able to make a future projection.
  • The policies that were written in the past are expected to be similar to the policies written in the future.
  • The data is homogeneous.
  • The data is not distorted by an unexpected number of large claims (either too few or too many).
Adjustments:

Once you have the data from the past, there are several things that need to be done to it. Usually the premium and claims data is aggregated by year, either by Policy Year or by Earned/Accident Year. Policy Year groups premium and claims by the effective date of the policy, while Accident Year matches the claims that occur during a calendar year with the premium earned during that year. The choice of which to use does not change the technique; it only changes the values of the parameters that are used to adjust the data.
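
A minimal sketch of the two aggregation bases, using a hypothetical claims DataFrame with made-up dates and amounts:

    import pandas as pd

    claims = pd.DataFrame({
        "effective_date": pd.to_datetime(["2006-11-01", "2007-03-01", "2007-10-15"]),
        "loss_date": pd.to_datetime(["2007-02-10", "2007-08-05", "2008-01-20"]),
        "incurred_loss": [10_000, 4_000, 7_500],
    })

    # Policy year: group losses by the year of the policy effective date.
    policy_year = claims.groupby(claims["effective_date"].dt.year)["incurred_loss"].sum()

    # Accident year: group losses by the year in which the loss occurred.
    accident_year = claims.groupby(claims["loss_date"].dt.year)["incurred_loss"].sum()

    print(policy_year)
    print(accident_year)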

There are three main adjustments made to the data, and they are done for two main reasons. The first is to adjust the data so that it is on today's basis, and the second is to project it to the time period in which the future policies will be written.

First, over the time period of the data, the policies were written at many different rates. The goal of pricing using this method is to determine how much TODAY's rates need to be changed for the future period. Therefore, the premium from past years needs to be adjusted so that it is equivalent to premium at today's rates.

Second, at the time of the analysis there will usually be future payments still to come on many claims. Adjustments have to be made to develop the claims to their ultimate value.

Third, the claims (and sometimes the premium) need to be adjusted to today's dollars and then trended to the period in which the future policies will be written.
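
A minimal sketch of these three adjustments. The factors below are purely illustrative; in practice they come from the on-level, loss development and trend analyses.

    # Illustrative historical amounts for one year of data.
    historical_premium = 1_000_000
    historical_losses = 620_000

    # Illustrative adjustment factors (assumed, not derived here).
    on_level_factor = 1.08          # restates past premium at today's rates
    loss_development_factor = 1.15  # develops reported claims to their ultimate value
    trend_factor = 1.05             # moves losses to the future policy period's cost level

    on_level_premium = historical_premium * on_level_factor
    ultimate_trended_losses = historical_losses * loss_development_factor * trend_factor

    print(on_level_premium, ultimate_trended_losses)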

Loss Ratio:
After these adjustments are made, a loss ratio is calculated for each year by dividing the adjusted claim costs by the adjusted premium. The loss ratios are averaged over all of the years (using either simple or weighted averages) and then compared to a target loss ratio. If the average loss ratio is higher than the target loss ratio, the price needs to be increased; if it is lower, the price needs to be decreased.
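
A minimal sketch of the final step, continuing the illustrative numbers from the sketch above and assuming a target loss ratio:

    # Adjusted amounts carried over from the previous illustrative sketch.
    on_level_premium = 1_080_000.0
    ultimate_trended_losses = 748_650.0

    loss_ratio = ultimate_trended_losses / on_level_premium
    target_loss_ratio = 0.65   # assumed target after expenses, profit and contingencies

    # Indicated rate change: how much today's rates need to move.
    indicated_rate_change = loss_ratio / target_loss_ratio - 1
    print(f"Loss ratio: {loss_ratio:.1%}, indicated change: {indicated_rate_change:+.1%}")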

Data Quality Considerations
For traditional pricing, data is aggregated. Because of this, there are two main data quality considerations:
  • Reconciliation - The premium and claims data that is used for pricing can be compiled in various ways. It can be from a Data Warehouse, Data Marts, directly from source systems or various reports. When management sees the results of the pricing analysis, they will compare this to data that they have seen from other reports. Therefore, the premium and claims should be reconciled to a report that management is familiar with. The purpose of this reconciliation exercise is not to match with the management reports. It is to be able to explain any differences that exist between the two.
  • Reasonableness - The actuary that is performing the analysis is usually someone who is very familiar with the product that he/she is pricing. Therefore, when the premium and claims are aggregated and when the adjusted loss ratio is calculated, the values should be within a range of reasonableness expected by the actuary.
Because the data is aggregated, many dimensions such as completeness, consistency, duplication, etc. are not always addressed in a traditional pricing analysis. Many times the actuary will assume that these tests have already been conducted on the data beforehand and will trust the data as given.

In my next blog, I will be talking about a change in actuarial analysis that is taking place that leads us to look at the data in a whole new way and has required a change in how the actuary considers data quality.

30 May 2009

Introduction

I have been in love with Data from the time I first laid eyes on her. (Is data gender-specific?) But I am not formally trained in Data Management or Data Quality. However, through various career moves, I now manage a data management team and almost all my work involves data management and data quality.

My formal training is as an Actuary, and my 13-year career has been spent entirely in the insurance industry, mostly involved with the pricing of insurance policies. To determine prices, we have to look at how the past has performed to try to project the future. There is definitely a lot of data involved in these projections.

Because of the sheer amount of data needed for pricing decisions, I had been doing some form of data quality work throughout my career, though very informally. Either it would be reconciling data with other sources, checking it for reasonableness or checking to see if certain business rules were followed. (I did not call them business rules at that time.)

Last year, for the first time, I embarked on my first formal data quality project. It was met with some successes and MANY learning opportunities (I don't believe in failures). I learned a lot from this first project and have applied that knowledge to help build a process for my current projects.

Since I was not formally trained in Data Quality, I have had to do a lot of research, and I have found a lot of interesting sources on the web which have been invaluable. One such resource is DataQualityPro.com. I saw a tweet the other day about "20 simple tips to spice up your data quality blog" at http://www.dataqualitypro.com/data-quality-home/20-simple-tips-to-spice-up-your-data-quality-blog.html. I had been reading other blogs, especially those by Dylan Jones on DataQualityPro.com, Daragh O'Brien at iaidq.org and Jim Harris at ocdqblog.com. These blogs have given me some good resources to use for my own work.

I was reading through the 20 tips that Dylan had provided and started thinking of ideas for my own blog. My perspective on data quality will be as an Actuary and from the insurance industry. It is a different perspective than what is seen in much of the data quality literature. I will be expanding on it in future blogs.

I look forward to the insights from the experts (and others who just have opinions) since I still consider myself a novice in the world of data quality.