09 June 2009

Essentials of Data Quality for Predictive Modeling

In my last blog, I mentioned a change taking place in actuarial analysis that leads us to look at data in a whole new way. More actuaries are using Predictive Modeling in place of, or to supplement, traditional insurance pricing. Many different models can be used for predictive modeling, but the most common is Generalized Linear Modeling (GLM). Before I discuss the data quality considerations, I first need to talk about why there has been a movement toward GLMs.

In traditional pricing, an overall rate level is determined. One of the techniques to do this was described in my last blog. Once this is done, the actuary must then determine how rates should differ based on characteristics of the policyholder. This is called class rating. If the risk underlying the policy is large enough, the actual claims experience for that risk will be used to determine the rate. This discussion assumes that is not the case.

One-Way Analysis
When class rating first started, each rating factor was analyzed separately. The actuary would aggregate the experience by class and compare the adjusted loss ratios across classes. The loss ratios are adjusted to the current rate level so that all classes are on the same basis. The adjusted loss ratio for each class is then compared to that of a base class, and the ratio of the class's adjusted loss ratio to the base class's determines the rate applied to that class.
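To make the arithmetic concrete, here is a minimal sketch in Python of that one-way relativity calculation. The class names and adjusted loss ratios are made up purely for illustration.

# One-way analysis sketch: indicated rate relativities from adjusted loss ratios.
# The class names and loss ratios below are made up for illustration.

adjusted_loss_ratios = {
    "adult": 0.62,     # base class
    "youthful": 0.93,
    "senior": 0.68,
}

base_class = "adult"
base_lr = adjusted_loss_ratios[base_class]

# Each class's indicated relativity is its adjusted loss ratio divided by the base class's.
relativities = {cls: lr / base_lr for cls, lr in adjusted_loss_ratios.items()}

for cls, rel in relativities.items():
    print(f"{cls}: indicated relativity = {rel:.2f}")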

Two-Way Analysis
Actuaries found that it was not adequate to perform a one-way analysis for every factor, because there can be correlations between different rating factors. The classic example is Personal Auto insurance. Consider two rating factors: age and gender. Experience has shown that young males have worse claims experience than young females, and that younger drivers have worse claims experience than older drivers (up to a certain age). However, as drivers age, the difference between male and female claims experience almost disappears.

If we look at these factors separately, then an older male driver would have a higher rate than a female driver of the same age. This is because we would compare the total of all males to the total of all females; since the younger drivers are included in those totals, the overall male experience looks worse, so every male would get a higher rate. Two-way analysis solves this problem when two factors are correlated.
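Here is a small sketch of that distortion, with made-up premiums and losses and the pandas library assumed. The one-way view by gender makes all males look worse, while the two-way view shows the mature cells are nearly identical.

import pandas as pd

# Hypothetical cell-level experience by age band and gender.
data = pd.DataFrame({
    "age":     ["young", "young", "mature", "mature"],
    "gender":  ["M", "F", "M", "F"],
    "premium": [100, 100, 400, 400],
    "losses":  [ 95,  80, 240, 238],
})

# One-way view: aggregate over age and compare genders only.
one_way = data.groupby("gender")[["premium", "losses"]].sum()
one_way["loss_ratio"] = one_way["losses"] / one_way["premium"]
print(one_way)   # males look worse overall because young drivers are mixed in

# Two-way view: keep both factors; mature males and females look almost the same.
two_way = data.groupby(["age", "gender"])[["premium", "losses"]].sum()
two_way["loss_ratio"] = two_way["losses"] / two_way["premium"]
print(two_way)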

Generalized Linear Modeling
GLM expands the analysis to multiple variables. There are several disadvantages to two-way analysis:
  • There may be three or more factors that are correlated. Age of vehicle, type of vehicle, etc. may also be correlated with age and gender.
  • You have to actually know the correlations before doing the analysis. It is common knowledge that gender and age are correlated, but there may be less obvious correlations that are not apparent in advance.
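As a rough sketch of what the GLM step can look like in practice, the following fits a Poisson frequency GLM to synthetic data using the statsmodels library. The field names, the distribution and link choice, and the simulated relationships are all assumptions for illustration, not a description of any particular company's model.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Synthetic policy-level data, purely for illustration.
rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "age_band": rng.choice(["young", "mature", "senior"], n),
    "gender": rng.choice(["M", "F"], n),
    "vehicle_type": rng.choice(["sedan", "sport", "suv"], n),
    "exposure": rng.uniform(0.5, 1.0, n),
})

# Simulate claim counts with a frequency that depends on age and gender jointly.
base_freq = 0.10
mult = np.where(df["age_band"] == "young", 1.8, 1.0) * np.where(
    (df["age_band"] == "young") & (df["gender"] == "M"), 1.3, 1.0)
df["claim_count"] = rng.poisson(base_freq * mult * df["exposure"])

# Fit a Poisson frequency GLM (log link); exposure enters as an offset.
model = smf.glm(
    "claim_count ~ C(age_band) + C(gender) + C(vehicle_type)",
    data=df,
    family=sm.families.Poisson(),
    offset=np.log(df["exposure"]),
).fit()

# Exponentiated coefficients are multiplicative rate relativities.
print(np.exp(model.params))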

Goals of Data Quality for Predictive Modeling
Predictive modeling is a statistical exercise, so to be able to create a model, data is essential. Unlike the aggregated data used in traditional pricing, where a few key fields are summarized, predictive modeling requires detailed data. This means the data needs to be at the policy and claim level. In addition, many more fields are analyzed in predictive modeling than in traditional pricing.

For a model to converge, the data must be complete. If there are missing values, the records containing them cannot be used in the modeling. One technique for resolving this is imputation, but the quality of imputations can vary widely.
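A quick check like the following sketch (hypothetical fields, pandas assumed) shows how many records a complete-case model would lose to missing values before any imputation is attempted.

import pandas as pd

# Hypothetical policy extract with some missing values.
policies = pd.DataFrame({
    "policy_id": [1, 2, 3, 4, 5],
    "territory": ["A", "B", None, "A", "B"],
    "vehicle_age": [3, None, 7, 2, None],
    "premium": [500, 620, 480, 710, 560],
})

# Missing counts per field, and how many records a complete-case model would keep.
print(policies.isna().sum())
complete = policies.dropna()
print(f"usable records: {len(complete)} of {len(policies)}")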

The quality of the data drives the quality of the predictions. If the data is inaccurate, incomplete, or inconsistent, then the predictions will also be inaccurate, incomplete, and inconsistent.

Data Quality Tests
I group the data quality tests into four main categories. These are the essentials; many times the results of these tests, and the drill-downs that follow, will lead to more tests.

Integration Tests
My experience is that the data required for predictive modeling will come from many different data sources, so the data must be integrated. In performing this integration, there is a chance that records could be duplicated or dropped unintentionally. Integration tests compare the destination data set with the source data. Ways to test this include comparing record counts and the totals of key numerical fields.
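A minimal integration test might look like the sketch below, comparing record counts and premium totals between a hypothetical source and destination and flagging keys that were duplicated or dropped. The field names are assumptions.

import pandas as pd

# Hypothetical source and integrated (destination) data sets.
source = pd.DataFrame({"policy_id": [1, 2, 3, 4],
                       "premium": [500.0, 620.0, 480.0, 710.0]})
destination = pd.DataFrame({"policy_id": [1, 2, 2, 3],
                            "premium": [500.0, 620.0, 620.0, 480.0]})

# Compare record counts and totals of key numerical fields.
print("record counts:", len(source), "vs", len(destination))
print("premium totals:", source["premium"].sum(), "vs", destination["premium"].sum())

# Flag keys that were duplicated or dropped during integration.
dupes = destination[destination.duplicated("policy_id", keep=False)]
dropped = source[~source["policy_id"].isin(destination["policy_id"])]
print("duplicated keys:\n", dupes)
print("dropped keys:\n", dropped)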

Reconciliation
Reconciliation is comparing the results of key fields with reports that are familiar to management, finance, actuarial and others. Any differences between the two should be explainable. Many times, these differences lead to improvements in the quality of the data. Other times, they may lead the modeler to filter certain data out of the data set before analysis.
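Here is a small reconciliation sketch, comparing premium by year in the modeling data set against totals from a report that finance already relies on. All figures and field names are made up for illustration.

import pandas as pd

# Premium by calendar year in the modeling data set (figures are made up).
modeling_data = pd.DataFrame({
    "year": [2007, 2007, 2008, 2008],
    "premium": [1_200_000.0, 800_000.0, 1_500_000.0, 900_000.0],
})

# Totals from a report that management and finance already trust (also made up).
report_totals = {2007: 2_010_000.0, 2008: 2_400_000.0}

by_year = modeling_data.groupby("year")["premium"].sum()
for year, reported in report_totals.items():
    difference = by_year[year] - reported
    print(f"{year}: data {by_year[year]:,.0f} vs report {reported:,.0f} "
          f"(difference {difference / reported:+.2%})")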

Data Profiling
The main tests we do for data profiling are looking at fill rates, frequency distributions and min/max values. The data quality analyst should look at the results of these tests and determine which areas to analyze next. Arkady Maydanchik (co-founder of Data Quality Group) says, “Data Profiling does not answer any questions - it helps us ask meaningful questions”.
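The profiling tests themselves are simple, as in the sketch below (hypothetical claims fields, pandas assumed); the value is in the questions the output prompts.

import pandas as pd

# Hypothetical claims extract to profile.
claims = pd.DataFrame({
    "claim_id": [10, 11, 12, 13],
    "state": ["AL", "AL", None, "GA"],
    "paid_loss": [1500.0, -20.0, 300.0, 12_000_000.0],
})

# Fill rate: the share of non-missing values in each field.
print(claims.notna().mean())

# Frequency distribution of a categorical field, including missing values.
print(claims["state"].value_counts(dropna=False))

# Min/max of a numeric field; negative or implausibly large values prompt questions.
print(claims["paid_loss"].min(), claims["paid_loss"].max())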

Business Rules
There are certain business rules that should hold for insurance data. The policy effective date should be before the date that the claim occurs (the loss date). Every claim should have a corresponding policy. Premiums and claims should not be negative.

In addition, some tests are done on values that are possible but not probable. For example, many Commercial Insurance policies carry a minimum premium. However, exceptions may be made under certain circumstances, so a policy violating the minimum premium rule is not necessarily wrong. It just signals an area for further analysis.
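The sketch below illustrates both kinds of checks on hypothetical data: hard rules whose violations indicate errors (claims without a policy, loss before the effective date, negative premium) and a soft rule (an assumed minimum premium) whose violations are flagged for review rather than rejected.

import pandas as pd

# Hypothetical policies and claims.
policies = pd.DataFrame({
    "policy_id": [1, 2, 3],
    "effective_date": pd.to_datetime(["2008-01-01", "2008-03-01", "2008-06-01"]),
    "premium": [1200.0, -50.0, 300.0],
})
claims = pd.DataFrame({
    "claim_id": [101, 102],
    "policy_id": [1, 9],  # policy 9 does not exist in the policy data
    "loss_date": pd.to_datetime(["2007-12-15", "2008-07-01"]),
})

merged = claims.merge(policies, on="policy_id", how="left")

# Hard rules: violations indicate data errors.
orphan_claims = merged[merged["effective_date"].isna()]
loss_before_effective = merged[merged["loss_date"] < merged["effective_date"]]
negative_premium = policies[policies["premium"] < 0]

# Soft rule: possible but improbable, so flag for review rather than reject.
MINIMUM_PREMIUM = 500.0  # assumed threshold, for illustration only
below_minimum = policies[(policies["premium"] >= 0) & (policies["premium"] < MINIMUM_PREMIUM)]

for name, frame in [("claims without a policy", orphan_claims),
                    ("loss before effective date", loss_before_effective),
                    ("negative premium", negative_premium),
                    ("below minimum premium (review)", below_minimum)]:
    print(f"{name}: {len(frame)} record(s)")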

Data Scrubbing
If there are data quality issues, the data will need to be cleaned to prepare it for modeling. There are several techniques that we have used to clean the data.

Normalization – Adjusting a group of fields so that they add to a given value. This is usually done when the fields should add to 100% but they add to some other value. The data is normalized by multiplying each value by the ratio of 100% to the actual total (equivalently, dividing each value by the total and multiplying by 100%).
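A small sketch of the calculation, with made-up class-of-business splits that add to 110% instead of 100%:

# Normalization sketch: percentages that should sum to 100%.
splits = {"property": 45.0, "liability": 40.0, "auto": 25.0}  # sums to 110%

total = sum(splits.values())
normalized = {k: v * 100.0 / total for k, v in splits.items()}

print(normalized)                 # each value scaled by 100 / 110
print(sum(normalized.values()))   # now 100 (up to rounding)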

Imputations – This is the filling in of missing data. Different techniques we have used are carry forward, carry backward, default values and expert opinion. Carry forward and carry backward use values from earlier or later records for the same risk, relying on the knowledge that certain values are unlikely to change over time. A default value is used when a missing value can reasonably be assumed to imply a given answer. For example, if an application question gives a discount for answering “Yes”, then you could infer that anyone who didn’t answer would have chosen “No”. Expert opinion involves using the knowledge of subject-matter experts to infer the missing data.
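The sketch below shows carry forward/backward and a default value on a hypothetical policy history, using pandas. Which technique is appropriate depends on the field; this is illustration only.

import pandas as pd

# Hypothetical annual snapshots of one policy with some missing values.
history = pd.DataFrame({
    "policy_id": [7, 7, 7],
    "year": [2007, 2008, 2009],
    "construction": ["frame", None, "frame"],
    "sprinkler_discount": [None, "Y", None],
})

# Carry forward / carry backward within a policy, assuming slow-changing characteristics.
history["construction"] = (
    history.sort_values("year")
           .groupby("policy_id")["construction"]
           .transform(lambda s: s.ffill().bfill())
)

# Default value: an unanswered discount question is assumed to mean "No".
history["sprinkler_discount"] = history["sprinkler_discount"].fillna("N")

print(history)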

Translations – This involves changing the value of the data to be consistent with other values. For example, you could have some records with a 0 and 1 and others with a Y and N. The 0 and 1 are likely to be equivalent to N and Y respectively.
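A translation can be as simple as a lookup table, as in this sketch:

# Translation sketch: make indicator codes consistent across sources.
values = ["0", "1", "Y", "N", "1"]

translate = {"0": "N", "1": "Y", "Y": "Y", "N": "N"}
consistent = [translate[v] for v in values]

print(consistent)  # ['N', 'Y', 'Y', 'N', 'Y']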

Cleaning – This is the correction of erroneous data. There are times when we find data that is simply wrong. If an authoritative source is available, you can replace the erroneous value with the correct one.

Mapping – Mapping is similar to translation, but usually involves creating a new field that groups the values together. For example, one data source could have a code of AA while another data set has a code of 123AL. These codes could both mean “Alabama Program”. Without the mapping, there is no way to match these records.
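Here is a mapping sketch with made-up codes: a new, common field is created so records from the two sources can be matched, rather than overwriting the original codes.

import pandas as pd

# Two sources use different codes for the same program (codes are made up).
source_a = pd.DataFrame({"policy_id": [1, 2], "program_code": ["AA", "BB"]})
source_b = pd.DataFrame({"claim_id": [9], "program_code": ["123AL"]})

# Mapping table: create a new, common field instead of changing the originals.
program_map = {"AA": "Alabama Program", "123AL": "Alabama Program", "BB": "Georgia Program"}

source_a["program"] = source_a["program_code"].map(program_map)
source_b["program"] = source_b["program_code"].map(program_map)

# Records from both sources can now be matched on the common "program" field.
print(source_a)
print(source_b)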

Other Considerations
Although the technical aspects of data quality described above are important to any predictive modeling project, the project will fail (or be delayed) if there is not clear documentation of the processes and clear communication between all parties.
