In the world of predictive analytics, there's a constant tug-of-war between data richness and model efficiency. We crave vast datasets, teeming with features that promise deeper insights. Yet, all too often, we find ourselves wrestling with "the curse of dimensionality," where extra information actually hinders our ability to build effective, reliable models. Oracle Data Mining (ODM) offers a potent solution to this challenge through its Generalized Linear Models (GLMs) coupled with scalable feature selection. In this article, I'll demonstrate how to build GLMs and let ODM select better features for you.
This article serves as your field guide to navigating the complex landscape of feature selection within Oracle Data Mining's GLM framework. We'll journey together through the following crucial topics:
Understanding the problem and addressing the curse of dimensionality;
Achieving scalability with the PL/SQL interface that ODM provides;
Analyzing models and their results in the real world;
Implementing the code, including error handling;
Deploying the finished model.
The Feature Flood: When Abundance Becomes a Burden
The modern data landscape is often characterized by an embarrassment of riches, but abundance itself can be a problem. Every additional feature adds noise and complexity, enlarging the model and increasing the risk that it performs badly.
More features also mean longer build times. This matters especially when the data is not clean and the work must be performed inside the database. Lastly, and most importantly, relationships among many features can become so complex that the resulting model fits the noise rather than the signal.
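Before reaching for feature selection, it helps to know how wide the training table actually is. A minimal sketch using the data dictionary, assuming a hypothetical my_training_data table with a customer_id key and a target_variable label:

```sql
-- Count candidate feature columns (everything except the key and the target)
SELECT COUNT(*) AS feature_count
FROM   USER_TAB_COLUMNS
WHERE  table_name = 'MY_TRAINING_DATA'
AND    column_name NOT IN ('CUSTOMER_ID', 'TARGET_VARIABLE');
```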
Taming the Chaos: GLM Feature Selection to the Rescue
Here is where Oracle's features shine. By using ODM's robust, built-in feature selection, you let the algorithm identify the most informative attributes and end up with a model that is both accurate and efficient. The key steps to get there are straightforward.
First, clean the data so obvious errors are gone, and decide how to handle the ones that remain. Then run the model under several different configurations to learn which settings suit your data best, and carry what you learn into the final build.
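A first cleaning pass can be as simple as counting missing values column by column; a sketch, assuming hypothetical age and income columns in my_training_data:

```sql
-- COUNT(col) skips NULLs, so the difference is the number of missing values
SELECT COUNT(*) - COUNT(age)    AS missing_age,
       COUNT(*) - COUNT(income) AS missing_income
FROM   my_training_data;
```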
Next, verify your connection and session settings, and confirm that the Data Mining option is installed and that your account holds the privileges it needs.
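To confirm the mining API is reachable from your session, the data dictionary views are the quickest check; a sketch listing the models your account already owns:

```sql
-- If this runs, the Data Mining views are in place for your account
SELECT model_name, mining_function, algorithm, creation_date
FROM   USER_MINING_MODELS
ORDER  BY creation_date DESC;
```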
Then create your settings tables and write the PL/SQL procedures that build the models. This is where the models are actually implemented, and where quality is won or lost.
With the setup complete, build the models, test them, and iterate. This is where the data mining itself begins.
Putting Feature Selection to Work: A Hands-On Example
Let's walk through a practical example of building a GLM with feature selection. We'll use PL/SQL to define the model settings, build the model, and examine the selected features. First, create a settings table and populate it; the model build then reads its configuration from that table.
CREATE TABLE glm_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(4000)
);

BEGIN
  -- Enable automatic data preparation
  INSERT INTO glm_settings VALUES
    (DBMS_DATA_MINING.PREP_AUTO, DBMS_DATA_MINING.PREP_AUTO_ON);
  -- Turn on GLM feature selection
  INSERT INTO glm_settings VALUES
    (DBMS_DATA_MINING.GLMS_FTR_SELECTION, DBMS_DATA_MINING.GLMS_FTR_SELECTION_ENABLE);
  -- Choose features by the Risk Inflation Criterion (RIC)
  INSERT INTO glm_settings VALUES
    (DBMS_DATA_MINING.GLMS_FTR_SEL_CRIT, DBMS_DATA_MINING.GLMS_FTR_SEL_CRIT_RIC);
  -- Cap the model at 50 features
  INSERT INTO glm_settings VALUES
    (DBMS_DATA_MINING.GLMS_MAX_FEATURES, '50');
  -- Prune statistically insignificant features from the final model
  INSERT INTO glm_settings VALUES
    (DBMS_DATA_MINING.GLMS_PRUNE_MODEL, DBMS_DATA_MINING.GLMS_PRUNE_MODEL_ENABLE);
  COMMIT;
END;
/
BEGIN
DBMS_DATA_MINING.CREATE_MODEL(
model_name => 'my_glm_model',
mining_function => DBMS_DATA_MINING.REGRESSION,
data_table_name => 'my_training_data',
case_id_column_name => 'customer_id',
target_column_name => 'target_variable',
settings_table_name => 'glm_settings'
);
END;
/
SELECT PREDICTION(my_glm_model USING *) AS predicted_value FROM test;
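With the model built, you can see exactly which features survived selection and how much each contributes. A sketch using the GLM detail function from DBMS_DATA_MINING (column names come from the DM_GLM_Coeff_Set type; verify against your release, as newer versions expose the same data through model detail views):

```sql
-- List the attributes the model kept, strongest coefficients first
SELECT attribute_name, coefficient, p_value
FROM   TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_GLM('my_glm_model'))
ORDER  BY ABS(coefficient) DESC;
```

Attributes that were pruned, or never selected in the first place, simply do not appear in this list.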
Navigating Errors and Building Success
During model creation, a successful build ends with a confirmation along the lines of:
Model "MY_GLM_MODEL" completed.
This means your model was built successfully.
When you make a mistake, the output looks quite different. For example, a bad entry in the settings table produces:
ORA-20000: Mining: Invalid setting name
ORA-06512: at "SYS.DBMS_SYS_ERROR", line 79
ORA-06512: at "SYS.DBMS_DATA_MINING", line 2921
ORA-06512: at line 2
When things come crashing down, look into these likely causes:
You passed an invalid setting name or value in the settings table.
Your connection or setup is wrong, for example missing privileges on DBMS_DATA_MINING.
An earlier test run did not complete correctly, leaving stale objects behind.
Also, if a large GLM build is spread across parallel query servers and one of them fails, you will see: ORA-12801: error signaled in parallel query server P000. Check the error lines that follow it for the real underlying cause.
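A quick way to rule out setting problems is to check which settings a built model actually used; defaults fill in anything you did not specify. A sketch against the standard dictionary view:

```sql
-- Show the effective settings of the built model
SELECT setting_name, setting_value
FROM   USER_MINING_MODEL_SETTINGS
WHERE  model_name = 'MY_GLM_MODEL'
ORDER  BY setting_name;
```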
By using these techniques you are prepared for the most common failure modes and can build a higher-quality model.
Additional Resources / References
If you have read this far, you must be ready to jump into the power of data mining! Share your results and explore new possibilities. What are you hoping to find?
This article is meant for educational and demonstration purposes and reflects one writer's experience; evaluate the best methods for your own use case.
Remember to consult the official documentation and experiment with various algorithms and settings to achieve the best results.