Building Regression Models Directly in Your Oracle Database | Qatabase

Building Regression Models Directly in Your Oracle Database

Created by Praveen Polu in Oracle Database 8 Feb 2025
Share


In today’s data-driven world, predictive analytics is essential. Businesses are looking for ways to see the future and gain competitive advantages. While standard frameworks are useful for this end, they are very distant from where data stays. This is where Oracle Machine Learning for SQL (OML4SQL) shines as a powerful tool for predictive modeling within an Oracle database. OML4SQL is a great alternative to the movement of data, and today you will learn how to model data and start predicting.

In this comprehensive guide, we'll explore the practicalities of building and deploying regression models directly in Oracle. We'll focus on OML4SQL's capabilities for regression, using the Boston housing dataset as a vehicle to demonstrate key concepts.

Why Embrace In-Database Machine Learning?

Before diving into the steps of the how to, let's talk about some benefits:

  • No Data Replication: With all data contained in a single database, you are guaranteed that no extra copies are available.

  • Increased Security Since there are no copies of the data, the company knows that the data isn't in unmonitored hands.

  • Database Power Leveraging the power of the Oracle database, OML4SQL allows processing terabytes of data with ease.

Getting Started: The Boston Housing Dataset and Accessing Data

To ensure consistency, you need to follow the same steps as this post, by:

  1. First download the dataset from its repository at "https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/data/boston_house_prices.csv".

  2. You can use Oracle Autonomous Database or a local database to develop and test the code.

  3. Finally, copy data into the table.


The dataset contains the following attributes, as extracted from the book "Oracle® Database Machine Learning for SQL Use Cases":

  • CRIM: Per capita crime rate by town.

  • ZN: Proportion of residential land zoned for lots over 25,000 sq.ft.

  • INDUS: Proportion of non-retail business acres per town.

  • CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

  • NOX: Nitric oxides concentration (parts per 10 million).

  • RM: Average number of rooms per dwelling.

  • AGE: Proportion of owner-occupied units built prior to 1940.

  • DIS: Weighted distances to five Boston employment centers.

  • RAD: Index of accessibility to radial highways.

  • TAX: Full-value property-tax rate per $10,000.

  • PTRATIO: Pupil-teacher ratio by town.

  • LSTAT: Percentage lower status of the population.

  • MEDV: Median value of owner-occupied homes in $1000s (Target)

  • Create a table called boston_housing using the following script:




To load the data, we use a OML notebook:



Now we can break this code and talk about errors:

It is important to create a credential with your user.

The file url can be wrong, it can have network errors.

You can make some debug code to see the exact line with errors to guide any debug with DBMS output

3. Building the GLM Regression Model

Leveraging OML4SQL, here's the SQL code to build the model using the DBMS_DATA_MINING.CREATE_MODEL2procedure:


More details about this code:

  • Handle errors: Added an EXCEPTION block for basic error handling. More refined error handling could check for specific OML errors.

    **You can select algorithm settings:

    * In the script, Automatic preparation is enabled.

    * Name of the algorithm is the the GLM.

    * Name of diagnostics table

4. Evaluating the Model

You need to run the following script:




5. Scoring for Predictions


    


6. Enhanced Model Deployment:

Now about deployment, you need to keep in mind some topics:

  • Using Scoring Views

  • Creating PL/SQL Functions

  • REST API Integration: OML4SQL integrates with Oracle REST Data Services (ORDS), allowing you to expose your models as REST APIs for consumption by external applications.

7. Algorithm Selection & Settings

  • The GLM algorithm is just one of the multiple algorithms, you can use many models for that

Oracle Machine Learning empowers data professionals to build and deploy predictive models directly within the database, but remember that this is a more focused approach for the sake of the business, since no new integration is needed. The data is also better, due to the same reasons and the code it's easy, since you only use SQL and the model will live where it is more secure and scalable.


  • Disclaimer: This article is meant to be for educational and demonstration purposes and reflects one writer's experience with it, you must seek the best methods for you.

Important: Remember to consult the official Oracle documentation and experiment with various algorithms and settings to achieve the best results for your specific use case.

Share

Share this post with others