Tech2Tech
Tech Knowledge
Inside Job
In-database analytics improves business performance, efficiency and more.
by Seth Grimes
In the face of ever-growing data volumes and complexity, in-database analytics is the enterprise data warehouse (EDW) response to support high-throughput, operational analytics for such needs as fraud detection, credit scoring, risk management and online commerce.
Analytical computing often involves large numbers of data records and fields, carried out repeatedly. In-database execution avoids data movement and takes advantage of the power of modern, parallel database architectures. Users see dramatically improved performance and increased processing throughput.
While the most visible advantage of database-embedded computations is higher performance, organizations that take this approach also gain greater manageability and reliability as byproducts of centralizing analytical processing. Further, such centralization facilitates version control and reduces duplication of data. It also extends the database’s management, security and auditing infrastructure to analytical data and functions.
How It Works
Execution of complex or compute-intensive calculations is shifted from a desktop program or application server to the database management system. The computations are coded in a programming language, either by the end user or a third-party provider, and made available through a database query language such as SQL. Possibilities go far beyond computations that can be written directly in industry-standard versions of SQL.
Database-embedded functions are available to any user or program with access permission. They are invoked by SQL queries or through any front-end application or user interface that accesses the database. End users may not even know their analytical software is running calculations in the database, except indirectly through improved performance.
Must-Have Capability
With the growing popularity of analytical database systems, in-database analytics has become a must-have capability. Not every system, however, will support it. A database that can be extended with custom computations is required. Although many vendors can extend their own systems, you do not want to be limited to vendor-created or vendor-sanctioned extensions. For generalized, robust in-database analytics, your best option is a framework that supports database management system (DBMS) extensions created by specialists, whether from your own company or third-party providers.
Whether they are user-defined functions or specialized routines such as Teradata input modules (INMODs), extensions become part of the database and part of the EDW. The DBMS automatically allows for these extensions within the scope of utilization, performance monitoring, user management and auditing tools. The DBMS will also cover them within the query execution planning, scheduling and optimization facilities.
If you’re already using analytics tools and packages such as those from SAS, you’re well positioned to take advantage of new capabilities. Such vendors have been working with database companies, including Teradata, to move analytical processing from the application layer into the database. SAS has rewritten many basic procedures and functions, and complete applications, to run inside the Teradata Database.
These database-embedded functions support data preparation, modeling and analysis tasks and are used automatically by some SAS user-facing applications. While SAS has created connectors for many database management systems, they do not cause SAS procedures or functions to be executed in the database.
Significant Gains
In-database analytics technology builds on end-user programming facilities that first emerged almost 20 years ago. In the mid-1990s, Teradata offered one of the first commercial systems to support end-user extensions for in-database mining. Object-relational database systems from other companies also provided extensions, but they did not focus on analytical processing or provide for analytics parallelization.
Parallelization decomposes a large workload into smaller tasks that are carried out simultaneously. Running a single task in multiple locations, with distinct data subsets, speeds up the overall execution of large tasks and workloads. Data-related tasks that can be parallelized include database loading, index building, queries over large data sets and many analytical computations. The Teradata system is an example of a parallel database. Complemented by Teradata Parallel Transporter, it provides scalable, high-speed, parallel data extraction, loading and updating.
Performance gains due to eliminating data movement will be in proportion to data volume. The possibility of significant gains exists where calculations are concerned. That’s because modern databases have powerful query optimization and caching mechanisms, as well as the ability to parallelize operations not typically available to, or present with, analytics options other than in-database.
As a rule of thumb, expect improvements of 10 times to 100 times for big data computations that can be parallelized. For example, a Teradata customer had spent 312 hours to run five models against 250 million rows, extracting the data onto an analytic server. After the system was converted into SQL models leveraging in-database mining techniques, all five analyses ran in just 12 hours—a 97% savings in time.
Developing Database-Embedded Code
Most modern software development is conducted using a standard integrated development environment (IDE) that encapsulates a continuous-integration methodology. However, it is almost always possible to develop and deploy code the old-fashioned way—using a text editor and text commands.
Teradata and other vendors support database-application development through the Eclipse open-source IDE, an open development platform composed of extensible frameworks, tools and runtimes for building, deploying and managing software across the life cycle. Database vendors usually provide a free, downloadable Eclipse plug-in that uses the Eclipse plug-in framework. Teradata offers a free plug-in that can be accessed from the Teradata Tool Kit on Teradata.com.
Expanding Possibilities
The value of in-database analytics is in addressing diverse business challenges. However, the technology can be applied wherever there is a large volume of data and a critical need to minimize data movement. It can also be used to reduce processing and response times, and when a desire to move toward managed enterprise analytics exists. The move in-database is not an all-or-nothing proposition. Organizations can exploit the technology for individual applications or functions based on business priorities and on the availability of database-optimized versions of software from analytics providers. Migration can be gradual and should provide measurable performance and analytics-manageability return on investment (ROI) as well as enhanced analytics capabilities: the possibility of doing more with your data.
Founder of the Alta Plana Corp., Seth Grimes is an IT analyst and analytics strategy consultant with more than 25 years of experience.