Efficiently Update Data from OLTP to OLAP
OLTP to OLAP: Acronym Time!
Online Transaction Processing, or OLTP, serves data to clients for line-of-business applications, enabling core business functionality. Online Analytical Processing, or OLAP, consists of analyzing OLTP data with the intent of providing reporting and business intelligence to management. One of our most important tasks is providing highly up-to-date data from the OLTP system into the OLAP system. Fresh data in the OLAP system allows business management to react quickly, whereas stale data in the OLAP system prevents quick, accurate decision making.
The last thing you want to do to a large, busy, OLTP system is to start running reporting against it. Even though this might provide business management with the freshest data possible, it will have the unfortunate side-effect of slowing down customer interaction. That might even have a negative effect on core business functionality, such as sales via a website, etc.
Get to the point!
This post presents a method that allows you to efficiently copy rows from OLTP tables into OLAP tables without needing to compare rows to see what changes have been made.
The code below creates a set of test tables; dbo.OLTP
which will be the source data, and dbo.OLAP
which we’ll efficiently copy fresh data into.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
IF OBJECT_ID(N'dbo.OLTP', N'U') IS NOT NULL DROP TABLE dbo.OLTP; IF OBJECT_ID(N'dbo.OLAP', N'U') IS NOT NULL DROP TABLE dbo.OLAP; IF OBJECT_ID(N'dbo.OLAP_Updates', N'U') IS NOT NULL DROP TABLE dbo.OLAP_Updates; GO CREATE TABLE dbo.OLTP ( Source_id int NOT NULL PRIMARY KEY CLUSTERED , oltp_data char(100) NOT NULL , rv rowversion NOT NULL ) ON [PRIMARY] WITH (DATA_COMPRESSION = PAGE); CREATE UNIQUE NONCLUSTERED INDEX OLTP_rv ON dbo.OLTP(rv) WITH ( DATA_COMPRESSION = PAGE , FILLFACTOR = 100 ); |
Note, the dbo.OLTP
table has a column named rv
, that is of type rowversion
1. This column is solely used to track changes made to rows, and allows us to efficiently seek to new rows via the unique, non-clustered, index named OLTP_rv
.
This code inserts 50,000,000 rows into the OLTP table:
1 2 3 4 5 6 |
INSERT INTO dbo.OLTP WITH (TABLOCKX) (Source_id, oltp_data) SELECT TOP(50000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) , oltp_data = REPLICATE('A', 100) FROM sys.syscolumns sc CROSS JOIN sys.syscolumns sc2 CROSS JOIN sys.syscolumns sc3; |
Here, I create a couple of tables to use for analytics, or OLAP:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
CREATE TABLE dbo.OLAP ( Source_id int NOT NULL PRIMARY KEY CLUSTERED , oltp_data char(100) NOT NULL ) ON [PRIMARY] WITH (DATA_COMPRESSION = PAGE); CREATE TABLE dbo.OLAP_Updates ( SourceSchema sysname NOT NULL , SourceTableName sysname NOT NULL , DestSchema sysname NOT NULL , DestTableName sysname NOT NULL , max_rv_captured binary(8) NOT NULL , CONSTRAINT OLA_Updates_pk PRIMARY KEY CLUSTERED (SourceSchema, SourceTableName) ) ON [PRIMARY] WITH (DATA_COMPRESSION = PAGE); |
The 2nd table, dbo.OLAP_Updates
, supports multiple source and destination tables, one per row, and is used to store the maximum row version we’ve migrated from the source OLTP
table into the destination OLAP
table.
These statements show the state of the OLTP and OLAP tables:
1 2 3 4 5 6 7 8 9 |
SELECT [OLTP] = OLTP.oltp_data , [Rows] = COUNT(1) FROM dbo.OLTP GROUP BY OLTP.oltp_data SELECT [OLAP] = OLAP.oltp_data , [Rows] = COUNT(1) FROM dbo.OLAP GROUP BY OLAP.oltp_data; |
Initially, the output looks like this, since we haven’t copied any rows into the OLAP table:
╔════════════════╦══════════╗ ║ OLTP ║ Rows ║ ╠════════════════╬══════════╣ ║ AAAAAAAAAAA... ║ 50000000 ║ ╚════════════════╩══════════╝ ╔════════════════╦══════════╗ ║ OLAP ║ Rows ║ ╠════════════════╬══════════╣ ║ ║ ║ ╚════════════════╩══════════╝
The Fun Bit!
This code migrates new rows, as well as modified rows, in an efficient manner, using the MIN_ACTIVE_ROWVERSION()
2 function:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 |
SET NOCOUNT ON; DECLARE @MaxRvCaptured binary(8); IF OBJECT_ID(N'tempdb..#cci', N'U') IS NOT NULL DROP TABLE #cci; IF OBJECT_ID(N'tempdb..#rows', N'U') IS NOT NULL DROP TABLE #rows; IF OBJECT_ID(N'tempdb..#olap_updates', N'U') IS NOT NULL DROP TABLE #olap_updates; -- #olap_udpates is used to facilitate an efficient UPSERT operation CREATE TABLE #olap_updates ( indicator char(1) NOT NULL ); -- #rows stores primary keys for rows that need to be copied into the OLAP table CREATE TABLE #rows ( Source_id int NOT NULL PRIMARY KEY CLUSTERED ); -- #cci is a dummy table used to allow BATCH mode for certain operations CREATE TABLE #cci ( i int , INDEX cci_cx CLUSTERED COLUMNSTORE ); BEGIN TRANSACTION; BEGIN TRY --get the last used rowversion from the previous time this code ran SET @MaxRvCaptured = ( SELECT o.max_rv_captured FROM dbo.OLAP_Updates o WHERE o.SourceSchema = N'dbo' AND o.SourceTableName = N'OLTP' AND o.DestSchema = N'dbo' AND o.DestTableName = N'OLAP' ); --initialize @MaxRvCaptured if this is the first time this code has ran. IF @MaxRvCaptured IS NULL BEGIN SET @MaxRvCaptured = CONVERT(binary(8), 0); END --capture the primary keys for modified rows from the source table into a #temp table INSERT INTO #rows WITH (TABLOCKX) (Source_id) SELECT OLTP.Source_id FROM dbo.OLTP WITH (INDEX (OLTP_rv)) LEFT JOIN #cci ON (1 = 0) --this allows batch mode on row-store tables in 2014+ WHERE OLTP.rv >= @MaxRvCaptured; --delete rows from the destination table that are going to be updated DELETE FROM dbo.OLAP FROM dbo.OLAP INNER JOIN #rows r ON OLAP.Source_id = r.Source_id LEFT JOIN #cci ON (1 = 0); --insert new and modified rows from the source into the destination INSERT INTO dbo.OLAP (Source_id, oltp_data) SELECT OLTP.Source_id, OLTP.oltp_data FROM dbo.OLTP INNER JOIN #rows r ON OLTP.Source_id = r.Source_id LEFT JOIN #cci ON (1 = 0); --capture the MIN_ACTIVE_ROWVERSION() UPDATE dbo.OLAP_Updates SET max_rv_captured = MIN_ACTIVE_ROWVERSION() OUTPUT 'Y' INTO #olap_updates (indicator) WHERE SourceSchema = N'dbo' AND SourceTableName = N'OLTP' AND DestSchema = N'dbo' AND DestTableName = N'OLAP'; --if OLAP_Updates doesn't contain a row for the set of tables we're dealing with, insert one. INSERT INTO dbo.OLAP_Updates (SourceSchema, SourceTableName, DestSchema, DestTableName, max_rv_captured) SELECT * FROM (VALUES (N'dbo', N'OLTP', N'dbo', N'OLAP', MIN_ACTIVE_ROWVERSION())) v(SourceSchema, SourceTableName, DestSchema, DestTableName, max_rv_captured) WHERE NOT EXISTS (SELECT 1 FROM #olap_updates); COMMIT TRANSACTION; END TRY BEGIN CATCH IF @@TRANCOUNT > 0 BEGIN ROLLBACK TRANSACTION END PRINT ERROR_MESSAGE() END CATCH |
The MIN_ACTIVE_ROWVERSION()
function captures the minimum rowversion value currently active in the database. Capturing that value ensures we’ll never miss rows that have been updated in the OLTP data.
After the first run of the migrate code, we see the OLAP table contains all the rows from the OLTP table; the initial run will take a fair amount of resources if your source table has lots of pre-existing data:
╔════════════════╦══════════╗ ║ OLTP ║ Rows ║ ╠════════════════╬══════════╣ ║ AAAAAAAAAAA... ║ 50000000 ║ ╚════════════════╩══════════╝ ╔════════════════╦══════════╗ ║ OLAP ║ Rows ║ ╠════════════════╬══════════╣ ║ AAAAAAAAAAA... ║ 50000000 ║ ╚════════════════╩══════════╝
The OLTP table implements a rowversion
column that is automatically updated whenever a row is updated or inserted. The rowversion number is unique at the database level, and increments monotonically for all transactions that take place within the context of that database. The dbo.OLTP_Updates
table is used to store the minimum row version available inside the transaction used to copy data from the OLTP table into the OLAP table. Each time this code runs it captures incremental changes. This is far more efficient than comparing all the rows in both tables using a hashing function since this method doesn’t require reading any data other than the source data that is either new, or has changed.
Note the code above uses the temporary clustered columnstore index, #cci
, to prompt the query processor to use batch mode for certain operations; this is only available in SQL Server 2014+.
Results Time!
In the code below, we’re updating a portion of the 50 million rows, simulating the OLTP system making changes to data as a result of client interaction. 1 in every 1000 rows are updated to B...
, 1 in every 400 rows are updated to ‘C…’, then 1 in every 201 rows are updated to D...
. The results shown reflect running the rowversion-based migration code after each update.
1 2 3 |
UPDATE dbo.OLTP SET oltp_data = REPLICATE('B', 100) WHERE Source_id % 1000 = 0; |
╔════════════════╦══════════╗ ║ OLTP ║ Rows ║ ╠════════════════╬══════════╣ ║ AAAAAAAAAAA... ║ 49950000 ║ ║ BBBBBBBBBBB... ║ 50000 ║ ╚════════════════╩══════════╝ ╔════════════════╦══════════╗ ║ OLAP ║ Rows ║ ╠════════════════╬══════════╣ ║ AAAAAAAAAAA... ║ 49950000 ║ ║ BBBBBBBBBBB... ║ 50000 ║ ╚════════════════╩══════════╝
1 2 3 |
UPDATE dbo.OLTP SET oltp_data = REPLICATE('C', 100) WHERE Source_id % 400 = 0; |
╔════════════════╦══════════╗ ║ OLTP ║ Rows ║ ╠════════════════╬══════════╣ ║ AAAAAAAAAAA... ║ 49850000 ║ ║ BBBBBBBBBBB... ║ 25000 ║ ║ CCCCCCCCCCC... ║ 125000 ║ ╚════════════════╩══════════╝ ╔════════════════╦══════════╗ ║ OLAP ║ Rows ║ ╠════════════════╬══════════╣ ║ AAAAAAAAAAA... ║ 49850000 ║ ║ BBBBBBBBBBB... ║ 25000 ║ ║ CCCCCCCCCCC... ║ 125000 ║ ╚════════════════╩══════════╝
1 2 3 |
UPDATE dbo.OLTP SET oltp_data = REPLICATE('D', 100) WHERE Source_id % 201 = 0; |
╔════════════════╦══════════╗ ║ OLTP ║ Rows ║ ╠════════════════╬══════════╣ ║ AAAAAAAAAAA... ║ 49601989 ║ ║ BBBBBBBBBBB... ║ 24876 ║ ║ CCCCCCCCCCC... ║ 124379 ║ ║ DDDDDDDDDDD... ║ 248756 ║ ╚════════════════╩══════════╝ ╔════════════════╦══════════╗ ║ OLAP ║ Rows ║ ╠════════════════╬══════════╣ ║ AAAAAAAAAAA... ║ 49601989 ║ ║ BBBBBBBBBBB... ║ 24876 ║ ║ CCCCCCCCCCC... ║ 124379 ║ ║ DDDDDDDDDDD... ║ 248756 ║ ╚════════════════╩══════════╝
Consider implementing an ON DELETE
trigger on the OLTP table that captures the pertinent primary key values into a third table, if your OLTP system deletes rows. Something like this should work:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
IF OBJECT_ID(N'dbo.OLTP_Deletes', N'U') IS NOT NULL DROP TABLE dbo.OLTP_Deletes; CREATE TABLE dbo.OLTP_Deletes ( Source_id int NOT NULL CONSTRAINT OLTP_Deletes_pk PRIMARY KEY CLUSTERED ); GO CREATE TRIGGER OLTP_delete_trigger ON dbo.OLTP AFTER DELETE AS BEGIN SET NOCOUNT ON; INSERT INTO dbo.OLTP_Deletes (Source_id) SELECT Source_id FROM deleted d; END; GO |
This code will remove the deleted rows from the dbo.OLAP
table that have been deleted in the dbo.OLTP
table:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
BEGIN TRANSACTION BEGIN TRY DELETE FROM dbo.OLAP FROM dbo.OLAP o INNER JOIN dbo.OLTP_Deletes d ON o.Source_id = d.Source_id LEFT JOIN #cci ON (1=0); TRUNCATE TABLE dbo.OLTP_Deletes; COMMIT TRANSACTION; END TRY BEGIN CATCH IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION; PRINT ERROR_MESSAGE(); END CATCH |
Gotchas!
Ensure there is enough CPU and memory in the SQL Server instance to support both the OLTP and the OLAP load. Running both loads in the same SQL Server instance can be difficult. If resource contention on the OLTP system is a primary concern, you may want to use a linked server on the OLAP system to separate resources into two separate SQL Servers. Run the migrate code above on the OLAP system, while monitoring for resource consumption on the OLTP system so you have an understanding of the potential impact.
This method is presented as an efficient alternative to copying rows based on the hash of columns in the source table against a hash of the columns in the destination table. For extreme systems, it may be required to use SSIS or some other bulk-load method of migrating the data from the OLTP system into a staging system to perform the actual rowversion comparison work. If you do that, you’ll need to ensure the destination table in the staging system uses a binary(8)
data type for the rv
column to ensure the rowversion values from the OLTP system are preserved in staging.
Wrap-up!
Instead of comparing rows column-by-column, use the rowversion
data type to efficiently identify rows that have been modified since the last time you looked at them.
Let me know if you have any comments or questions about this code!
1rowversion T-SQL data type
2MIN_ACTIVE_ROWVERSION() T-SQL function
Meagan Longoria has a great post about not truncating your dimension and fact tables on her blog at https://datasavvy.me/2019/07/25/why-we-dont-truncate-dimensions-and-facts-during-a-data-load