Niraj Bhatt – Architect's Blog

Ruminations on .NET, Architecture & Design

Category Archives: SQL SERVER

Pivot, PivotTable and PowerPivot

OK, I am not a data guy. Or let me say I am not an Excel (MS-EXCEL) guy. I have always believed that if you are good at PowerPoint (along with Visio) you are probably an architect, and if you are good at Excel you are most likely a manager 🙂 . But not knowing Excel well has always given me a feeling that something is missing. After all, it's the same data, but the way some people present it, they just make it look so good. While it takes time to develop those skills, understanding Pivot and related features is usually a good step in that direction.

Excel primarily is a spreadsheet tool into which you can dump some data. At times, though, data is not everything. For example, your retail transactional system captures sales across product lines, but if a manager needs to predict his stock for the next fiscal year based on historic sales, he has to make some sense out of that data. That's where you have to pivot the data. The word 'Pivot', as Wikipedia generally describes it, is the center point of any rotational system. You have seen one in action while helping your kids on a seesaw. A pivot table, on the other hand, is a data summarization tool found in spreadsheet software like Excel. If you have to bridge this tool and the word 'Pivot', you can think of it as bending your data to a specific viewpoint using functions like sort, total, average, count, etc. Let's see it with the help of a trivial example. Consider the sample data below in Excel. To bend it to a monthly stock view, you can click on the Insert tab and insert a Pivot Table. This gives you an empty pivot table and a PivotTable field list. Drag and drop your fields to specific areas to bend your data to a viewpoint (N.B. you can have multiple fields for each area and control them via expand / collapse buttons, and all your charting knowledge works as is with Pivot tables).

PowerPivot, as the name suggests, elevates the Pivot table feature to the next level, making it a full-fledged self-service BI tool. An Excel add-in, PowerPivot connects to almost any external data source, including SQL Server Analysis Services. Once fetched, the data stays inside the workbook, and thanks to excellent compression techniques the file size on disk stays relatively small. You can share your PowerPivot workbooks via SharePoint and configure refresh cycles for your workbook to ensure your team always makes their decisions based on recent data. The experience of pivoting, though, remains the same as shown earlier.

Hope that helps 🙂 .

Setting Auto Commit Off SQL Server

In this post I will share a useful tip for SQL Server. While working with a live database, turning auto commit off / implicit transactions on can be quite handy, especially for mitigating human errors. This is often the case when you make an unintentional change to your database (I have seen this even in tightly controlled environments) and your only options are to restore a database backup or roll back your changes. Rollback, being the relatively easy option, can be put in place via implicit transactions (though by no means would it mitigate all human errors). Implicit transactions can be turned on via a simple T-SQL statement, or across all sessions by modifying SQL Server Management Studio options (Tools -> Options) as shown below.

T-SQL:
SET IMPLICIT_TRANSACTIONS ON;
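
For instance, here is a minimal sketch of how this protects against an unintentional change (the table, column and values are illustrative):

-- With IMPLICIT_TRANSACTIONS ON (as set above), the first data modification
-- statement silently opens a transaction that stays open until you end it.
UPDATE dbo.Customers
SET Region = 'WA'
WHERE CustomerID = 'ALFKI';   -- suppose this change was unintentional

-- Nothing has been committed yet, so the mistake can be undone
ROLLBACK TRANSACTION;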

SSMS:

Print Page Alignment – SQL Server Reporting Services

There could be times when you see an SSRS report fit quite nicely on your monitor but things turn out quite differently when you actually hit the 'Print' button. At times this could be due to overlooked sections of the report, unnecessary white space, etc., but mostly it is due to bad printing alignment.
To specify proper print alignment for a report, one needs to measure its page size. You can do this by selecting Report menu -> View -> Ruler.

Try to assess the width of your report via the ruler (if you care about the height you can measure that too).

After assessing the width, go to Report menu -> Report Properties -> Page Setup. Here you can select various options like paper size, page units, width, height, margins, etc. Ensure that the width you specify here matches your width calculation via the ruler and has the necessary buffer for margins too. E.g. if the width via the ruler indicates that the report size is 10 inches, and your margins (left and right) are one inch each, the width in page setup should be 12 inches as shown below; alternatively, you may have to rework the report itself to decrease its width:

Hope this helps!!!

Architecting Data Warehouse

As the data in your applications grows, it's the database that usually becomes a bottleneck. It's hard to scale a relational DB, and the preferred approach for large-scale applications is to create separate databases for writes and reads. These databases are referred to as the transactional database and the reporting database. Though there are tools / techniques which allow you to create a snapshot of your transactional database for reporting purposes, sometimes they don't quite fit the reporting requirements of an enterprise. These requirements typically are data analytics, an effective schema (for an information worker to self-service herself), historical data, better performance (flat data, no joins), etc. This is where the need for a data warehouse or an OLAP system arises.

A key point to remember is that a data warehouse is mostly a relational database. It's built on top of the same concepts: tables, rows, columns, primary keys, foreign keys, etc. Before we talk about how data warehouses are typically structured, let's understand the key components that create a data flow between OLTP systems and OLAP systems. There are 3 major areas to it:

a) The OLTP system should be capable of tracking its changes, as all these changes should go back to the data warehouse for historical recording. For example, if an OLTP transaction moves a customer from the silver to the gold category, the OLTP system needs to ensure that this change is tracked and sent to the data warehouse for reporting purposes. A report in this context could be how many customers, divided by geography, moved from the silver to the gold category. In data warehouse terminology this process is called Change Data Capture. There are quite a few systems that leverage database triggers to move these changes to corresponding tracking tables. There are also out-of-the-box features provided by some databases, e.g. SQL Server 2008 offers Change Data Capture and Change Tracking for addressing such requirements.
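
As a rough sketch, assuming SQL Server 2008 or later and an illustrative dbo.Customer table, enabling Change Data Capture looks like this:

-- Enable CDC at the database level, then for the table whose changes should be tracked
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Customer',
    @role_name     = NULL;   -- NULL means no gating role is required to read the change data

-- Inserts, updates and deletes now accumulate in the change table (cdc.dbo_Customer_CT)
-- and can be pulled periodically into the warehouse.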

b) After we make the OLTP system capable of tracking its changes, we need to provision a batch process that runs periodically, takes these changes from the OLTP system and dumps them into the data warehouse. There are many tools out there that can help you fill this gap – SQL Server Integration Services happens to be one of them.

c) So we have an OLTP system that knows how to track its changes, and we have jobs that run periodically to move these changes to the warehouse. The question that remains is: how will the warehouse record these changes? This structural aspect in the data warehouse arena is often covered under something called Slowly Changing Dimensions (SCD). While we will talk about dimensions in a while, SCD can be applied to pure relational tables too. SCD enables a database structure to capture historical data. This creates multiple records for a given entity in the relational database, and data warehouses prefer having their own primary key for this, often known as a surrogate key.
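
A minimal sketch of a Type 2 slowly changing dimension, with illustrative table and column names, could look like this:

CREATE TABLE dbo.DimCustomer
(
    CustomerSK   INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key owned by the warehouse
    CustomerID   NVARCHAR(10)  NOT NULL,         -- business (natural) key from the OLTP system
    CustomerName NVARCHAR(100) NOT NULL,
    Category     NVARCHAR(20)  NOT NULL,         -- e.g. Silver / Gold
    ValidFrom    DATETIME      NOT NULL,
    ValidTo      DATETIME      NULL              -- NULL marks the current version
);

-- Moving a customer from Silver to Gold closes the current row (sets its ValidTo)
-- and inserts a new row, so reports can still see the Silver period in history.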

As I mentioned, a data warehouse is just a relational database, but the industry often attributes a specific schema style to data warehouses. These styles are the Star schema and the Snowflake schema. The motivation behind these styles is to create a flat database structure (as opposed to a normalized one), which is easy to understand / use, easy to query and easy to slice / dice. A star schema is a database structure made up of dimensions and facts. Facts are generally the numbers (sales, quantity, etc.) that you want to slice and dice. Fact tables hold these numbers and have references (foreign keys) to a set of tables that provide context around those facts. E.g. if you have recorded 10,000 USD as sales, that number would go in a sales fact table and could have foreign keys attached to it that refer to the sales agent responsible for the sale and to a time table which contains the dates between which that sale was made. These agent and time tables are called dimensions, which provide context to the numbers stored in fact tables. This schema structure, with the fact at the center surrounded by dimensions, is called a Star schema. A similar structure, with the difference that the dimension tables are normalized, is called a Snowflake schema.
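
To make the sales example concrete, here is a minimal star schema sketch (all table and column names are illustrative):

CREATE TABLE dbo.DimSalesAgent
(
    AgentSK   INT IDENTITY(1,1) PRIMARY KEY,
    AgentName NVARCHAR(100) NOT NULL
);

CREATE TABLE dbo.DimDate
(
    DateSK  INT PRIMARY KEY,     -- e.g. 20240131
    [Date]  DATE NOT NULL,
    [Month] TINYINT NOT NULL,
    [Year]  SMALLINT NOT NULL
);

-- The fact table sits at the center and references both dimensions
CREATE TABLE dbo.FactSales
(
    AgentSK     INT   NOT NULL REFERENCES dbo.DimSalesAgent(AgentSK),
    DateSK      INT   NOT NULL REFERENCES dbo.DimDate(DateSK),
    SalesAmount MONEY NOT NULL   -- the measure (fact) being sliced and diced
);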

This relational structure of facts and dimensions serves as an input for another analysis structure called a Cube. Though physically a cube is a special structure supported by commercial products like SQL Server Analysis Services, logically it's a multidimensional structure where dimensions define the sides of the cube and facts define the content. Facts are often called Measures inside a cube. Dimensions often tend to form a hierarchy. E.g. Product may be broken into categories and categories in turn into individual items. Category and Item are often referred to as Levels and their constituents as Members, with the overall structure called a Hierarchy. Measures are rolled up as per the dimensional hierarchy. These rolled-up measures are called Aggregates. Now this may seem like an overwhelming vocabulary to deal with, but don't worry, it will sink in as you start working with cubes.

To get a simplistic view of a cube you might want to think of it as a relational table – with rows representing 'measures' and columns representing 'dimensions'. Taking our product example, the cube can be expressed as the relational table shown below:

The relational table above shows data aggregated at a per-day, per-item level. As you would have observed, each row in the table contains a unique set of values. You can further aggregate this data as necessary in real time – e.g. to derive sales for a given item in a given month, or sales for a given category in a given month. So, at which level in the dimensional hierarchy should we aggregate data – hour level, day level or month level? The answer is – it depends on your reporting requirements. In this example, if the reporting always happens at a month level, there is no point in aggregating data at the day level and incurring additional runtime overhead. On the other side, if you plan to drill down to an hour level, you might have to rebuild the cube including hour as an additional dimension.
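
For instance, assuming the star schema sketched earlier were extended with a DimItem dimension (ItemSK, ItemName, Category) and an ItemSK column on FactSales, rolling the day/item data up to a month/category level at query time is just a GROUP BY:

SELECT d.[Year], d.[Month], i.Category, SUM(f.SalesAmount) AS MonthlySales
FROM dbo.FactSales AS f
JOIN dbo.DimDate AS d ON d.DateSK = f.DateSK
JOIN dbo.DimItem AS i ON i.ItemSK = f.ItemSK
GROUP BY d.[Year], d.[Month], i.Category;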

Let's see a few other terms that we will run into while talking about data warehouses.

ODS or an Operational Data Store is a frequently misused term. There will be a few users in your organization who want to report on the most current data and can't afford to miss a single transaction for their report. Then there is another set of users that typically doesn't care how current the data is – mostly senior-level executives who are interested in trending, mining, forecasting, strategizing, etc. and don't care about that one specific transaction. This is where an ODS can come in handy. An ODS can use the same star schema and OLAP cubes we saw earlier. The only difference is that the data inside an ODS is short lived, i.e. kept for a few months, and the ODS syncs with the OLTP system every few minutes. The data warehouse can periodically sync with the ODS, either daily or weekly, depending on business drivers.

Data marts are another frequently talked about topic in data warehousing. They are subject-specific data warehouses. Data warehouses that try to span an enterprise are normally too big to scope, build, manage, track, etc. Hence they are often scaled down to something called a data mart, which supports a specific segment of the business like sales, marketing, or support. Data marts, too, are often designed using the star schema model discussed earlier. The industry is divided when it comes to the use of data marts. Some experts prefer having data marts along with a central data warehouse. The data warehouse here acts as an information staging and distribution hub, with the spokes being data marts connected via data feeds serving summarized data. Others eliminate the need for a centralized data warehouse, citing that most users want to report on detailed data.

COUNT(*) vs. COUNT(ColumnReference)

Most of us are familiar with this, but one of my team members recently got stuck with his query. The difference is that COUNT(*) counts every row, including rows where the expression would be NULL, while COUNT(column_reference) skips rows where that column is NULL. As an example, consider the two queries below for the Northwind database that find the number of orders placed by each Northwind customer. With the LEFT OUTER JOIN, a customer with no orders still produces one row (with NULL order columns); COUNT(*) counts that row and reports 1 instead of 0, while COUNT(O.OrderId) correctly ignores it:

WRONG:
SELECT C.CompanyName, COUNT(*) AS NoOfOrders
FROM dbo.Customers AS C LEFT OUTER JOIN dbo.Orders AS O
ON C.CustomerID = O.CustomerID
GROUP BY C.CompanyName
ORDER BY NoOfOrders

RIGHT:
SELECT C.CompanyName, COUNT(O.OrderId) AS NoOfOrders
FROM dbo.Customers AS C LEFT OUTER JOIN dbo.Orders AS O
ON C.CustomerID = O.CustomerID
GROUP BY C.CompanyName
ORDER BY NoOfOrders

SQL Server Reporting Services (SSRS) Architecture Overview

This is level 100 for people trying to figure out how SQL Server Reporting Services (SSRS) works and what makes it work. Honestly, I haven't found SSRS architecture explained that clearly anywhere. I had given up in the past, as the search engines weren't leading to any easy-to-understand sources. A few days back I had to give a management presentation on migrating to SSRS. So with my back to the wall I had little option but to dive in. During the journey I came across a few distilled facts that I am sharing in this blog post. Hope you find them useful 🙂 .

SSRS is an optional package which you can select to install while installing SQL Server. SSRS in turn is made up of a number of components. The simplest diagram I could find that describes these components and their deployment was from TechNet.

As you can see in the above diagram, when you install SSRS it creates the Report Server databases in your SQL Server instance. These databases are ReportServerDB and ReportServerTempDB, which are used to store report configurations and other things, including caching, session state, etc., that improve the overall performance. You have the option of installing the other components, like Report Manager and Report Server, on the same machine where the SQL Server instance is running, or you can install them on a different server (the typical enterprise setup). An important thing to note here is that if you opt for the latter you would end up paying for 2 or more SQL Server licenses.

As it turns out, there are 3 distinct components of SQL Server Reporting Services:

1) Report Server: It's an overloaded term, largely used to indicate the set of components that allow interaction with the Report Server database. SSRS provides Web Services (.asmx) which allow LOB applications to interact with the Report Server database directly (http://computername/ReportServer/reportservicexxxx.asmx where xxxx is the version of SSRS). SSRS 2005 created virtual directories for Report Server & Report Manager (discussed next), but SSRS 2008 leverages the OS-level HTTP listener, making SSRS independent of IIS. This allows bundling of Report Server & Report Manager within a Windows service, ReportingServicesService.exe. The name of this Windows service is ReportServer. ReportingServicesService.exe functionality also includes report processing, scheduling (auto-generated reports), subscriptions (mailers), etc.

2) Report Manager: An ASP.NET web-based application (http://computername/Reports) that in turn interacts with the Report Server Web Services. As the name indicates, Report Manager allows you to manage reports in terms of configuring security access, organizing them into folders (none of these folders map to physical directories; they are stored as details in the Report Server database), subscribing to them, etc. One can also create reports (see the next point as to how) and deploy them to the Report Server database using Report Manager. This is handy for some restricted user / production scenarios, though most developers prefer to deploy reports from BI Studio. As discussed earlier, with SSRS 2008 this component is bundled with the ReportServer Windows service.

3) Report Designer: There will be a few people on your team whom you may want to designate as report designers. Report designers can design reports using VS.NET Business Intelligence projects (Report Server Project). Report designers create data sources (normally a shared data source (.rds) that's used across a set of reports), create datasets (using queries / stored procedures on top of the data source), define relevant report parameters (mapped to datasets for value retrieval via the Report Data window; they can be passed in from .NET applications too), set field formats (using the properties window with pervasive VB expressions – e.g. formatting a textbox to display currency decimals) and create layouts (e.g. grouping). Once they are done designing their reports (.rdl files – described later) they can test (preview) them and publish them via Report Server (this is done by providing the Report Server URL in the project properties; SSRS then creates a specific folder for your project). Once published, these reports are available for end-user consumption. Advanced scenarios like interacting with Excel may require a third-party product like OfficeWriter.

There are a few other important aspects of SSRS which one should be familiar with.

Report Builder is another tool, targeted at business users who want to generate custom reports on the fly. Report Builder is a ClickOnce application, intuitive and easy to use, but it doesn't support all the options available with VS.NET. It's also possible to install Report Builder as a standalone application.

Report Model is the base for report creation with Report Builder. It's a simplified view of the relational database, targeted at business users for ad hoc report creation. Report models are created using BI Development Studio (Report Model Project – .smdl files). A report model is built on top of a Data Source View (.dsv) that defines a logical model based on one or more data sources. Generated models mainly consist of entities (relational tables), fields (attributes of a relational table) and roles (entity relationships – 1-1, 1-*, *-1). Models also contain other attributes, like aggregate values, that help ease reporting for end users. Post creation, a report model has to be deployed in a similar way to reports. You can also use a report model as a data source for generating reports via Report Designer. While it's easy to deploy a report model from BI Development Studio, deploying a report model manually (e.g. in production) requires you to merge the .smdl and .dsv files.

RDL – this is another term you would run into while talking about SSRS. RDL stands for Report Definition Language. This is an XML file which stores the query information, data source information, etc. required to generate the report. There is another type of report definition – RDLC (Report Definition Language Client-side) – which doesn't store any of the above configuration. RDLC is a client-side component (VS.NET application projects) to which you can pass data (e.g. via a DataSet) coming from any of your data sources. RDLC can be useful for scenarios like implementing custom pagination (SSRS 2005 pagination by default is client-side pagination).

SSRS security is primarily Windows based. When a user accesses the Report Manager application or the ASMX Web Services, he has to authenticate with a valid domain username / password. On successful login SSRS determines the role of the user (custom, or built-in ones like Browser / Content Manager, etc.) and displays only those reports / folders to which the user has access. There are a few variations of the security implementation I have come across that don't rely on Windows authentication. Some projects restrict roles like Content Manager to pushing reports (.rdl files) to production with the help of rs.exe; all users then have an implicit role of Browser, and application-layer security determines which reports the user should have access to. In case you want to go ahead and roll out custom authentication that flows security all the way down, SSRS allows you that too. If you are generating reports by connecting to remote data sources, for accessing images, etc., you might have to configure the Unattended Execution Account.

Deploying Reports to Production – This is mainly done in three ways. In restricted production environments rs.exe can be used. In others we can deploy the reports directly from Visual Studio or use Report Manager (discussed earlier). This normally requires you to change your data sources and Report Server URL in the project properties. There is an overwrite property for data sources, which is normally set to false. This property helps ensure that you don't accidentally overwrite production data sources during your deployment. We can deploy individual reports too in case we have a specific modification.

Integrating SSRS into your applications – I will focus mainly on ASP.NET here. We need the ReportViewer control found in the toolbox. Drag and drop it onto your ASPX page, and in the background it will add a reference to the Microsoft.ReportViewer.WebForms DLL. You may need to bundle this DLL with your application package, as in production the web server and Report Server are mostly on different machines. Below is the typical markup found in the .ASPX page (I have hardcoded the report server URL for simplicity)

<rsweb:ReportViewer ID="ReportViewer1" …>
    <ServerReport ReportPath="/TargetReportFolder/OpenTicketReport" ReportServerUrl="http://ReportServerURL/ReportServer" />
</rsweb:ReportViewer>

One can also pass any necessary report parameters from the code-behind file, as below (the parameter name here is illustrative):

ReportParameter[] param = { new ReportParameter("TicketStatus", "Open") };  // illustrative parameter
this.ReportViewer1.ServerReport.SetParameters(param);

In case your web server and Report Server are located on different machines, you need to ensure that the worker process running the application on the web server has access to the reports (you can configure the same using Report Manager security).

Finally here’s a good link to tune SSRS reports.

Inputs/Thoughts/Suggestions/Corrections?

What is .LDF – SQL Server?

I was quite surprised to see search engines turning up very little information on this one, yet it's a frequently asked question. The LDF acronym stands for Log Data File (though it could have been called the Transactional Log Data File). This is a must-have file, and it is created at the same time the database file (.mdf) is created by SQL Server.

So why do you need it? For keeping the .mdf consistent. To cut it short, I will copy these lines straight from the wiki:
“SQL Server ensures that any change to the data is ACID-compliant, i.e., it uses transactions to ensure that any operation either totally completes or is undone if fails, but never leaves the database in an intermediate state. Using transactions, a sequence of actions can be grouped together, with the guarantee that either all actions will succeed or none will. SQL Server implements transactions using a write-ahead log. Any changes made to any page will update the in-memory cache of the page, simultaneously all the operations performed will be written to a log, along with the transaction ID which the operation was a part of. Each log entry is identified by an increasing Log Sequence Number (LSN) which ensure that no event overwrites another. SQL Server ensures that the log will be written onto the disc before the actual page is written back. This enables SQL Server to ensure integrity of the data, even if the system fails. If both the log and the page were written before the failure, the entire data is on persistent storage and integrity is ensured. If only the log was written (the page was either not written or not written completely), then the actions can be read from the log and repeated to restore integrity. If the log wasn’t written then integrity is also maintained although the database state remains unchanged as if the transaction never occurred. If it was only partially written, then the actions associated with the unfinished transaction are discarded. Since the log was only partially written, the page is guaranteed to have not been written, again ensuring data integrity. Removing the unfinished log entries effectively undoes the transaction. SQL Server ensures consistency between the log and the data every time an instance is restarted.”

Are there any other uses of the .LDF? Yes, it can be used for Log Shipping, a high-availability solution.

What are the recommended optimizations for the LDF file? The LDF could be stored on a separate disk with a dedicated disk controller in high-volume scenarios. Growth of the LDF should be managed as a daily maintenance activity, as it can easily eat up the available storage capacity. A 2-disk RAID 1 is also a good trade-off for the LDF in terms of performance vs. failure protection (if you are able to invest money in disks, RAID 10 is good for the MDF (faster reads)).
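
As a rough sketch, with an illustrative database name (Sales) and logical log file name (Sales_log), monitoring and capping log growth could look like this:

-- Check the current log size and the percentage of log space used for every database
DBCC SQLPERF(LOGSPACE);

-- Give the log a fixed size and a sensible growth increment instead of unlimited growth
ALTER DATABASE Sales
MODIFY FILE (NAME = Sales_log, SIZE = 4GB, FILEGROWTH = 512MB);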

Hope that helps you get started 🙂 .

Cost Based Optimization (CBO) vs. Rule Based Optimization (RBO)

These terms were brought up in a recent meeting, and I decided to dig into them. These are the optimization strategies used by database engines for executing a query or a stored procedure. They come into the picture after a query or stored procedure is compiled and is just about to execute (most databases also cache these generated execution plans). The topic of optimization strategies & their differences can be a huge one (I guess one could write a book on it), but in this post I will try to keep things simple, at a definition level. (An analogy here could be that you want to travel from destination A to B, & you have several routes to pick from.)

Rule Based Optimization: This is an old technique. Basically, the RBO uses a set of rules to determine how to execute a query. E.g. if an index is available on a table, an RBO rule could be to always use that index (an RBO rule for our travel analogy could be to avoid all routes with speed breakers). As it turns out, this is simpler to implement but not always the best strategy, and it can backfire. A classic example of indexing a gender column is shown here in a similar post. RBO was supported in earlier versions of Oracle. (SQL Server supports table hints which in a way can be compared to RBO, as they force the optimizer to follow a certain path.)
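
For instance, a table hint like the one below pins the optimizer to a specific access path, RBO-style (the index name is illustrative; the tables are from the Northwind sample database used earlier):

SELECT OrderID, OrderDate
FROM dbo.Orders WITH (INDEX(IX_Orders_CustomerID))
WHERE CustomerID = 'ALFKI';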

Cost Based Optimization: The motivation behind CBO is to come up with the cheapest execution plan available for each SQL statement. The cheapest plan is the one that will use the least amount of resources (CPU, memory, I/O, etc.) to get the desired output (in relation to our travel analogy this could be petrol, time, etc.). This can be a daunting task for the DB engine, as complex queries can have thousands of possible execution paths, and selecting the best one can be quite expensive. For more information on CBO I suggest you go through "Inside MS SQL Server 2005 T-SQL Querying". CBO is supported by most databases, including Oracle, SQL Server, etc.
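
To get a feel for the resource cost the optimizer is weighing, you can measure the I/O and CPU a query consumes; here is a minimal sketch reusing the Northwind query from earlier:

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT C.CompanyName, COUNT(O.OrderID) AS NoOfOrders
FROM dbo.Customers AS C
LEFT OUTER JOIN dbo.Orders AS O ON C.CustomerID = O.CustomerID
GROUP BY C.CompanyName;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;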

(N.B. If you find that the execution plan selected by the DB engine is not the optimal one, you can try breaking your query into smaller chunks or changing the query logic.)

As a programmer you should strive to ensure that cached query plans are reused as much as possible. One of the techniques that can get you going is using parameterized queries, and this turns out to be important even if you are using an O/R mapper like NHibernate, as shown in this post. A related topic with CBO is that of statistics. Statistics determine the selectivity of the indexes. If an indexed column has unique values then the selectivity of that index is higher, as opposed to an index with non-unique values. The query optimizer uses these statistics in determining whether or not to choose an index while executing a query. Some situations under which you should update statistics: a) if there is a significant change in the key values of the index; b) if a large amount of data in an indexed column has been added, changed, or removed, or the table has been truncated using the TRUNCATE TABLE statement and then repopulated; c) the database has been upgraded from a previous version. One can use UPDATE STATISTICS / sp_updatestats to update statistics for a table or an index.
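
Here is a minimal sketch of both ideas, a parameterized query that lets the cached plan be reused and a statistics refresh, reusing the Northwind table names for illustration:

-- Parameterized query via sp_executesql so the cached plan is reused across parameter values
EXEC sp_executesql
    N'SELECT OrderID, OrderDate FROM dbo.Orders WHERE CustomerID = @CustomerID',
    N'@CustomerID NCHAR(5)',
    @CustomerID = N'ALFKI';

-- Refresh statistics after large data changes
UPDATE STATISTICS dbo.Orders;   -- a single table (optionally a specific index)
EXEC sp_updatestats;            -- or all tables in the current database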

I will look forward to hearing your thoughts on the above.

(P.S. TOAD from Quest is a very useful tool if you want a deep dive into execution plans; just feed your query / SP to it and it will provide many alternative plans, suggesting optimizations & indexes.)

Snapshot vs. LogShipping vs. Mirroring vs. Replication vs. Failover Clustering

All these SQL Server terms were quite confusing for me. Luckily, I got to attend Vinod's session last Saturday at BDOTNET's UG meet. So I am going to jot down my understanding of them, and will look forward to reading your comments on it.

1) A Snapshot is a static, read-only picture of a database at a given point in time. Snapshots are implemented by copying a page (8KB for SQL Server) at a time. For example, assume you have a table in your DB and you want to take a snapshot of it. You specify the physical coordinates for storing the snapshot, and whenever the original table changes, the affected pages are first copied to the snapshot and only then are the changes applied to the DB. (N.B. There is also something called the Snapshot Isolation Level, which is different from a database snapshot. A minimal T-SQL sketch of creating a database snapshot appears after this list.)

Usage Scenario: You have a separate DB for report generation and want to ensure that the latest data is available for it. You can periodically take a snapshot of your transactional database.

2) Log Shipping is an old technique, available since SQL Server 2000. Here the transaction log (.ldf) is backed up and shipped periodically to the standby server. If the active server goes down, the standby server can be brought up by restoring all shipped logs.

Usage Scenario: You can cope with a longer downtime, and you have limited investments in terms of shared storage, switches, etc.

3) Mirroring, which was introduced with the 2005 edition, works on a principle similar to log shipping. The main difference is that the downtime for the standby server is much shorter with mirroring. The standby server automatically becomes active in this case (with the help of an arbitrating server, called a Witness in SQL Server parlance), without having to restore logs (the log records are continuously applied in this scenario – no wonder it's called a mirror 🙂 ). Additional advantages of mirroring include support at the .NET Framework level (read: no switching / routing code – requires ADO.NET 2.0 & higher) plus some newer features like Auto Page Recovery introduced with SQL Server 2008.

Usage Scenario: You want very little downtime and also a cost-effective solution in terms of shared storage, switches, etc. Also, you are targeting a single database that easily fits on your disks.

4) Replication is used mainly when data centers are distributed geographically. It is used to replicate data from local servers to the main server in the central data center. The important thing to note here is that there are no standby servers; both the publisher and the subscriber are active.

Usage Scenario: A typical scenario involves periodically syncing local / regional lookup servers with the main server in the data center for better performance, or syncing with a remote site for disaster recovery.

5) Failover Clustering is a high-availability option only (unlike the others above, which can be used for disaster recovery as well) and relies on the clustering technology provided by the hardware + OS. Here the data / databases don't belong to either server; they in fact reside on shared external storage like a SAN. The advantage of SAN storage is large, efficient, hot-pluggable disk capacity. You might see DR options like mirroring used quite frequently alongside failover clustering. Here's a good article on adding geo-redundancy to a failover cluster setup.
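
Coming back to database snapshots (point 1 above), here is a minimal T-SQL sketch of creating one; the database name, logical file name and snapshot path are illustrative:

-- Create a point-in-time, read-only snapshot of the Sales database
CREATE DATABASE Sales_Snapshot_0800
ON (NAME = Sales_Data, FILENAME = 'E:\Snapshots\Sales_Snapshot_0800.ss')
AS SNAPSHOT OF Sales;

-- Reports can now query Sales_Snapshot_0800 while the Sales database keeps changing.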

You might want to look at the licensing options for SQL Server, the various editions available and how they map to the above features. You can find this information in detail here.

Hope that helps to some extent 🙂 .