CommitDBA

Database Management Simplified

Ross Group Inc  MemberzPlus  Blogs  |  News  |  Careers  |  Contact    Twitter CommitDBA LinkedIn CommitDBA

CommitDBA Blogs


See What we are Talking About!

CommitDBA Blog


Welcome to the site!

Applying SQL Set Operators

By: Frank Wright, CommitDBA staff Database Administrator

November 11, 2011


Situation

You are a Developer/DBA and handle all sorts of data analysis requests in addition to your normal responsibilities. This includes comparing data from different sources and mining data for correlations and trends. Comparison and analysis of data can frequently be solved using SET operators available with SQL. You’ve been tasked with the following requests and we will review SET operators and how to apply them to these requests.

  • Build a query for a report that summarizes total sales by product by year and include data from the archive order tables in addition to the live order tables.
  • Work with the local college to get an Internship candidate list of MIS students that have a 90 or better average in both Business and Computer Science Classes.
  • Find cities where customer leads reside that are not covered by your company’s current sales force.
  • Compare data between the development and production databases after testing a new version of an application.
  • Mine data for correlation of two products being purchased together for purposes of offering a promotion to increase multiple item sales.

Technical Review

Overview

SET operators, sometimes called vertical joins, are used to combine/compare 2 result sets into a single query and therefore return 1 consolidated result set. A query that contains a SET operator is called a compound query. Each individual result set inside the compound query is called a component query. Here is the list of SET operators supported in ANSI standard SQL

  • UNION [ALL] - Combines 2 component sets into 1 (OR)
  • INTERSECT [ALL] - Returns a set with rows that are common to both component sets. (AND)
  • EXCEPT [ALL] - Returns a set with rows from the 1st set that are NOT in the 2nd set (1-sided XOR) This is the only SET operator where the output changes based on which component set comes first

The inclusion of the ALL modifier retains cardinality in the final result set and its exclusion removes duplicate rows from the final result set (regardless of source).

A single query can contain more than 1 SET operator and therefore more than 2 component queries. Since the SET operators are placed between component queries you will always have 1 more component query than you have SET operators. The SET operator(s) you use and the order you use(when more than 1 is used) them determine the content of the returned result set. In most RDBMS platforms, the precedence of all SET operators is equal and processed in order of occurrence in the query (TOP to BOTTOM). Just as in algebra this precedence can be overridden with parentheses.

Caveats

  • Oracle – The EXCEPT operator is called MINUS in this platform
  • ALL - Almost all RDBMS platforms support UNION ALL, however, it is not very common to find an explicit direct implementation of [INTERSECT|EXCEPT] ALL in commercial RDBMSs. Workarounds utilizing an OLAP row_number() function or aggregate count(*) are available, however, these workarounds are outside the scope of this document.
  • Union Compatibility - Restrictions and requirements of the component sets in a compound query.

    Column counts – The 2 component sets must have the same number of columns.

    Datatypes – The corresponding columns between component sets must be of the same datatype or the datatype of a column in the 2nd set must be implicitly convertible to the datatype of the corresponding column in the 1st set.

    Nulls - Most RDBMS implementations will only auto convert an explicit NULL to a string data type, however, newer RDBMS versions are implementing an auto conversion of NULLs to any datatype.

  • NULLs -- NULLs usually cause a logical comparison operator to evaluate to NULL (ie neither true nor false). In SET operations, however, they are treated as a value and therefore are considered not equal to non-NULLs and are considered equal to other NULLs
  • Column Names – Column Names are derived from the column names and aliases defined in the 1st set of a compound query. Column names in subsequent components need not be the same as they are ignored.
  • ORDER BY – This clause is only allowed at the very end of the compound query (applying to the final result set) and not allowed in any component query.
  • Datatypes – Only Scalar, non-LargeObject, datatypes can be used in SET operations. Arrays, nested tables, LOBs and other large or complex datatypes cannot be used.

Syntax and Examples

Generic Syntax:

SELECT <column set> FROM <table1> WHERE …
{UNION | INTERSECT | EXCEPT} [ALL]
SELECT <column set> FROM <table2> WHERE …

tblEmployee1

tblEmployee2

UNION: Show a combined list of employee names from both tables. Gray Highlights show the record(s) that would only be returned with the inclusion of the ALL modifier

SELECT EmployeeName FROM tblEmployee1
UNION [ALL]
SELECT EmployeeName FROM tblEmployee2

EmployeeName
Jimmy
Joseph
Jenny
Jillian
Jeffrey
Johnny
Josephine
Jeffrey
Jack
Jenny

INTERSECT: Show a list of employee names that are in both tables. In this particular example the ALL modifier wouldn’t affect the returned resultset.

SELECT EmployeeName FROM tblEmployee1
INTERSECT [ALL]
SELECT EmployeeName FROM tblEmployee2

EmployeeName
Jenny
Jeffrey

EXCEPT: Show a list of employees in the 1st table that are NOT in the 2nd table. In this particular example the ALL modifier wouldn’t affect the returned resultset.

SELECT EmployeeName FROM tblEmployee1
EXCEPT [ALL]
SELECT EmployeeName FROM tblEmployee2

EmployeeName
Jimmy
Joseph
Jillian

Solutions & Tips

Request #1: Build a query for a report that summarizes total sales by product by year and include data from the archive order tables in addition to the live order tables.

SELECT p.UPC, p.Description, LEFT(CONVERT(varchar,oh.OrderDate,102),4) as Year
,SUM(od.ExtendedPrice)
FROM tblOrderHeader oh
INNER JOIN tblOrderDetail od ON (od.OrderHeaderID = oh.OrderHeaderID)
INNER JOIN tblProduct p ON (p.UPC = od.UPC)
GROUP BY p.UPC, P.Description, LEFT(CONVERT(varchar,oh.OrderDate,102),4)
UNION ALL
SELECT p.UPC, p.Description, LEFT(CONVERT(varchar,oh.OrderDate,102),4) as Year
,SUM(od.ExtendedPrice)
FROM tblOrderHeaderArchive oh
INNER JOIN tblOrderDetailArchive od ON (od.OrderHeaderID = oh.OrderHeaderID)
INNER JOIN tblProduct p ON (p.UPC = od.UPC)
GROUP BY p.UPC, P.Description, LEFT(CONVERT(varchar,oh.OrderDate,102),4)

This solution utilizes a UNION ALL to merge these two tables together. Now we can have a report that contains data from both the live system and the entire archive. The ALL modifier doesn’t affect the result (assuming the Archive tables contain data mutually exclusive to the live tables) but it affects performance as the RDBMS doesn’t have to perform any filtering (and possibly sorting) tasks.

Request #2: Work with the local college to get an Internship candidate list of MIS students that have a 90 or better average in both Business and Computer Science Classes.

SELECT s.StudentID, s.StudentName
FROM tblStudent s
INNER JOIN tblStudentClasses sc ON (sc.StudentID = s.StudentID)
INNER JOIN tblClasses c ON (c.ClassID = sc.ClassID)
WHERE c.DepartmentCode IN (‘ACC’, ‘FIN’, ‘ECO’, ‘MKT’)
GROUP BY s.StudentID, s.StudentName
HAVING AVG(sc.GradePercent) >= 90
INTERSECT
SELECT s.StudentID, s.StudentName
FROM tblStudent s
INNER JOIN tblStudentClasses sc ON (sc.StudentID = s.StudentID)
INNER JOIN tblClasses c ON (c.ClassID = sc.ClassID)
WHERE c.DepartmentCode IN (‘MIS’, ‘CSC’, ‘CMP’)
GROUP BY s.StudentID, s.StudentName
HAVING AVG(sc.GradePercent) >= 90

This solution utilizes an INTERSECT to return only the common records of those students who had an A average in their Business classes AND who had an A average in their Computer Science classes. These are the students we want to interview for the Internship position. Anytime you need to find entities HAVING certain aggregate values based on different criteria you can always INTERSECT the different criteria.

Request #3: Find cities where customer leads reside that are not covered by your company’s current sales force.

SELECT cl.City, cl.State
FROM tblCustomerLead cl
EXCEPT
SELECT s.City, s.State
FROM tblSalesPerson s

This solution utilizes an EXCEPT to eliminate the cities in which we already have a Sales Person from the list of cities where we might be able do business. This gives us locations where we might need to enhance our sales force. Changing this statement to an INTERSECT and then using it as a subquery to the SalesPerson table would give us the SalesPersons who can currently act on new leads.

From database assessments to complete database management, CommitDBA can assist your company today. For more information about how CommitDBA can work for you – give us a call at (800) 734-9304.



Got database support needs?
Professional and Affordable DBA and Data Services

Contact us Today