It is critical that we manage character data robustly and consistently in an enterprise data warehouse (EDW). This is done, in large part, through the use of an appropriate collation.

Note that many SQL Server databases utilize (largely because it was the default collation for a very long time) the SQL_Latin1_General_CP1_CI_AS collation. But that collation does not support Unicode data (which is critical in modern applications) and is an obsolete collation intended for backwards compatibility to SQL Server 2000 installations.

In this article, I explain why I recommend the Latin1_General_100_CI_AS_SC_UTF8 collation for most databases in an EDW.

Please note that I do not have long-term experience with this collation. So, my recommendation should be taken with a large grain of salt. Hopefully, though, it will help you think through the same question.

Background

A database created without specifying a collation will inherit that of the server. If the server’s collation is not what you intend, create each EDW database with an explicit specification of the pertinent collation with a statement such as:

CREATE DATABASE Solutions COLLATE Latin1_General_100_CI_AS_SC_UTF8;

For a list of all collations available for use with SQL Server:

SELECT * FROM sys.fn_helpcollations();

Choices

To choose a collation, a number of related choices must first be made because that set of choices dictates the collation that is to be used per:

Windows Collation Name (Transact-SQL) - SQL Server | Microsoft Learn

I present here those choices and my guidance for each.

SQL vs Windows

Microsoft recommends using Windows collations instead of legacy collations established for early versions of SQL Server.

I agree and recommend: Windows collation

Collation designator (aka culture/locale)

There are many to choose from, but most US-based organizations and more utilize Latin1_General.

I agree and recommend: Latin1_General

Code page

Assuming you use a Windows collation (per above), the code page is determined by the other choices.

Version

Most of the newer collations were published with v10.0 (aka SQL Server 2008), including those that Microsoft frequently recommends (including the Latin1_General collations).

No Latin collations were introduced in v14.0 (140), so the most recent Latin collations are for v10.0 (100).

I recommend: 100

Case-sensitivity

Most orgs I have supported have utilized case-insensitive collations successfully for years. I don’t know of any reason to change.

I, therefore, usually recommend: CI

Accent-sensitivity

Most orgs I have supported have utilized accent-sensitive collations successfully for years. I don’t know of any reason to change.

I, therefore, recommend: AS

Kana-sensitivity

Most US-based organizations have no need to use Kana-sensitive collations.

I, therefore, usually recommend: none

Width-sensitivity

Most orgs I have supported have never used width-sensitive collations. I don’t know of any reason to change.

I, therefore, recommend: none

Variation selector-sensitivity

Most orgs I have supported have not used variation selector-sensitive collations. I don’t know of any reason to change.

I, therefore, recommend: none

Supplementary character support

To fully support Unicode supplementary characters in built-in functions, one must specify the “SC” option.

Note: This option is inapplicable if a version 140 collation is utilized.

Auptimal recommends: SC

Binary sorting

If an org desires binary sorting (and comparisons) as opposed to the linguistic-driven rules that would otherwise be used for Latin characters, they would specify this option. I do not recall any org that I have supported ever using binary sorting of character data and generally doubt there is good reason to do so.

I recommend: none

UTF-8

Onemay choose to use a UTF-8-enabled collation by specifying this option.

In general, Auptimal recommends use of UTF-8 encoding (as opposed to UCS-2 or UTF-16) encoding because:

· It is fully compatible with basic ASCII.

· It is more efficient (space and transmission) for English systems.

· It has no endian (byte order) problems.

· It is compatible with NULL-terminated strings.

· It is compatible with byte-oriented networks.

· Since SQL Server 2019, CHAR and VARCHAR columns will automatically use UTF-8 encoding, eliminating the need to declare the columns as NCHAR or NVARCHAR (which use less efficient UTF-16 or UCS-2 encodings).

Use of a UTF-8-enabled collation causes standard data types (CHAR, etc) to support Unicode content automatically, thereby removing the need to utilize nationalized data types (NCHAR, etc).

If a UTF-8-enabled collation is utilized, care must be taken when declaring the size of a character column (because UTF-8 is a variable-length encoding).

I recommend: UTF8

Recommendation

Given the above recommendations for each collation-related choice, I recommend the Latin1_General_100_CI_AS_SC_UTF8 collation for most databases in an EDW.