End Point

News

Welcome to End Point's blog

Ongoing observations by End Point people.

Case Sensitive MySQL Searches

MySQL's support for case sensitive search is explained somewhat opaquely in the aptly titled Case Sensitivity in String Searches documentation. In short, it explains that by default, MySQL won't treat strings as case sensitive when executing a statement such as:

SELECT first_name FROM contacts WHERE first_name REGEXP '^[a-z]';

This simple search to look for contacts whose first name starts with a lower case letter, will return *all* contacts because in the default character set used by MySQL (latin1), upper and lower case letters share the same "sort value".

UPDATE: After many helpful comments from readers, it would seem the term I should have used was collation, not sort value. The documentation for both MySQL and PostgreSQL have lengthy discussions on the topic.

Enough with the backstory, how do I perform case sensitive searches!

The docs say to convert the string representation to a binary one. This allows "comparisons [to] use the numeric values of the bytes in the operands". Let's see it in action:

SELECT first_name FROM contacts WHERE BINARY(first_name) REGEXP '^[a-z]';

There are other strategies available, such as changing the character set being used for comparisons with the COLLATE function. This would likely work better for cases where you had many columns to compare.

SELECT first_name FROM contacts WHERE first_name REGEXP '^[a-z]' COLLATE latin1_bin;

You can even go so far as to have MySQL switch character sets and collations. But you do have to do this for each database, each table, and each column you need to convert. Not terribly fun.

6 comments:

Anonymous said...

Here's the documentation you're looking for: http://dev.mysql.com/doc/refman/5.5/en/charset-general.html

The sort order you're looking for is referred to as 'collation' in the manual.

If you use a case sensitive collation, you can get your first query to run without any modifications:

mysql [localhost] {msandbox} (test) > show create table contacts\G
*************************** 1. row ***************************
Table: contacts
Create Table: CREATE TABLE `contacts` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`first_name` char(40) COLLATE latin1_general_cs DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=8 DEFAULT CHARSET=latin1 COLLATE=latin1_general_cs
1 row in set (0.04 sec)

mysql [localhost] {msandbox} (test) > select * from contacts;
+----+------------+
| id | first_name |
+----+------------+
| 5 | aaron |
| 6 | Alfred |
+----+------------+
2 rows in set (0.00 sec)

mysql [localhost] {msandbox} (test) > SELECT first_name FROM contacts WHERE first_name REGEXP '^[a-z]';
+------------+
| first_name |
+------------+
| aaron |
+------------+
1 row in set (0.00 sec)

If you pick the collation for each table based on the queries you expect to execute more often, this shouldn't present a significant challenge.

Regards,

Fernando.

Gregory Haase said...

I've seen too many apps trying to recover from case-sensitive search issues in the past to want to go there by default.

I've seen apps storing values wrapped in UPPER() so that they can wrap search terms in UPPER() and get a match

I've seen apps with function based indexes on UPPER(column_name).

And I've seen plenty of instances of:
WHERE upper(search_term) = upper(column_name) where the developers can't understand why an index on column_name isn't being used.

I've also had to run my share of looking for records where UPPER(column_name) <> column_name to fix records that were stored incorrectly.

I'd rather have a collation I can set on the one table where I actually need case sensitive search - and have everything else work correctly by default.

Brian Buchalter said...

@Fernando: Thanks for pointing out that documentation. It's certainly much clearly than some of the other options out there. I agree that hopefully, you are able to create tables with the character set and collate options you want, but often times we inherit systems which just went with MySQL's defaults. Such is life! But as I said, it is possible to change it. I'm sure there are some automated tools available for such things.

Brian Buchalter said...

@Gregory: Yes, things can get ugly quick. There are also cases when developers bring massive amounts of data into application memory to do sorting and other manipulation best handled by the database. It's important to know the right tool for the job and how to use it.

Anonymous said...

Brian,

To answer 'That was one of the few authoritative references I could find to a character set's "sort value", so perhaps there is a more technical term for it. I'd love to hear it.'

First try the PostgreSQL manual at http://www.postgresql.org/docs/9.2/static/collation.html then move on to http://dev.mysql.com/doc/refman/5.6/en/charset-general.html which explains:

"A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set. Let's make the distinction clear with an example of an imaginary character set.

Suppose that we have an alphabet with four letters: “A”, “B”, “a”, “b”. We give each letter a number: “A” = 0, “B” = 1, “a” = 2, “b” = 3. The letter “A” is a symbol, the number 0 is the encoding for “A”, and the combination of all four letters and their encodings is a character set.

Suppose that we want to compare two string values, “A” and “B”. The simplest way to do this is to look at the encodings: 0 for “A” and 1 for “B”. Because 0 is less than 1, we say “A” is less than “B”. What we've just done is apply a collation to our character set. The collation is a set of rules (only one rule in this case): “compare the encodings.” We call this simplest of all possible collations a binary collation.

But what if we want to say that the lowercase and uppercase letters are equivalent? Then we would have at least two rules: (1) treat the lowercase letters “a” and “b” as equivalent to “A” and “B”; (2) then compare the encodings. We call this a case-insensitive collation. It is a little more complex than a binary collation. "

You might then move on to http://dev.mysql.com/doc/refman/5.6/en/adding-collation.html and finally http://dev.mysql.com/doc/refman/5.6/en/adding-collation-simple-8bit.html for a much more in-depth description of how to create the internals of a collation.

You can get no cost help from people familiar with MySQL in the #mysql channel. Similarly for the PG equivalent. People in either would have been able to help you quickly.

Your company may have its own chat channel somewhere and asking an expert there would probably also have got you a fast answer, because it's a normal part of introductory courses to cover this.

You might also consider taking a course or two, since it can save you a lot of time later. Or buy the MySQL Certification Study Guide and read it from cover to cover. It provides an excellent introduction to a broad range of topics and you might then consider going on to get certified.

Views are my own, for my employer's opinion, consult a PR person.

James Day, MySQL Senior Principal Support Engineer, Oracle

Brian Buchalter said...

@James: Thanks for the links, I've updated my post. I used the phrase "sort value" because that's the language found in the Case Sensitivity in String Searches documentation.

Perhaps it would be helpful to update the document to provide the references to collation?