Database Best Practices

This site uses Entity Framework (EF) as a low-code way of accessing the database. Classes are defined in code, and the project codegens the database objects and updates them as the code changes. This is great for keeping the code simple, but since it allows for arbitrary queries (versus a database admin explicitly exposing a limited number of pre-optimized queries), the ASP.net developer has the responsibility to ensure that all data accesses are performant.

So, what does that mean?

A database stores its records in rows, and the rows are stored in some order. A "clustered index" defines this order. It organizes rows by the table's "primary key", which is one or more columns that are used to order the clustered index that contains the full rows. But sometimes you want to to query the table on a different column. You should then create a "non-clustered index" on all the columns that will be part of that seek. You should also "include" (as extra data, not necessarily part of the sort order) the other columns you want to retrieve with that lookup; otherwise your query will have to make a second lookup in the clustered index to gather the extra data fields. In a table with lots of text columns, that might be preferable to storing a second copy of the data in the index. It's a classic time/space tradeoff.

What are concrete recommendations?

When you create a query (which is often a C# linq-to-sql statement like "from...where...select"), understand which tables and fields are part of your "where" clause and check that an index exists on them. Each table can generally only use one index per query, so if you ware filtering TeamMembers for Team.ID and User.ID, you need an index on both columns combined (preferably including all the other columns in the table), rather than one index on each field, which was what existed originally, and which were both ignored for this query. [This turned out to be a bad example - the noted TeamMember fields are themselves classes. The statement to codegen an index doesn't like pulling index data from these sub-objects. I haven't found a workaround to place in PuzzleServerContext, so I had to manually edit the Migrations file to define this index, because once the table exists with the right columns, the index can be specified on those column names rather than the objects they came from. A simpler example is the PlayerInEvent object (the "swag" table) which has an index defined on PlayerId, EventId. ]
When you add a text column to a data class, only use the nvarchar(max) datatype if you will not use that field for index queries. The (max) type is the DB version of a reference type, as opposed to a value type. As the data is not inline in the table, it cannot be indexed, and incurs extra pointer dereferences to other data storage anyway. (Example: the guids storing WLIDs in the PuzzleUsers table were originally nvarchar(max), which meant that every player auth request would scan the entire table looking for the row with that string. It is always a 36-byte string, so that could be the field width. The table that sources that data define it as nvarchar(450) so we switched it to that since in theory other auth ID formats could be supported.
Only load data you really need. A few years ago, the puzzle page (Submissions) started defaulting to showing all previous team submissions. This turns out to be an expensive query, and we should consider hiding this data unless a user clicks to show it, as most of those queries are probably wasted.
Validate your changes to queries and tables by looking at a perf trace of the database afterwards, and observing the "query plan" (that's the diagram of which indexes are used and how they are joined).

How do I look at perf traces?

Just as a C# developer should be familiar with VS/VSCode, anyone working with SQL should be able to use its counterpart, SQL Server Management Studio (SSMS). Install this and use the connection string to connect to the puzzle database server. Expand the database tree node, then the folder "Query Store", and open the system view called Top Resource Consuming Queries. Due to the quirk that SSMS communicates with the server using SQL itself, this query is polluted with some SSMS chatter too, but you'll find the last few minutes of queries, which you can sort by CPU Time or Rows or Duration. You can click down the list of query text and see the query plan at the bottom. You will see (Non/)Clustered Index Seek when an index is being used, or Index Scan when it is not being used correctly. Seek is O(1), Scan is O(n) and generally requires obtaining locks too, which slow things down further.

Pedantic etymology note

In the most commonly accepted database parlance, "indexes" is the proper plural of the thing that is a list/catalog of things. "Indices" are single numbers that summarize a dataset, such as stock market indices (the DJIA and SPX, etc.)

Database Best Practices

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Home

Developer set-up

Development tips and guides

Authors: How to use the site

Admins: How to use the site

Deployment

Resources

Other pages

Clone this wiki locally