Monday, October 13, 2014

mysql prepared statements caching

PreparedStatement is cached inside the J2EE server connection pool manager. The J2EE server keeps a list of prepared statements for each database connection in the pool. When an application calls prepareStatement on a connection, the application server checks if that statement was previously prepared. If it was, the PreparedStatement object will be in the cache and this will be returned to the application. If not, the call is passed to the jdbc driver and the query/preparedstatement object is added in that connections cache.

We need a cache per connection because that's the way jdbc drivers work. Any preparedstatements returned are specific to that connection.

If we want to take advantage of this cache,We need to use parameterized queries so that they will match ones already prepared in the cache. Most application servers will allow you to tune the size of this prepared statement cache.

PHP/MYSQL

So there are good reasons to use prepared statements:
  • Save on query parsing
  • Save on data conversion and copying
  • Avoid SQL Injection
  • Save memory on handling blobs
There are also drawback of using prepared statements:
  1. Query cache does not work
  2. Extra server round trip required if statement used only once
  3. Not all statements can be prepared. So you can’t use prepared API exclusively you’ll need to fall back to normal API for some statements
  4. Newer and sometimes buggy code. I had a lot of problems with PHP prepared statements. It is getting better but still it is less mature than standard API
  5. You can’t use placeholders in place of all identifiers. For example you can’t use them for table name. In certain version it even does not work for LIMIT boundaries
  6. Inconvenient list handling. Unlike in for example PEAR emulated prepard statements there is no nice way to pass list of values to IN
  7. Harder tracing. Logs were now fixed to include full statement text not only “Execute” but in SHOW INNODB STATUS you would still see statements without actual values – quite inconvenient for analyses.
For a prepared statement executed via the binary protocol, comparison with statements in the query cache is based on the text of the statement after expansion of ? parameter markers. The statement is compared only with other cached statements that were executed via the binary protocol. That is, for query cache purposes, statements issued via the binary protocol are distinct from statements issued via the text protocol.

We can't re-use a mysql statement prepared during a previous request in php.

If  PHP application uses connection pooling to the database, and the database caches prepared statements, then yes, the caching will persist between pages. If the prepared statement caching is done by the client library, then this is more nebulous.

Reusing the same variable name in PHP won't invalidate the MySQL prepare "cache".

$stmt = $dbh->prepare("SELECT column_A FROM Table1 WHERE id=?");
$stmt->bindValue(1, $id, PDO::PARAM_INT);
$stmt->execute();

$stmt = $dbh->prepare("UPDATE Table2 SET column_B=? WHERE column_A=?");
$stmt->bindValue(1, $name);
$stmt->bindValue(2, $column_A);
$stmt->execute();

MYSQL persistent connection

Mysql server offer statement caching.

mysql> PREPARE stmt_name FROM "SELECT name FROM Country WHERE code = ?";
Query OK, 0 rows affected (0.09 sec)
Statement prepared


mysql> SET @test_parm = "FIN";
Query OK, 0 rows affected (0.00 sec)


mysql> EXECUTE stmt_name USING @test_parm;
+---------+
| name    |
+---------+
| Finland |
+---------+
1 row in set (0.03 sec)


mysql> DEALLOCATE PREPARE stmt_name;

Query OK, 0 rows affected (0.00 sec)

When a request is served php "cleans" the instance and frees resources and other variables. This is done in several steps. Since apache keeps the process alive after a request not all steps are executed and not all memory is freed. There is e.g. EG(persistent_list) which is used by mysql_pconnect(), pg_pconnect(), ... This list isn't emptied between requests as long as the process keeps alive (could be, depending on the actual implementation, but that would defy the purpose of EG(persistent_list)). If you use persistent connections your script might get a "re-used" connection established during a previous request.
To (re-)use a prepared statement directly you need the identifier for that statement (and that connection). When using (php-)postgresql this is simply a (connection-wise) unique string you pass to pg_execute(), so your script has no problem to gain access to the statement previously prepared by another instance (using the same connection).

Using mysqli or PDO-mysql you need a resource/object as statement identifier. That's kind of a problem since neither the mysqli nor the pdo extension seem to offer a way of storing the resource in EG(persist_list) between requests and you can't recreate it either, so it's seems impossible to re-use a mysql prepared statement directly.

But as we know we should not use persistent connection , here are some good points.

  1. When you lock a table, normally it is unlocked when the connection closes, but since persistent connections do not close, any tables you accidentally leave locked will remain locked, and the only way to unlock them is to wait for the connection to timeout or kill the process. 
  2. Normally temporary tables are dropped when the connection closes, but since persistent connections do not close, temporary tables aren't so temporary. If you do not explicitly drop temporary tables when you are done, that table will already exist for a new client reusing the same connection. The same problem occurs with setting session variables. 
  3. If PHP and MySQL are on the same server or local network, the connection time may be negligible, in which case there is no advantage to persistent connections.
  4. Apache does not work well with persistent connections. When it receives a request from a new client, instead of using one of the available children which already has a persistent connection open, it tends to spawn a new child, which must then open a new database connection. This causes excess processes which are just sleeping, wasting resources, and causing errors when you reach your maximum connections, plus it defeats any benefit of persistent connections.
Also Try to use it smartly. For example.

$ids = array(12,34,56);

  $sql = “SELECT * FROM address WHERE address_id = :address_id”;
  $databaseHandle = CustomDB::getDBH();
  $rows = array();
  $preparedStatementHandle = $databaseHandle->prepare($sql);
  foreach ($ids as $id) {
    $preparedStatementHandle->execute(array(‘address_id’ => $id));
    $result = $preparedStatementHandle->fetchAll();
    $rows = array_merge($rows, $result);
  }

It doesn't make sense when we can get gain using SELECT id FROM address WHERE address_id IN (12,34,56);


Prepared Statements are useful when they are related to session-long caching alongside a database wrapper that’s just good enough to programmatically generate most typical lookups.


Unfortunately, “good enough” caching looks complicated:


Example Queries:


  SELECT * FROM address WHERE address_id = 12;


  SELECT * FROM address WHERE address_id = 34;


  SELECT * FROM address WHERE address_id = 56;


Example Code, assuming PDO & PHP, to produce those queries using Prepared Statements with caching:


  $ids = array(12,34,56);


  $sql = “SELECT * FROM address WHERE address_id = :address_id”;


  $databaseHandle = CustomDB::getDBH();


  Cache::setStrategy(Cache_LRU::getLabel());


  $rows = array();


  if (!Zend_Registry::isRegistered(self::PREP_STMT_CACHE_KEY)) {


    Zend_Registry::set(self::PREP_STMT_CACHE_KEY, Cache::getInstance());


  }


  $preparedStatementHandle = Zend_Registry::get(self::PREP_STMT_CACHE_KEY)->get($sql);


  if (!$preparedStatementHandle) {


    $preparedStatementHandle = $databaseHandle->prepare($sql);


    // Use the sql itself as the index/hash


    Zend_Registry::get(self::PREP_STMT_CACHE_KEY)->set($sql, $preparedStatementHandle);


}


  foreach ($ids as $id) {


    $preparedStatementHandle->execute(array(‘address_id’ => $id));


    $result = $preparedStatementHandle->fetchAll();


    $rows = array_merge($rows, $result);


  } 

Prepared Statements fail
  1. Prepared Statements only exist for the current session, so holding onto a handle after a session closes will lead to failures
  2. Each Prepared Statement consumes a handle from the instance’s Prepared Statement pool, which is the “max_prepared_stmt_count”
  3. Out Of Memory on the client side
Case 1:
A connection closes while a cache of Prepared Statement Handles exists

Solution:
Update your PDO wrapping class to have a __destruct method defined to clear the relevant cache of Prepared Statement Handles before calling the parent’s destruct method.

Case 2:
The max_prepared_stmt_count value is reached on a database

Solution:
Immediately drop all local caches of Prepared Statements and try again. If there is still an issue, activate PDO’s ATTR_EMULATE_PREPARES flag to silently convert calls of ->prepare and ->exec into standard SQL Statements.

Case 3:
Out Of Memory (OOM) on the client-side.

Solution:
Reduce the Prepared Statement Handle Cache size. The cache does not have to be large if it is well managed. Even my company’s complicated webapp’s web requests do not fill a 200-statement-long FIFO cache.

Remaining Points:
  • Monitor (ie: Nagios) Prepared_stmt_count vs max_prepared_stmt_count
  • Monitor (ie: StatsD) the Prepared Statement Handle Cache hit, miss, and purge rate.
  • An LRU’s extra minimal overhead is only worthwhile over a simple FIFO if your Prepared Statement Handle Cache is too small for all the queries that should be cached.
  • Note: Your cache should be small due to the unbounded and invisible memory consumption of Prepared Statement Handles on the database server
  • A best-case Prepared Statement instantiation against localhost with a simple select costs me, on average, about 300 microseconds. Pulling a handle from a cache is about 6 microseconds.
  • Coworkers have shown me that long lists of Named Parameters (ie: “:id1, :id2, :id3, [...]“) get more expensive with quantity whereas long lists of Ordered Parameters (ie: “?,?,?,[...]“) remain cheap even in large number. Numerically quantifying this slowdown will be a future post.
  • Ordered Parameters’ values are not decoded in SHOW PROCESSLIST. Named Parameters’ values are displayed, however, which makes them, to me, far preferable. 



No comments:

Post a Comment