May 6, 2008

Fun with generate_series

A few months ago I attended the PostgreSQL conference in Portland, OR. There were a lot of talks ranging from hard-core stuff like Neil Conway’s talk about the internals of query execution, to random fun stuff like David Fetter’s discussion of procedural languages, including LOLCODE.

gulakuno.jpg

During their talks, a few people mentioned a handy function called generate_series. It took me a while to discover how useful this function really is. I thought I’d post an example. Here goes…

Let’s say that you have a table with sales information:

postgres=# select * from sales order by date;
    date    | sales_person | part_number
------------+--------------+-------------
 2008-05-05 | Glenn        |           1
 2008-05-05 | Shahaf       |           1
 2008-05-06 | Mike         |           1
 2008-05-06 | Mike         |           2
 2008-05-08 | Glenn        |           1
 2008-05-08 | Shahaf       |           1
 2008-05-08 | Mike         |           2
 2008-05-09 | Mike         |           1
 2008-05-09 | Glenn        |           1
(9 rows)

You might want to get an idea of how many sales happened on each day. You could try to do it with a query like this:

postgres=# select date, count(*) from sales group by date order by date;
    date    | count
------------+-------
 2008-05-05 |     2
 2008-05-06 |     2
 2008-05-08 |     3
 2008-05-09 |     2
(4 rows)

This basically works, but it hides one important fact — no sales happened on May 7.

To fix this, we can use generate_series. When you run this function normally, it just generates a series of numbers:

postgres=# select generate_series(0,4);
generate_series
-----------------
0
1
2
3
4
(5 rows)

However, you can easily change it to generate a series of dates:

postgres=# select generate_series(0,4) + date '2008-05-05' as date;
date
------------
2008-05-05
2008-05-06
2008-05-07
2008-05-08
2008-05-09
(5 rows)

Once you have this, you can now join against the sales table to generate the report:

postgres=# select series.date, count(sales.date)
from (select generate_series(0,4) + date '2008-05-05' as date) as series
left outer join sales on series.date=sales.date group by series.date
order by series.date;
    date    | count
------------+-------
 2008-05-05 |     2
 2008-05-06 |     2
 2008-05-07 |     0
 2008-05-08 |     3
 2008-05-09 |     2
(5 rows)

The trick here is to do a left-outer join betwee the date sequence and the sales, and to count the rows that have a non-null sales date.

This is by no means the only use of generate_series, it’s just the most recent use I found. If you know of other ways to use this function, or if you know of other handy functions, drop a note below.

Image credits: Electric Vehicle Guide


March 19, 2008

Syndicate Redfin Listings in WordPress

Our very own PM, Matt Goyer, recently setup his Seattle condos WordPress blog to include a syndication of new Redfin listings. We thought it would be a great idea to share how we did this.

Required Plugins

Steps to Follow

  1. Install the plugins listed above
  2. Create a writable cache folder if you haven’t already (wp-content/cache)
    • This problem is solved if you use WP-Cache
  3. Perform a search on Redfin
    • To syndicate new listings, make sure you select the “New Listings” option (pictured below)
  4. Once you are satisfied with your search, click the “Save this Search” link above the map
  5. In the pop-up that will appear, click the “RSS 2.0 Feed” link (also pictured below)
  6. Follow the SimplePie Plugin usage instructions to include the Redfin RSS feed

new_listings_box.PNG

rss_link1.PNG

Here’s an example of how syndication looks on Matt’s blog:

matts_example.PNG

You’ll have to fiddle with the SimplePie templates and settings to get syndication to look like Matt’s.  Again, refer to their usage instructions on how to do this.


March 10, 2008

Plumbing Upgrade: We’re Using Bricolage For Content Management

logo.png Believe it or not, but up until the beginning of March this year, we were not using a Content Management System (CMS). This post will take you through the reasons for needing a CMS, along with the process we went through to choose a CMS.

We needed a CMS for the following reasons:

  • We used to have to release a whole new version of our website to include content changes. With the CMS we’re able to just release content, without having to change any code at all.
  • We used to have a pseudo-CMS that required a lot of coordination at deploy time between staging and production. Moreover, we needed to be very careful that production changes weren’t overwritten when staging changes were copied over at deploy time.
  • Our developer bug lists used to be cluttered with content bugs.
  • Our marketing personnel could not directly change content, because they don’t have a build on their local machines and they don’t know HTML. Instead they had to hand off a content, layout, and design spec to a developer.

With all of these pain points in mind, we created some requirements:

  • Customizable workflow
    • Scenario: marketing submits content, PM approves content, then PM schedules the content for a release
  • Input forms for content contributers
    • Example: a developer defines a form schema for a particular abstract data type (e.g. a real estate agent or a press release)
    • A marketing contributor would then fill out this form without having to write any HTML
  • Output templates and static file generation
    • When content is submitted, templates get triggered to convert the abstract content into static files (HTML, XML, JSP, etc)
  • Staging preview capabilities
  • Mutli-site delivery
    • Some of our content goes to www.redfin.com and some goes to seattle.redfin.com, sfbay.redfin.com, etc
    • This content should all be housed in one repository
  • Multi-server deployment
  • URL Control
  • Versioning, release branches, and rollbacks

We chose some open source candidates:

We started off by looking at open source platforms, and we considered an insane amount of platforms by looking through a cms matrix. After some preliminary research, we narrowed our scope to a few candidates:

We found out right off the bat that Alfresco was the only enterprise-level CMS of the group. Drupal, Joomla, Mambo, and Plone were all lacking static content generation and URL control. They were also not capable of deploying to multiple web servers, hosting multiple sites. After a lot of fiddling, reading, and talking to support, we learned that Alfresco was not able to output static files in certain cases. For example, assume we have a collection of press release HTML files and a single page that lists each of these press releases. Alfresco required us to write our own JSP code using their API to create a dynamic list of press releases - they didn’t provide a template engine that could create a static list. We later learned that Alfresco is now able to create static files in cases such as the above, but we learned about this too late in the game.

We considered some proprietary candidates:

At this point we started researching proprietary CMS platforms such as Interwoven, RedDot, and CrownPeak, only to find that most of these services are insanely expensive. We needed something open source, but we thought we had considered all of them. During lunch one day I began ranting about my open source CMS frustration, when out of nowhere our Linux/data center operations magician says, “Why don’t you use Bricolage?” We took a look at Bricolage.

We found Bricolage!

It turns out that Bricolage has most of what we want. It uses a Mason templating engine that allows for static file creation, and it basically meets most of our other requirements.

Here’s a list of the things that it does very well:

  • Bricolage’s data modeling methods are insanely powerful. Not only can you define different types of pages (they’re called stories in Bricolage), but you can also create HTML components (sub elements in Bricolage) that can be included in pages in any way. For example, you can define a bulleted list, a link, etc, which makes data entry for non-technical folks very easy.
  • Similarly, Bricolage’s templating engine is insanely powerful. Each sub element only needs one template, and templates can be strategically chained to allow for very efficient template design.
  • Deployment is simple and powerful. Bricolage offers a well-documented API that makes deploying to multiple sites on multiple servers seamless.
  • Bricolage allows you to control URLs for each story by either putting the story in a particular category or by changing the slug of the story.

Here’s a list of the things that it does not do well:

  • Bricolage’s admin panel is totally usable but very clunky.
  • The mechanism for working with different branches (for example, one branch of production content and another for soon-to-be released content) is unnatural. Instead of being able to commit changes into different branches, you have to create a network of cloned stories and bucket these clones in different categories (ex: / is trunk, /dev/2.3 is 2.3_dev_branch, etc).
  • Bricolage is mostly geared for managing story-based sites such as MacCentral, so sometimes its naming conventions are confusing for non-news sites such as Redfin.
  • While the API is fairly well documented, discussions of the larger concepts and best practices are spread over multiple sites, mailing lists and power point presentations. This makes the learning curve fairly steep.

We’re very happy with our choice to use Bricolage. Our website is much more agile, and our engineering team is very happy to have our marketing team take over the deployment of new content.

Expect a post very soon talking about the details of how we implemented Bricolage.


March 4, 2008

How to search Redfin directly from IE and Firefox

I recently heard that some loyal Redfin customers were using Google to do address searches. That’s a shame, since Redfin does a pretty decent job of searching for addresses, MLS IDs, cities, etc. I wanted to see if I could help those power users get to Redfin search results with fewer clicks.

Naturally, I wanted to see if I could make a Redfin search available. It was surprisingly easy- I didn’t have to change any Redfin code. I just wrote an XML file, hosted it on a Web server, and put a link into an HTML page (this page, in fact!)

If you’re using Firefox or Internet Explorer 7, you should be able to enable Redfin search by choosing the relevant option in the search dropdown while viewing this page. Here’s a screenshot of Firefox:

add redfin search to firefox

Here’s how it looks in IE7:

add redfin search to ie7

Once you’ve added Redfin, you can do address searches on Redfin using the search control:
search with redfin

Another way to do searches in Redfin is to use a keyword bookmark. In Firefox, make a new bookmark, and edit it to look like this:

redfin keyword bookmark

The important parts are the location and the keyword. The location got cut off in my screenshot, but it should be http://www.redfin.com/stingray/do/listings-search#search_location=%s

(the only tricky part is the “%s” part, which will get replaced with whatever you search for.)

To use it, you type the keyword and the search terms into your location bar, like this:

search for headquarters

or this (for an MLS):

search for mls


February 13, 2008

MySQL to Postgres

ramona1.JPG

A couple of weeks ago we released an upgrade to the Redfin site. Some of the changes were obvious: we now support Safari, we show polygons for different region types on the map (cities, neighborhoods, zip codes, etc.), we show price history on the listing details page, we allow users to download search results as an EXCEL sheet, and so on. And then there are a bunch of changes that are not so obvious, things that happened behind the scenes. Most notably — we switched our database from MySQL to Postgres.

For a website, the database is basically the lowest layer in the stack, so changing it is not a simple thing to do. My colleague Michael Smedberg wrote about the “toe dip” we took in this direction in a prior release: we still used MySQL as our “gold” database, but we replicated some of the data to a Postgres DB in order to support certain past-sale queries that were underperforming on MySQL. When that effort proved successful, we decided to go full throttle and make the switch to Postgres 100%, which is what we just did.

I can say that so far we’re pretty happy. Perhaps the two most important benefits we’ve gained are: 1. with PostGIS we can support true polygon searches, and 2. Postgres’ row-level multi-value concurrency control (MVCC) means that we can run “heavy” import operations on a DB that is serving live queries without worrying about locking or performance loss. And of course there are other benefits as well.

But in this blog post I thought I’d talk about the gotchas we faced along the way as we transitioned from MySQL to Postgres. If you’re contemplating your own switch, hopefully this will save you some time. Here goes…


MIGRATING THE DATA

dance_with_me_by_ramona.JPG

Some of our first steps in migrating to Postgres were to define the new Postgres schema, to create an empty database with this schema, and to migrate data over from a the MySQL database.

In doing this, we discovered that the data types supported by MySQL are not all supported by Postgres. For example: double, tinyint, enum, and datetime don’t exist at all. There are often reasonable alternatives (e.g. timestamp instead of datetime), but in some cases you have to resort to a data type that is not quite the same (e.g. smallint instead of tinyint). Note: we are using Postgres version 8.2; some of these types (e.g. enum) are available in the recently released version 8.3.

So, in migrating our data we had to do some translation/mapping of our data into the new schema. We figured that this was also a golden opportunity to do some additional cleanup: to remove unused tables, to remove unused columns, to name columns more consistently, to add some constraints, to remove duplicate data, etc.

We needed a migration tool that was flexible enough to handle all these needs, and we ended up writing our own — a java program we call MySqlToPostgres. We considered other alternatives, such as using the Postgres COPY command to bulk load from file. But our own migration tool ended up being a good choice given all the transformations we wanted to do in-transit.

If you’re going to write your own tool, here are a few tips to make the migration faster:

  1. Make sure to start with a target database that has tables but has no indexes or unique constraints; add those only after you migrate your data, it makes a huge difference.
  2. Migrate the data using a few big transactions, instead of many small transactions. In our case, we issued a “begin” before inserting any rows into a given table and a “commit” after the last row was inserted. In between, we issued giant “insert” statements, each of which contained many rows. Here’s an example of a single insert statement with two rows:
     INSERT INTO films (code, title, did, date_prod, kind) VALUES
         ('B6717', 'Tampopo', 110, '1985-02-10', 'Comedy'),
         ('HG120', 'The Dinner Game', 140, DEFAULT, 'Comedy');
  3. There’s a setting in the postgresql.conf file you can use to log slow queries. Normally we set it to 200, so that we’ll know about queries that take more than 200 milliseconds: log_min_duration_statement=200. However, while doing our migration, it was important to disable this option by setting it to 0. Otherwise our slow query log grew very large with all those multi-row insert statements.

Most of our data is stored in “regular” columns: numeric, string, etc. However, we have one table with some binary data, and this data proved particularly challenging. The Postgres documentation tells you how to format binary data as a string in an insert statement. The trick is that while MySQL and Postgres both treat a byte as an unsigned value between 0 and 255, Java treats bytes as signed values between -127 and 128. That means that if you write your own Java migration tool, you have to do some tricky arithmetic with the “bytes” that Java gives you to convert them into the kind of bytes that Postgres expects.


CONNECTIONS

ardila1.JPG

With most databases, you’re allowed to connect to one database and then issue the use command to start dealing with a different database. Not so with Postgres. When you establish a connection to Postgres it’s to a specific database, and the “use” command doesn’t exist. This also means that you can’t issue queries that span multiple databases.

During the development cycle, especially in tests, it’s helpful to create a fresh database, do some stuff with it, and then drop it. Most databases will prevent you from dropping the database if there are users connected to it, but Postgres is particularly draconian about it. There’s no easy way to say “just drop it anyways and kill all connections”. So, you have to be careful to close all your DB connections at the end of the test, before trying to clean up the DB. Of course, this is good practice anyways, but it’s sometimes tricky to discover who’s holding onto that one connection.

Postgres has a setting that defines the maximum number of connections allowed to the DB: max_connections. By default, this is set to 100. You need to make sure that this value is big enough to support your worst-case situation. In particular, your website probably has some connection pooling mechanism on each web server, and this connection pool probably has a cap. You need to make sure that the DB can handle the case of each web server asking for the max number of connections. You should also check whether your connection pool is configured to reap unused connections, although you shouldn’t count on reaping to keep you below the worst-case scenario. If you find that there’s a mismatch, there are two things you can do: reduce the connection pool sizes all around, or increase the DB connections cap. If you want to do the latter, be aware that this will at least require restarting the DB. In some cases it may even require restarting the server completely. So, it’s best to decide ahead of time how many connections you’ll need in the worst case and to configure all settings accordingly.


SEQUENCES

halloween_by_arum.JPG

There are many situations where it’s convenient to have the database generate ID’s for new records. In MySQL, this is a pretty simple matter: you set the AUTO_INCREMENT attribute on your column and it automatically picks up the next value.

With Postgres it’s a little trickier. The AUTO_INCREMENT keyword doesn’t exist. Instead you have sequences. You can create a sequence and call nextval(’my_sequence_name’) to retrieve the next value. You can also associate this sequence with a column by specifying that the default value for the column is nextval(’my_sequence_name’). There’s also a shortcut for defining all this: serial.

Having these independent sequences in Postgres gives you the flexibility to do interesting things, like to associate the same sequence with multiple tables. But in practice we haven’t found this to be particularly useful. Also, it was a little more challenging to configure our O-R mapping layer, hibernate, to recognize and use these sequences. It involved adorning our classes with the @SequenceGenerator tag, and our ID columns with the @GeneratedValue tag.

Finally, it was important to explicitly initialize these sequences to the next available value after migrating data from MySQL to Postgres.


CASING

desi3.JPG

In MySQL, and in may other databases, string comparisons are case insensitive. For example, on MySQL you would get one result for each of the following queries:

 create table names(firstname varchar(10));
 insert into names values ('Shahaf');
 select * from names where firstname='Shahaf';
 select * from names where firstname='SHAHAF';
 select * from names where firstname='sHaHaF';

Postgres, on the other hand, is case-sensitive. That means that only the first query would return a result.

After some research, we found three ways to deal with this problem:

  1. Use the upper() or lower() Postgres functions to normalize casing before comparison
  2. Use the case-insensitive regular expression operator: ~*
  3. Use the case-insensitive string-matching operator: ilike

The problem with #2 and #3 is that they don’t support index usage in some or all of the cases. The only way to get index usage (and therefore better performance) is to go with the upper() or lower() approach. By the way, we didn’t find any difference between upper() and lower(), so we arbitrarily went with upper().

Of course, it’s a pain to have to actually store your data in all upper-case. You might want to display it with normal casing, and you don’t want to have to store two copies. Luckily, you can store your data using normal casing, but create a functional index on the upper()-cased values. For example:

 create index employees_name on employees(upper(name));

After that, you just need to remember to upper-case any arguments you get, and to always construct queries that compare against upper(name). If you accidentally generate a query that compares against name (without the upper()), then the index will go unused and you’ll see bad performance.


TEXT SEARCH

rice_field_by_desi.JPG

Text search is one of those areas where MySQL has a basic search solution that works and requires little fuss, whereas Postgres has a myriad of solutions that require more TLC. One of these solutions, TSearch2, is now included in the 8.3 version of Postgres, so perhaps it’s becoming a kind of standard.

TSearch2 is built completely on top of Postgres. To use it, you need to add a set of objects to your database: functions, tables, etc. If you want to be able to search some text columns in a given table, you need to add a new column to that table, a column of type tsvector. You need to bulk-populate it using a tsearch2 function with the raw data from your string columns. And you need to define some triggers to keep this tsvector column updated whenever the source columns change. You can then define an index on this tsvector column, and start issuing text search queries, again using some special tsearch2 functions.

We’ve had reasonable success with TSearch2 so far, but we haven’t exploited all its functionality so far, and we’re still considering other approaches (e.g. Lucene).


REPLICATION

elephantart2.JPG

Replication is much like text search: MySQL has a basic replication solution that works, and Postgres has a bunch of competing solutions that are far more complex to implement. However, as with TSearch2, there’s one solution that’s becoming a kind of standard and is now included in the 8.3 version of Postgres: Slony.

Slony is a single-master, multi-slave, asynchronous replication system. In other words, you update your data in one master DB only, but you can query the data from the master or from the slaves. Just beware that the data on the slaves may be a little stale. In that regard, Slony is very similar to MySQL replication. However, the similarities pretty much end there.

MySQL replication is a built-in part of the database, and it uses its own binary logs to record what changes need to be replicated to which slave. Slony is built completely on top of the Postgres database. It creates a set of tables and triggers in your master database to keep track of which changes need to be replicated to which slave. So, in a way, Slony “pollutes” your schema. If you take a database snapshot, you now have all your tables plus a bunch of Slony stuff. On the other hand, Slony also creates triggers on the slaves to prevent you from accidentally making changes there, something that MySQL will happily let you do.

MySQL replication assumes that you want all databases on the master server to be replicated to each of the slave servers. Slony takes a more fine-grained approach. You need to explicitly specify which set of tables from which database on the master you want replicated to which target slave database. You don’t have to include all tables, and the slave database can live on the same machine as the master if you want.

MySQL replication automatically includes all kinds of changes: inserts, updated, deletes, truncates, table creation, index creation, etc. Slony is far more limited. It can only
replicate changes to tables and sequences. If you have schema changes to make, you need to orchestrate them carefully such that they happen on all databases at the same point in the transaction order.

Slony has it’s own SQL-like language that is consumed by an interpreter called slonik. So, you have to invest a little bit of time learning this language and how to use it to define all the “first-class citizens” of Slony replication: clusters, nodes, sets, subscriptions, etc.

By the way, we found it very convenient to add a table used strictly to track replication latency:

 create table replication_heartbeat(
   id int primary key,
   last_updated timestamp
 );
 insert into replication_heartbeat values(1, now());

We set up a cron job that updates the timestamp on the master every minute, and we set up Nagios alerts both on the master and on the slaves to tell us if the timestamp is older than some threshold. There are various other tools out there to monitor Slony, but we found this scheme to be simple and effective.

Once you have replication set up, watch out for big transactions on the master database. For example, let’s say you issue this query:

 update my_big_table set my_column=null where my_column=0

This statement may take a while to run. Until it completes, no changes will be replicated to the slaves. Once it does complete, though, the changes will be replicated to the slaves one row at a time:

 update my_big_table set my_column=null where id=0;
 update my_big_table set my_column=null where id=1;
 etc.

This takes much longer, and it can cause your replication to fall behind.

Also, be careful to match the horsepower of your master and slave databases. If you have a beefy master database server and a dud slave server, the slave may fall behind as soon as you apply any “significant” changes on the master.


TOOLS AND DOCUMENTATION

make_me_fly_seng_wong.JPG

With MySQL we used the free command-line tool (mysql) as well as a commercial GUI tool (SQL Yog). Both worked well for our use-cases. In particular, it was nice to be able to change data directly in SQL Yog’s result view, much like you would in EXCEL.

Postgres comes with a similar command-line tool: psql. It’s very similar to the mysql command-line tool, but it has it’s own esoteric syntax for various things that simply takes time to learn. For example, instead of “show tables” you use “\d”.

We haven’t found a good replacement SQL Yog yet. We’ve used one free tool, pgAdmin III, but we found that it’s not quite as good. It doesn’t let you edit data in the result view, it tends to open a lot of windows, and it tends to keep database connections open more than it should.

We’ve also found that Postgres documentation is not quite as good as MySQL’s. We’ve gotten by so far, but we definitely have to do more Google’ing for various details that are not fully explained in the core Postgres docs.


So there you have it. I hope these tips help you out.

If you also made the switch to Postgres, we’d love to hear about your experience. Did you run into the same issues? Are there others I forgot to mention? Have you found any useful Postgres tools that we should know about? If you considered Postgres but decided against it, we’d love to know your reasons.

Image credits: Bali Adventure Tours


January 31, 2008

Greasemonkey Hack To Show Walk Score

A number of users have asked us to add a Walk Score widget to our details page and unfortunately we’ve been busy working on a few other things.

Fortunately a crafty Redfin Forums user created a Greasemonkey script for Firefox to add a link to Walk Score on our listing detail’s pages.

Walk Score

To install the widget:

  1. Install the Firefox plugin Greasemonkey
  2. Restart Firefox
  3. Install the Redfin Walkscore Greasemonkey script
  4. Find some Redfin listings and click the link that appears below the address

What other Greasemonkey scripts would be interesting?


November 5, 2007

Elephant versus Dolphin: Which is Faster? Which is Smarter?

Elephant vs.MySQL

Redfin recently switched some of our backend DB infrastructure from MySQL to Postgres, and we plan to wholly switch to Postgres in the near future. This wasn’t an easy decision; MySQL has a lot going for it, and switching has been a lot of work. However, we’ve already seen major benefits from choosing Postgres, and we expect to see more as we complete our transition. In particular, performance on certain geographic queries has improved dramatically.
A simple Google search shows that a lot of people have already opined about MySQL versus Postgres (e.g. here, here, here, here, here, and here) but we weren’t able to find much information that applied directly to the problem we were having. Specifically, we were having some major performance problems with queries that were constrained by both spatial and numeric columns, and all of our attempts to squeeze more performance out of MySQL (including hiring expensive outside consultants) had come to naught.

GIS Indexes

Redfin is an online real estate company, and our map based UI is the most-used part of our web site (as well as being the biggest performance hog.) When a user views the map, we use SQL to find the relevant listings or past sales. Users typically constrain a search by numerous criteria, such as maximum price or minimum square footage. Since the UI is map based, users are ALWAYS constraining by geography, though that constraint might be weak.

How We Did It In MySQL

In MySQL, the queries might look something like:

SELECT
    *
FROM
    listings
WHERE
    price <= 400000 AND
    num_bedrooms >= 2 AND
    num_bathrooms >= 1.5 AND
    type = 'condo' AND
    MBRContains(GeomFromText('POLYGON((X1 Y1,X1 Y2,X2 Y2,X2 Y1,X1 Y1))'), centroid_col)
LIMIT
    101

where X1/Y1 and X2/Y2 are lat/long pairs that describe the region to be searched. To improve performance, we create indexes on the relevant columns. In MySQL, a normal index cannot include spatial columns, and spatial indexes cannot include normal columns. In this example, we might have one multi-column index on price, num_bedrooms, and num_bathrooms, and another single-column index on centroid_col. In many cases, this performs great. Examples include:

  • When the table is small (we have hundreds of thousands of listings, but tens of millions of past sales records)
  • When the geographic constraint is very selective (i.e. when the map is zoomed very far in)
  • When the geographic constraint is the only constraint (i.e. the user doesn’t care about price, bedrooms, etc.)
  • When the constraints are poor, but the LIMIT amount is hit quickly (e.g. search for all listings in the the world; MySQL can quickly find the first 101 rows in the table, and once it’s found 101, it can give up)

However, there were also cases where it performed terribly, particularly when the table was big, the geographic constraints were relatively weak, and other constraints were relatively strong. For example, a search for all past sales in the San Francisco Bay Area that had 1 bedroom, but sold for over $10,000,000 resulted in a “killer” query. This is a little counterintuitive, but was definitely a problem for our customers (though my example is a very extreme case.) The problem with this query is that:

  • The table is large (tens of millions of rows)
  • The geographic index is the best index to use, but still isn’t great (might return 500,000 rows, or ~1% of the table)
  • MySQL would “short circuit” the query when 101 records were found, but the query returns less than 101 records (there are few 1 bedroom condos that sold for more than $10M), so MySQL examines all 500,000 rows that match the geographic constraint

This does happen in real life.
For example, a user might be looking at homes in a small neighborhood. She’s looking for a 2 bedroom condo between $350k and $375k with a view (a fairly heavily constrainted query.) Then she zooms the map out a few levels (maybe she wants to see a lot of the map to pick out other neighborhoods of interest.) She has just unwittingly made a killer query- she’s searching a large geography with tight constraints on other attributes.
Another example is an investor- someone who wants to search large geographic areas for “fixer” properties that have a low asking price and large living area. Again, this results in a query that’s tightly constrained by some criteria, but relatively loosely constrained by geography.

Postgres and PostGIS

Jeff Yee, our intrepid head of QA, pointed out that geographic indexes in Postgres are supported through the feature-rich PostGIS plug-in. PostGIS supports all sorts of goodies (such as polygon containment, distance calculations, projection conversion, etc.), but the biggest gain is support for indexes on multiple, mixed-type columns. Using PostGIS, we could create an index on centroid_col, price, and num_bedrooms. These indexes turned many of our “killer” queries into pussycats. It was immediately obvious that for Redfin, PostGIS is a Very Good Thing. PostGIS offers us more than just a huge performance improvement and robust, sophisticated geographic functionality. It also offers an active community- there are lots of users available to answer silly newbie questions, and the software is being actively developed. On top of that, there’s a great Windows installer.

Other Considerations

MyISAM and Data Corruption

In MySQL, our tables were MyISAM, since the geometric indexes we used were only supported on MyISAM tables. MyISAM generally offers very good performance, but unfortunately we’ve experienced data corruption on our production systems a number of times. It’s VERY painful, but we can live with occasional corruption if that’s the only way to deliver the performance we seek. PostGIS has given us another option, and we expect the advanced locking and data protection in Postgres to make data corruption a thing of the past.

Replication

We use a “single master-multiple slave” configuration in production, which requires data replication. The MySQL replication options are not super flexible, but they did exactly what we needed them to do, and they did it really well. Replication was easy to set up, easy to monitor, and proved to be very reliable. In Postgres, we had more options, and more confusion. It took us a while to work out exactly how we would do replication; validating and implementing that plan took considerable effort. It’s in production now, and it is working fine, but it was certainly a lot more effort than in MySQL. There’s also an ongoing cost- replicating DDL changes is more complicated under Postgres than it was under MySQL.

Advanced Features

Advanced PostGIS features such as polygon matching and distance calculation have already helped us move much more quickly on Redfin features. Most of these things CAN be done in MySQL (e.g. by post-processing query results in Java using the excellent JTS Topology Suite library from Vivid Solutions), but it’s significantly more work, and in some cases would degrade performance. Hopefully, you’ll see new Redfin features in the near future, and think to yourself “Aah, they’re making PostGIS do the heavy lifting- the lazy bastards.” Postgres also contains advanced features that we were able to immediately benefit from. In particular, we use the CLUSTER command to optimize our table for access via the multicolumn geographic index.

Conclusion

Switching to Postgres was a lot of work. This was compounded by the fact that we chose to “toe-dip” into Postgres- most of our tables are still in MySQL- so our Java code is cluttered with logic to choose the correct DB connection for each query, to construct the “correct” SQL for each DB (most Redfin developers were not required to use Postgres during the development cycle, and we wanted to be able to fall back to MySQL if Postgres turned out to be a disaster), etc. We use Hibernate for persistence, which added another layer of complexity. However, when I see the performance gains we’ve made, I know it’s all worth while. The best cases probably aren’t much better, but the worst cases are startlingly better. Postgres and PostGIS let me feel good about telling my friends to use “past sales” searches on Redfin- I’m confident they won’t be waiting long for their results!
Dolphins may be smarter than elephants, but in the end, elephants are domesticable and can carry a heavy load.
elephant_lift


October 22, 2007

West by West North

North by Northwest

Listing alerts is a Redfin feature that lets you keep track of new listings in your area. You start by doing a search on our map page, and then you click on the “Save this Search” link:

Save this Search

This lets you assign a name to your search and save it. You can also sign up to get a daily email with new listings that match your criteria.

We recently released an upgrade to the Redfin website. There were a few minor tweaks to the listing alerts code, but nothing significant. Still, a few days later we got an email from a customer saying (to paraphrase): I used to get an email alert almost every day, but recently they stopped, what’s up?

Being the owner of the Listing Alerts feature, my heart skipped a couple of beats. Then I jumped in to analyze the problem. What I noticed was that indeed this customer (Norm) had a steady stream of emails that mysteriously ended. However, the emails didn’t stop on the day of the release; they stopped several days later. I also noticed that Norm made changes to his saved searches right around the time the emails stopped. So, my hypothesis was that Norm’s new searches were somehow more restrictive and we simply didn’t have results for them. To verify this, I did a test run using two of Norm’s searches, an old search and a new search, for the last 24 hours of data. I found that the old search returned results and the new one didn’t, which seemed to confirm my suspicion. So I went ahead and replied to Norm saying: please relax your criteria and try again. Luckily for me, Norm is a stubborn guy. He replied a few minutes later saying: actually, I think my new search is more broad than the previous one.

Now I was really worried. I took another look and I found the smoking gun: for all saved searches created since our recent release, the search area parameter was not saved correctly. To explain what was wrong about it, I need to give some background on how we store your saved search…

Our saved_searches table was changed a little in the recent release. Prior to the release, it looked like this:

[saved_search_id] [name] [min_latitude] [max_latitude] [min_longitude] [max_longitude] [other criteria…]

After the release, it looked like this:

[saved_search_id] [name] [polygon] [other criteria…]

Why did we make this change? Well, this release was also Redfin’s “toe dip” towards changing our database platform from MySQL to Postgres. We found that Postgres gives us better performance on certain kinds of searches (e.g. past-sales searches). So, in this release we replicated our past sales data from MySQL to Postgres and changed the website to issue all past-sales queries to Postgres. Now, to get the performance benefits from Postgres, you have to store your regions in a column of type GEOMETRY, as opposed to simple columns of type DOUBLE or whatnot. So, we wrote the code to store and fetch GEOMETRY types. And once we had it, we figured we may as well use it for saved searches as well — the search area is just a POLYGON after all.

So where’s the problem? The problem is that while in MySQL coordinates are typically stored as (latitude, longitude), Postgres prefers to store them as (longitude, latitude). Or, to be more specific, Postgres doesn’t particularly care which order you use, but the Java library we used (Vivid Solutions) opts for the (longitude, latitude) style. Why does it do that? Probably because it thinks about these coordinates as points in X-Y space, where X is like the longitude (east/west = right/left), and Y is the latitude (north/south = up/down).

We actually noticed this discrepancy when we were close to our code-complete milestone, and we had to make a decision about it. One option was to standardize — to make all coordinates in both databases follow the same (longitude, latitude) style, the style we will eventually adopt when we move to Postgres 100%. Another option was to keep keep MySQL coordinates (latitude, longitude) and to only use (longitude, latitude) in Postgres. We chose the latter approach because it was less disruptive and we were running out of time. Only we didn’t get it quite right — we missed the code used for the saved searches table. The effect was that all existing saved searches kept their (latitude, longitude) format and continued to work, but all new searches were saved as (longitude, latitude) and were busted.

How did we miss this in our testing? I will be the first to admit that we don’t do as much testing at Redfin as we could or should. However, we did have about 40 unit tests and 70 integration tests, and they definitely included scenarios that checked listings that are in/out of the search area. The trouble was that our test cases were too “symmetrical”. This is best explained by an illustration:

Symmetric and Asymmetric Test Cases

Well, when we discovered this problem, we spent the first few minutes being horribly embarrassed for letting it slip through. Then we buckled down, found a solution, and released it later that day. We also added a bunch more test cases to catch the problem in the future. And we sent an email to all the affected users with our apologies for the missed emails. At this point we believe listing alerts work again.

So, what are the lessons?

Strive for consistency in your data, even if it’s more painful and/or takes more time.

Avoid dual-DB situations. Both MySQL and Postgres support the SQL standard, but there are enough differences between them to make life difficult, and this lat/long issue is just one example. Now, we chose to take a toe-dip, and if we had to do it over again I think I would have made the same decision. In other words, do as I say, not as I do. :)

Pick asymmetric test cases. Discussed above.

Add constraints. Our database didn’t complain when the coordinate (47.5, -122), which is in Seattle, was rewritten as (-122, 47). Take a moment to think about that. A longitude of 47 is probably somewhere around India. A latitude of -122 doesn’t exist: it’s beyond the south pole! Why didn’t the database complain? Because as far as the database is concerned, this is just a point, and it’s a perfectly valid point. If you can afford the performance hit, it helps to add constraints to tell the database: in *my* application, a Y value of -122 is no good.

Don’t repeat yourself. The search area for a saved search was actually stored in two formats for two purposes. The first purpose was for evaluating the saved searches for the purpose of sending emails, as discussed above. The second purpose for interactive use – where you can simply pull up your saved search on the website and click “run search”. In the second case, we chose not to use the bounding box because your view may be a little different each time you pull it up, depending on the size of your screen or whether the window is maximized. For interactive use, it’s better to use the center point of your search and a zoom level – this combination can adjust more gracefully to different situations. The problem in our case was that we separately stored the latitude and longitude for interactive use. Had we simply derived them from the polygon’s center, we would have caught this error earlier.

Image credits: http://www.slashfilm.com/2007/02/20/movies-that-should-have-won-an-academy-award-but-didnt/


October 22, 2007

Do Crunches Really Work?

Last week, Microsoft released a beta version of a tool called MapCruncher. MapCruncher allows you to take a rendered image (GIF, PNG, JPG) and create a set of map tile layers that can be then drawn over a Virtual Earth (VE) map with the correct projection.

BART Overlay Test ScreenshotSince Redfin’s map is based on VE’s platform, I spent a couple hours using MapCruncher on a test project to see what we could potentially do with it. My project was to take an image of the BART routes in the Bay Area and place it on the map.

It worked out pretty well. A screenshot of the resulting mashup map that was created, or you can just check out the resulting mashup yourself.

What did I learn about MapCruncher?

  1. It’s really easy to use. It provides a side-by-side view of your desired overlay image and a live VE map view. You simply line up landmarks in the cross hairs of each view and mark it as the same location.
  2. It’s pretty quick. I had generated the first version in over an hour and it worked.
  3. Your overlay images need to be very basic and have no text. The original route map image I tried had all the location names and some other features drawn on it. When it was re-projected, the text was completely warped out of shape and unreadable. I used a image editing tool to remove all the text, background colors, etc. - everything except the routes themselves. That simplified route map image considerably cleaned up the resulting mashup.
  4. More landmarks will definitely improve your results. This may seem obvious, but MapCruncher says it only needs around 10 waypoints to render the mashup. I set 10 along the west edge of the map and two on the right. It did great alignment on that left set of points, but the projection was way out of whack along the right in Oakland and Pleasanton. I ended up setting 21 points with an even mix across the different quadrants of the map.
  5. Zoom levels will make you want two source images. I set the render to only go to zoom level 11 first, but didn’t like how soon the route map disappeared when I zoomed in lower. So I cranked it up to render all the way down to 13. Since it’s all doing it from one image, the routes were drawn gigantically at lower zoom levels. I’m guessing you probably need to have different source images for different sets of zoom levels to account for size differences.

Here’s the components of my project if you want to try it out yourself:

MapCruncher seems like its great for either a simple image overlay or a quick and dirty project. If all we had was a rendered image of some information, I could defintely see it as a way to get that data integrated. Honestly, I can’t now imagine trying to generate a set of projected tile layers of it without MapCruncher.

You’ll probably have to spend a fair amount of time getting the overlay images just right so that they work at all zoom levels. You’ll probably want to play with transparency as well to make sure it doesn’t completely obscure the map. At some point, I’ll have to try rendering the image from raw GIS data in a tool like ESRI to see how that improves the results.

You’re probably wondering when this will show up on Redfin’s map. I’m not sure it will. We already display all the BART stops on the map thanks to VE. Adding the route overlays actually adds a lot of clutter to the map without providing a lot more information.

We could certainly be wrong on that though. Drop us a comment if you believe this is very valuable info to add to the Bay Area map or if you have suggestions on what other data overlays we should add to help in your house search.


October 9, 2007

AJAX Performance Part 1: I Love Your Website, but it’s so Slow!

One of the main goals of our latest release was to improve the overall performance of the user interface, so we’re starting an intermittent series today on the dev blog talking about what we learned along the way.

Firstly, we had to recognize (channeling Steve Souders of YSlow and High Performance Web Sites fame) that the largest part of our performance problems were on the client-side, and the problems were most severe in IE6.

We set about trying to optimize client performance independent of network latencies and server query times. For the most part, this meant reducing the time the browser spent running our JavaScript. One reason this is particularly important is that the browser does not do anything else while JavaScript is running; the UI is completely locked up. No events, no back button, no browser menus.

If you do a lot of heavy processing on the client side in JavaScript, this can be a real problem since it causes visible delays and makes a web application generally clunky and unresponsive. Accepting that there were times when our website was sluggish, we set out to improve the overall performance of our user interface. More about how we figured out where to start after the jump. Read the rest of this entry »