Michael Smedberg




Principal Engineer at Redfin

Recent posts

March 4, 2008

How to search Redfin directly from IE and Firefox

I recently heard that some loyal Redfin customers were using Google to do address searches. That’s a shame, since Redfin does a pretty decent job of searching for addresses, MLS IDs, cities, etc. I wanted to see if I could help those power users get to Redfin search results with fewer clicks.

Naturally, I wanted to see if I could make a Redfin search available. It was surprisingly easy- I didn’t have to change any Redfin code. I just wrote an XML file, hosted it on a Web server, and put a link into an HTML page (this page, in fact!)

If you’re using Firefox or Internet Explorer 7, you should be able to enable Redfin search by choosing the relevant option in the search dropdown while viewing this page. Here’s a screenshot of Firefox:

add redfin search to firefox

Here’s how it looks in IE7:

add redfin search to ie7

Once you’ve added Redfin, you can do address searches on Redfin using the search control:
search with redfin

Another way to do searches in Redfin is to use a keyword bookmark. In Firefox, make a new bookmark, and edit it to look like this:

redfin keyword bookmark

The important parts are the location and the keyword. The location got cut off in my screenshot, but it should be http://www.redfin.com/stingray/do/listings-search#search_location=%s

(the only tricky part is the “%s” part, which will get replaced with whatever you search for.)

To use it, you type the keyword and the search terms into your location bar, like this:

search for headquarters

or this (for an MLS):

search for mls


November 5, 2007

Elephant versus Dolphin: Which is Faster? Which is Smarter?

Elephant vs.MySQL

Redfin recently switched some of our backend DB infrastructure from MySQL to Postgres, and we plan to wholly switch to Postgres in the near future. This wasn’t an easy decision; MySQL has a lot going for it, and switching has been a lot of work. However, we’ve already seen major benefits from choosing Postgres, and we expect to see more as we complete our transition. In particular, performance on certain geographic queries has improved dramatically.
A simple Google search shows that a lot of people have already opined about MySQL versus Postgres (e.g. here, here, here, here, here, and here) but we weren’t able to find much information that applied directly to the problem we were having. Specifically, we were having some major performance problems with queries that were constrained by both spatial and numeric columns, and all of our attempts to squeeze more performance out of MySQL (including hiring expensive outside consultants) had come to naught.

GIS Indexes

Redfin is an online real estate company, and our map based UI is the most-used part of our web site (as well as being the biggest performance hog.) When a user views the map, we use SQL to find the relevant listings or past sales. Users typically constrain a search by numerous criteria, such as maximum price or minimum square footage. Since the UI is map based, users are ALWAYS constraining by geography, though that constraint might be weak.

How We Did It In MySQL

In MySQL, the queries might look something like:

SELECT
    *
FROM
    listings
WHERE
    price <= 400000 AND
    num_bedrooms >= 2 AND
    num_bathrooms >= 1.5 AND
    type = 'condo' AND
    MBRContains(GeomFromText('POLYGON((X1 Y1,X1 Y2,X2 Y2,X2 Y1,X1 Y1))'), centroid_col)
LIMIT
    101

where X1/Y1 and X2/Y2 are lat/long pairs that describe the region to be searched. To improve performance, we create indexes on the relevant columns. In MySQL, a normal index cannot include spatial columns, and spatial indexes cannot include normal columns. In this example, we might have one multi-column index on price, num_bedrooms, and num_bathrooms, and another single-column index on centroid_col. In many cases, this performs great. Examples include:

  • When the table is small (we have hundreds of thousands of listings, but tens of millions of past sales records)
  • When the geographic constraint is very selective (i.e. when the map is zoomed very far in)
  • When the geographic constraint is the only constraint (i.e. the user doesn’t care about price, bedrooms, etc.)
  • When the constraints are poor, but the LIMIT amount is hit quickly (e.g. search for all listings in the the world; MySQL can quickly find the first 101 rows in the table, and once it’s found 101, it can give up)

However, there were also cases where it performed terribly, particularly when the table was big, the geographic constraints were relatively weak, and other constraints were relatively strong. For example, a search for all past sales in the San Francisco Bay Area that had 1 bedroom, but sold for over $10,000,000 resulted in a “killer” query. This is a little counterintuitive, but was definitely a problem for our customers (though my example is a very extreme case.) The problem with this query is that:

  • The table is large (tens of millions of rows)
  • The geographic index is the best index to use, but still isn’t great (might return 500,000 rows, or ~1% of the table)
  • MySQL would “short circuit” the query when 101 records were found, but the query returns less than 101 records (there are few 1 bedroom condos that sold for more than $10M), so MySQL examines all 500,000 rows that match the geographic constraint

This does happen in real life.
For example, a user might be looking at homes in a small neighborhood. She’s looking for a 2 bedroom condo between $350k and $375k with a view (a fairly heavily constrainted query.) Then she zooms the map out a few levels (maybe she wants to see a lot of the map to pick out other neighborhoods of interest.) She has just unwittingly made a killer query- she’s searching a large geography with tight constraints on other attributes.
Another example is an investor- someone who wants to search large geographic areas for “fixer” properties that have a low asking price and large living area. Again, this results in a query that’s tightly constrained by some criteria, but relatively loosely constrained by geography.

Postgres and PostGIS

Jeff Yee, our intrepid head of QA, pointed out that geographic indexes in Postgres are supported through the feature-rich PostGIS plug-in. PostGIS supports all sorts of goodies (such as polygon containment, distance calculations, projection conversion, etc.), but the biggest gain is support for indexes on multiple, mixed-type columns. Using PostGIS, we could create an index on centroid_col, price, and num_bedrooms. These indexes turned many of our “killer” queries into pussycats. It was immediately obvious that for Redfin, PostGIS is a Very Good Thing. PostGIS offers us more than just a huge performance improvement and robust, sophisticated geographic functionality. It also offers an active community- there are lots of users available to answer silly newbie questions, and the software is being actively developed. On top of that, there’s a great Windows installer.

Other Considerations

MyISAM and Data Corruption

In MySQL, our tables were MyISAM, since the geometric indexes we used were only supported on MyISAM tables. MyISAM generally offers very good performance, but unfortunately we’ve experienced data corruption on our production systems a number of times. It’s VERY painful, but we can live with occasional corruption if that’s the only way to deliver the performance we seek. PostGIS has given us another option, and we expect the advanced locking and data protection in Postgres to make data corruption a thing of the past.

Replication

We use a “single master-multiple slave” configuration in production, which requires data replication. The MySQL replication options are not super flexible, but they did exactly what we needed them to do, and they did it really well. Replication was easy to set up, easy to monitor, and proved to be very reliable. In Postgres, we had more options, and more confusion. It took us a while to work out exactly how we would do replication; validating and implementing that plan took considerable effort. It’s in production now, and it is working fine, but it was certainly a lot more effort than in MySQL. There’s also an ongoing cost- replicating DDL changes is more complicated under Postgres than it was under MySQL.

Advanced Features

Advanced PostGIS features such as polygon matching and distance calculation have already helped us move much more quickly on Redfin features. Most of these things CAN be done in MySQL (e.g. by post-processing query results in Java using the excellent JTS Topology Suite library from Vivid Solutions), but it’s significantly more work, and in some cases would degrade performance. Hopefully, you’ll see new Redfin features in the near future, and think to yourself “Aah, they’re making PostGIS do the heavy lifting- the lazy bastards.” Postgres also contains advanced features that we were able to immediately benefit from. In particular, we use the CLUSTER command to optimize our table for access via the multicolumn geographic index.

Conclusion

Switching to Postgres was a lot of work. This was compounded by the fact that we chose to “toe-dip” into Postgres- most of our tables are still in MySQL- so our Java code is cluttered with logic to choose the correct DB connection for each query, to construct the “correct” SQL for each DB (most Redfin developers were not required to use Postgres during the development cycle, and we wanted to be able to fall back to MySQL if Postgres turned out to be a disaster), etc. We use Hibernate for persistence, which added another layer of complexity. However, when I see the performance gains we’ve made, I know it’s all worth while. The best cases probably aren’t much better, but the worst cases are startlingly better. Postgres and PostGIS let me feel good about telling my friends to use “past sales” searches on Redfin- I’m confident they won’t be waiting long for their results!
Dolphins may be smarter than elephants, but in the end, elephants are domesticable and can carry a heavy load.
elephant_lift


August 31, 2007

Getting the Time Zone from a Web Browser

Writing rich date/time features in a web app can be a pain. Apps (such as schedulers) that do math on times (e.g. ordering times) should pay attention to time zones for those times, but it’s difficult to know which time zone should be used to display the times to the user. Asking the user to explicitly choose a time zone is natural and often necessary, but a long list of time zone choices can be intimidating to the user. I’ll discuss one method you can use to detect the probable time zone of a browser. It’s not perfect, but it offers a good default (and it’s easy to code.)

Time zones

For apps that don’t have rich date/time functionality, times can be represented as simple numbers or strings. For example, if I wanted to meet you in San Francisco to go to the late showing of The Bourne Ultimatum at the Metreon theater, it’s probaby fine to say “let’s meet at the theater at 11:15 PM on Friday; the show starts at 11:30 PM.” Since we’re in the same time zone as each other, and as the theater, we don’t really care about time zones. An application that’s facilitating this interaction could store and display the times “11:15 PM” and “11:30 PM” without regard to time zones.

Metreon

If we want to do date/time processing on the back end (e.g. ordering events in time), it’s more general to store absolute times in the database. For example, instead of storing something like “hour => 11, minute => 15″, which might mean different absolute times to different users, it is convenient to store a canonical time such as the number of seconds since January 1, 1970 in GMT. That way, we can compare all of the times in our database without having to worry about the time zone for each.

GMT

If we store “absolute” times in the database, then displaying times to users becomes a localization issue. It’s pretty similar to localizing web site content into the language of the user. You start with a notion of what you want to present to the user, then you identify how the user wants to receive it (i.e. their language/locale), and then you localize the content at runtime (i.e. you show them the content in their language.)

As with language, you need to know about the user. The HTTP protocol specifies that the “Accept-Language” header can be used by servers to find out which language(s) the user prefers. The “Accept-Language” header is nice because it lets websites show content in the “correct” language without having to explicitly ask the user. A user who only speaks French doesn’t have to puzzle through an English language page that says “click here for French” somewhere in a footer- they just see the content in French. Even better, it’s one less setting that the server has to manage, and that the user has to set and keep up-to-date.

Unfortunately, there is no corresponding “Accept-Timezone” header- the HTTP standard does not contain any facility to allow the browser to automatically tell the server what time zone the user cares about.

There are two standard ways for developers to deal with this.

First, they can ignore it. For many apps, this is a decent approach- just store “11:30 PM”, and don’t worry about the time zone. As long as all the users who care about that time know what time zone it’s in, then the app doesn’t have to keep track of it.

Second, they can ask the user to make an explicit choice. For example, when setting up Google Calendar, you are asked to choose a time zone. That’s fine for the developer, but finding the “right” time zone in a long list can be a pain for the user.

I wanted to let users choose a time zone on my site, but I also wanted to have an intelligent default- for most users, they shouldn’t have to take any action; the choice I make for them should be correct.

This calls for Javascript on the client. I wanted to write some Javascript that would choose the right option in a time zone dropdown.

This is slightly harder than it seems because Javascript ALSO does not contain a way to get the time zone of the user. Javascript DOES, however, provide a way to get the offset from GMT for any particular time. A time zone can be thought of as a rule that says what the GMT offset is for different times. We can therefore do a reverse mapping- if we know the GMT offset for a few times, we can figure out the time zone for the user. Time zones can be quite complicated (some include Daylight Savings Time, some start or end DST on different dates than others, sometimes the DST offset isn’t a full hour, etc. There are even time zones that are identical for all FUTURE times, but had differences in the PAST.)

In theory, we could deal with all of these cases by doing many probes- we could check the GMT offset for many times, and get an exact time zone match. In practice, this really isn’t necessary- most users are in the more populous time zones, and the cost of failure (defaulting to a time zone that’s similar but not quite right) is not terribly high. Instead, we can probe two times (one in the summer and one in the winter) to find out the normal GMT offset, whether the time zone has Daylight Savings Time, and the DST offset.

In terms of implementation, I wanted to basically make a list of recognized offsets. That is, a list that says “if the summer offset is -7 hours, and the winter offset is -8 hours, then the time zone is probably US/Pacific.”

I like hacking in Ruby, so I grabbed the TZInfo Ruby library, and wrote some code to run through the known time zones, figuring out the winter and summer offsets for each. After grouping by offsets, I had to choose a winner in the case of duplicates. When multiple time zones had the same summer and winter offsets, I searched for each of them on Google. I figured that the time zone with the most hits was probably the most popular one, so I chose that one. Here’s the Javascript code that I came up with:

function getTimezoneName() {
	tmSummer = new Date(Date.UTC(2005, 6, 30, 0, 0, 0, 0));
	so = -1 * tmSummer.getTimezoneOffset();
	tmWinter = new Date(Date.UTC(2005, 12, 30, 0, 0, 0, 0));
	wo = -1 * tmWinter.getTimezoneOffset();

	if (-660 == so && -660 == wo) return 'Pacific/Midway';
	if (-600 == so && -600 == wo) return 'Pacific/Tahiti';
	if (-570 == so && -570 == wo) return 'Pacific/Marquesas';
	if (-540 == so && -600 == wo) return 'America/Adak';
	if (-540 == so && -540 == wo) return 'Pacific/Gambier';
	if (-480 == so && -540 == wo) return 'US/Alaska';
	if (-480 == so && -480 == wo) return 'Pacific/Pitcairn';
	if (-420 == so && -480 == wo) return 'US/Pacific';
	if (-420 == so && -420 == wo) return 'US/Arizona';
	if (-360 == so && -420 == wo) return 'US/Mountain';
	if (-360 == so && -360 == wo) return 'America/Guatemala';
	if (-360 == so && -300 == wo) return 'Pacific/Easter';
	if (-300 == so && -360 == wo) return 'US/Central';
	if (-300 == so && -300 == wo) return 'America/Bogota';
	if (-240 == so && -300 == wo) return 'US/Eastern';
	if (-240 == so && -240 == wo) return 'America/Caracas';
	if (-240 == so && -180 == wo) return 'America/Santiago';
	if (-180 == so && -240 == wo) return 'Canada/Atlantic';
	if (-180 == so && -180 == wo) return 'America/Montevideo';
	if (-180 == so && -120 == wo) return 'America/Sao_Paulo';
	if (-150 == so && -210 == wo) return 'America/St_Johns';
	if (-120 == so && -180 == wo) return 'America/Godthab';
	if (-120 == so && -120 == wo) return 'America/Noronha';
	if (-60 == so && -60 == wo) return 'Atlantic/Cape_Verde';
	if (0 == so && -60 == wo) return 'Atlantic/Azores';
	if (0 == so && 0 == wo) return 'Africa/Casablanca';
	if (60 == so && 0 == wo) return 'Europe/London';
	if (60 == so && 60 == wo) return 'Africa/Algiers';
	if (60 == so && 120 == wo) return 'Africa/Windhoek';
	if (120 == so && 60 == wo) return 'Europe/Amsterdam';
	if (120 == so && 120 == wo) return 'Africa/Harare';
	if (180 == so && 120 == wo) return 'Europe/Athens';
	if (180 == so && 180 == wo) return 'Africa/Nairobi';
	if (240 == so && 180 == wo) return 'Europe/Moscow';
	if (240 == so && 240 == wo) return 'Asia/Dubai';
	if (270 == so && 210 == wo) return 'Asia/Tehran';
	if (270 == so && 270 == wo) return 'Asia/Kabul';
	if (300 == so && 240 == wo) return 'Asia/Baku';
	if (300 == so && 300 == wo) return 'Asia/Karachi';
	if (330 == so && 330 == wo) return 'Asia/Calcutta';
	if (345 == so && 345 == wo) return 'Asia/Katmandu';
	if (360 == so && 300 == wo) return 'Asia/Yekaterinburg';
	if (360 == so && 360 == wo) return 'Asia/Colombo';
	if (390 == so && 390 == wo) return 'Asia/Rangoon';
	if (420 == so && 360 == wo) return 'Asia/Almaty';
	if (420 == so && 420 == wo) return 'Asia/Bangkok';
	if (480 == so && 420 == wo) return 'Asia/Krasnoyarsk';
	if (480 == so && 480 == wo) return 'Australia/Perth';
	if (540 == so && 480 == wo) return 'Asia/Irkutsk';
	if (540 == so && 540 == wo) return 'Asia/Tokyo';
	if (570 == so && 570 == wo) return 'Australia/Darwin';
	if (570 == so && 630 == wo) return 'Australia/Adelaide';
	if (600 == so && 540 == wo) return 'Asia/Yakutsk';
	if (600 == so && 600 == wo) return 'Australia/Brisbane';
	if (600 == so && 660 == wo) return 'Australia/Sydney';
	if (630 == so && 660 == wo) return 'Australia/Lord_Howe';
	if (660 == so && 600 == wo) return 'Asia/Vladivostok';
	if (660 == so && 660 == wo) return 'Pacific/Guadalcanal';
	if (690 == so && 690 == wo) return 'Pacific/Norfolk';
	if (720 == so && 660 == wo) return 'Asia/Magadan';
	if (720 == so && 720 == wo) return 'Pacific/Fiji';
	if (720 == so && 780 == wo) return 'Pacific/Auckland';
	if (765 == so && 825 == wo) return 'Pacific/Chatham';
	if (780 == so && 780 == wo) return 'Pacific/Enderbury'
	if (840 == so && 840 == wo) return 'Pacific/Kiritimati';
	return 'US/Pacific';
}