The R programming language

The R programming language has become one of the standard tools for statistical data analysis and visualization, and is widely used by Google and many others. The language includes extensive support for working with vectors of integers, numerics (doubles), and many other types, but has lacked support for 64-bit integers. Romain Francois has recently uploaded the int64 package to CRAN as well as updated versions of the Rcpp and RProtobuf packages to make use of this package. Inside Google, this is important when interacting with other engineering systems such as Dremel and Protocol Buffers, where our engineers and quantitative analysts often need to read in 64-bit quantities from a datastore and perform statistical analysis inside R.
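To see why a dedicated 64-bit type matters, recall that R's integer type is 32 bits wide and its numeric type is an IEEE 754 double, which represents integers exactly only up to 2^53. The short sketch below uses Python purely for illustration (Python floats are also IEEE 754 doubles), and the identifier value is made up. It shows a 64-bit quantity silently changing when it is forced through a double:

```python
# Illustration only: Python floats are IEEE 754 doubles, like R's numeric type.
big_id = 2**53 + 1  # hypothetical 64-bit identifier read from a datastore

as_double = float(big_id)    # what happens if the value is stored as a double
round_trip = int(as_double)  # convert back to an exact integer

print(big_id)      # 9007199254740993
print(round_trip)  # 9007199254740992 (the low-order bit is lost)

INT32_MAX = 2**31 - 1
fits_in_int32 = big_id <= INT32_MAX  # False: too large for a 32-bit integer
```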

Romain has taken the approach of storing int64 vectors as S4 objects, with each value held in a pair of R’s default 32-bit integers: one for the high-order bits and one for the low-order bits. Almost all of the standard arithmetic operations built into the R language have been extended to work with this new class. The design is such that the necessary bit arithmetic is done behind the scenes in high-performance C++ code, but the higher-level R functions work transparently. This means, for example, that you can:

• Perform arithmetic operations between 64-bit operands or between int64 objects and integer or numeric types in R.
• Read and write CSV files including 64-bit values by specifying int64 as a colClasses argument to read.csv and write.csv (with int64 version 1.1).
• Load and save 64-bit types with the built-in serialization methods of R.
• Compute summary statistics of int64 vectors, such as max, min, range, sum, and the other standard R functions in the Summary Group Generic.
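The high/low representation described above can be sketched in a few lines. This is a minimal illustration in Python, not the int64 package's actual C++ code: the function names `split64`, `join64`, and `add64` are hypothetical, and the words are treated as unsigned here for simplicity, whereas R's integers are signed, so the package's storage details may differ.

```python
MASK32 = 0xFFFFFFFF  # selects the low 32 bits of a value


def split64(value):
    """Split a signed 64-bit integer into (high, low) unsigned 32-bit words."""
    if not -(2**63) <= value < 2**63:
        raise OverflowError("value does not fit in a signed 64-bit integer")
    bits = value & 0xFFFFFFFFFFFFFFFF  # two's-complement bit pattern
    return (bits >> 32) & MASK32, bits & MASK32


def join64(high, low):
    """Rebuild the signed 64-bit integer from its two 32-bit words."""
    bits = (high << 32) | low
    return bits - 2**64 if bits >= 2**63 else bits


def add64(a, b):
    """Add two (high, low) pairs, carrying from the low word into the high word.

    The result wraps around on 64-bit overflow.
    """
    low_sum = a[1] + b[1]
    carry = low_sum >> 32
    high_sum = (a[0] + b[0] + carry) & MASK32
    return high_sum, low_sum & MASK32


x = split64(2**32 - 1)        # low word is all ones, high word is zero
y = split64(1)
total = join64(*add64(x, y))  # the carry propagates into the high word
print(total)                  # 4294967296
```

Keeping the word-level work in small helper functions mirrors the design described in the post: the bit manipulation stays hidden, and callers only ever see ordinary integer values.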

For even higher levels of precision, there is also the venerable and powerful GNU Multiple Precision Arithmetic Library and the R GMP package on CRAN, although Romain’s new int64 package is a better fit for the 64-bit case.

We’ve had to work around the lack of 64-bit integers in R for several years at Google. After several discussions with Romain, we were very happy to fund his development of this package, which solves the problem not just for us but for the broader open-source community as well.

Google’s Cloud technology to Google Earth & Maps Enterprise

Our vision for Google Earth and Google Maps has always been to create a digital mirror of the world where any user, anywhere, and from any device or platform can access current, authoritative, accurate, and rich information about the world around them. In order to provide fast and timely maps to our users, we’ve developed powerful geo infrastructure that lets us process and serve petabytes of imagery and basemap data to hundreds of millions of users.

We frequently hear requests from governments and businesses – some of whom use our existing Enterprise Earth & Maps products today – that they would like to have greater access to some of the infrastructure we’ve built in order to more easily store their geospatial data in the cloud and more easily build and publish maps for their users.

Today we announced Google Earth Builder, which continues the spirit of efforts such as Google App Engine and Google Exacycle in providing more access to Google’s core infrastructure.

Google Earth Builder is an Enterprise mapping platform powered by Google’s cloud technology. We’ve built Google Earth Builder with the idea that any organization with their own mapping data – be it terabytes of imagery or just a few basemap layers – should be able to upload and manage that data in the cloud. They can use Google’s scalable infrastructure to process and securely serve it through familiar Google Earth and Maps interfaces to their users.

Our goal for Google Earth Builder is to enable Enterprises that work with geospatial data and create online maps to perform these tasks in the cloud. Over time we anticipate providing access to more and more of our geo infrastructure through Google Earth Builder, so businesses have more options for how to process, publish and analyze their geospatial data. We’re excited to launch Google Earth Builder in Q3. In the meantime, if you are interested in learning more, please get in touch.

Announcing Google Refine 2.0, a power tool for data wranglers

Our acquisition of Metaweb back in July also brought along Freebase Gridworks, an open source software project for cleaning and enhancing entire data sets. Today we’re announcing that the project has been renamed to Google Refine and version 2.0 is now available.

Google Refine is a power tool for working with messy data sets, including cleaning up inconsistencies, transforming them from one format into another, and extending them with new data from external web services or other databases. Version 2.0 introduces a new extensions architecture, a reconciliation framework for linking records to other databases (like Freebase), and a ton of new transformation commands and expressions.

Freebase Gridworks 1.0 has already been well received by the data journalism and open government data communities (you can read how the Chicago Tribune, ProPublica, and others have used it), and we are very excited by what they and others will be able to do with this new release. To learn more about what you can do with Google Refine 2.0, watch the project’s three screencasts (7 min, 9 min, and 6 min).

The project is open source, and its code and downloads are available online, along with a list of changes from version 1.1 to 2.0.

By David Huynh, Google Search Infrastructure