It is a very interesting data set and it is a great thing that Microsoft has released it. The (temporary) limitations on Texas and California are mentioned in the GeoJSON / JSON topic and in the Example: Import GeoJSON / JSON File topic. That latter topic also has illustrations of some of the anomalies. Hugh has also noted, with an illustration, a frequent class of anomalies: buildings hidden under trees.
There is a lot to think about in connection with this Microsoft gift to the community. From a "big picture" perspective it shows the current limits of the state of the art in automatic vectorization, given significant effort by a knowledgeable and reasonably well-funded team using very large resources, a huge library of images, and very large neural networks. Those limits point to directions where effort might be invested in the future.
From Microsoft's description of their work it is clear this was not an "unlimited funding" effort of the sort sometimes mounted by Google. They did not attempt to leverage every possible input, such as LiDAR, street views, or multispectral satellite photography, as controls on their results. Microsoft's effort had limits, which is good, because those limits make the process and results more relevant to what ordinary mortals might accomplish within a reasonable time span.
It is an open question how useful the result is. The main problem is that it is a huge data set seeded with very many mistakes: plenty of phantom buildings indicated in open fields (as seen in the example), numerous missed buildings plainly visible among other buildings, and routine errors in the footprints constructed. One could say "it is a step forward compared to other automated vectorization of satellite photos," which is probably true, but that does not change the likely response: "sure, but the errors are why we use manually vectorized footprints in our jurisdiction," or "that's why we now focus on LiDAR...". It can be far more costly to find and fix errors seeded throughout a data set than it is to simply digitize a particular area of interest manually, in assembly-line fashion.
That's OK. You could still use this data to find areas where buildings are likely to be, as a starting point for such a process, for statistical purposes, as a check on LiDAR, and so on.
It's also something to think about that Microsoft chose to publish the data as JSON. That seems likely to have been a political decision, given the spectacularly inefficient nature of JSON for gigabyte-scale data. The primary virtue of JSON is that it is human-readable, a virtue that matters only for relatively small texts. It's crazy to publish two gigabytes of text in human-readable form instead of in a fast binary format.
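To make the inefficiency concrete, here is a small illustrative sketch (the coordinate values are made up for illustration, not taken from Microsoft's release) comparing one coordinate pair stored as GeoJSON-style text with the same pair stored as raw binary doubles:

```python
import json
import struct

# A hypothetical building-footprint vertex at typical GeoJSON precision.
vertex = [-122.31976318359375, 47.60621070861816]

as_json = json.dumps(vertex)             # human-readable text, e.g. "[-122.319..., 47.606...]"
as_binary = struct.pack("<2d", *vertex)  # two little-endian 8-byte doubles

print(len(as_json.encode("utf-8")), "bytes as JSON text")
print(len(as_binary), "bytes as packed doubles")
```

Two doubles always take 16 bytes in binary, while the JSON text for a single high-precision vertex runs well past that, before even counting the brackets, commas, and property names repeated for every one of millions of buildings.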
To get an idea of what 2 GB of human-readable text means, you can make a quick and dirty estimate: a web search tells us the average "page" of single-spaced text contains about 3,000 characters. 2 gigabytes of text therefore fill roughly 666,666 pages, which at 11 inches per page (the height of a Letter page; A4 is slightly taller) amounts to a set of pages that, placed end to end, would be about 115 miles or 185 kilometers long.
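The estimate above can be reproduced in a few lines. This is a back-of-envelope sketch using the same assumptions as the text (3,000 one-byte characters per page, 11 inches per page):

```python
# Back-of-envelope: how long is 2 GB of single-spaced text laid end to end?
CHARS = 2_000_000_000     # 2 GB of one-byte characters
CHARS_PER_PAGE = 3_000    # assumed characters per single-spaced page
PAGE_INCHES = 11          # assumed page height

pages = CHARS / CHARS_PER_PAGE
miles = pages * PAGE_INCHES / 63_360   # 63,360 inches per mile
kilometers = miles * 1.609344

print(f"{pages:,.0f} pages, {miles:.0f} miles, {kilometers:.0f} km")
```

Rounding lands the result at roughly 666,667 pages and about 116 miles (186 km), the same ballpark as the figures above.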
The current limitation of 2 GB on JSON imports is related to that spectacular inefficiency of JSON for bigger data, and to the belief that efficient efforts do not use human-readable formats for data that would fill 185 kilometers of single-spaced text. JSON is a text format and is basically treated as one by Manifold, the sort of format used to save queries, programming text, and commentary. 2 GB seemed far more than any query, program, or commentary could need (who writes a query that runs to 185 kilometers of single-spaced text?). So Microsoft's puzzling choice of JSON for gigabyte-scale data came as a surprise.
Expanding the maximum size of text items beyond 2 GB to handle Microsoft's choice is a straightforward task. One wishes they had used GPKG or something else. :-)