Mr. McCat's
Mysterious Coprolite Emporium |
||||||||||
Home | Blog | About |
2025-01-07
Wanna hear a funny thing about Excel? Maybe two funny things, can't sleep anyway, so here goes.
There is such a thing in most spreadsheet software, Excel included, as Pivot Tables. It is a curious tool — it allows one to compute various aggregates from the source data and lay it out in a nice report with hierarchies along the rows and columns, with corresponding aggregate numbers at the intersections. All in all, a poor man's OLAP cube.
It was introduced around thirty years ago, and it was all beautiful, but being able to source data only from within the sheet or other Excel files lying right next to it was annoying at best. Everyone quickly started moaning how tired they were of copy-pasting their precious shit goodies off of other sources. As a result, Excel started growing data ingest warts all over — worksheet file connections, CSV connections, DBMS connections, XML import — all of them done via different underlying mechanisms and in different corners of the already sprawling cyclopean UI that is Excel.
That was decidedly better than copy-pasting things, but something was still amiss. Mainly because if you had to merge and preprocess data from several sources, you'd still have to copy-paste things onto a spreadsheet and rawdog it by hand.
For example, you might have some data in an XML file, some DBMS dump, a handy lookup dictionary in a CSV, and maybe go fetch some JSON data over there from your buddy's server, and you want to abuse inner joins to make it into a single view and latch some calculated columns on top, just to make the matters worse. Without automation, that's an ardorous task at best.
Worse yet, you might just need to update that data. Or worse yet, you might need to update it every other day. Or five times a day. All in all, without some automation, you're well and truly fucked.
That kind of nightmare is a first-class citizen in a corporate wage slave's life, so naturally corpies were all more than willing to shell some buck the MS's way if they made this less of a muddy disaster. And they did deliver! The thing's called Power Query, and it's an Excel that handles the ETL (Extract-Transform-Load) cycle using various sources, loading stuff either onto a nicely formatted table in the worksheet, or straight up feeding it into a Pivot Table.
Now, since we're all about formats and the cataloguing of mortal sins of their authors here, the question we're after is not really of how it's used, or the history of it, but rather how it is stored in your average .xlsx
file.
Today's office file formats (OOXML, ODF) are usually zip archives with a specific structure inside. There are many parts describing the actual contents, and some parts describing which part contains what, and most of it is in XML.
It's all relatively consistent and sometimes even neat, so, as common sense would rightly suggest, it would be prudent to store such a feature's configuration as XML files as well — why fuck around?
Alas, the twillight genius does not work that way, he must has his own unique, special, and mysterious way.
And here starts the voyage to the center of the Earth, one of a caliber that would tear the Dante's sorry arse into something loosely resembling a fractal manifold.
That's your mortal world.
Poke at Explorer, open Excel, spreadsheet, spread shit, pivot what, move fields around, enjoy the show. Not particularly interesting.
Here we have basements, rats, bums squatting, worms squirming, feces hurtling down the drain.
We have a document. The one we've just been poking at on the Ground Floor. Far as we know, it's a zip file, with some actual content and some special files telling us which files contain what, mostly in XML. A'ight. Not surprising yet.
Here we have karst seas, Dagonian cultists, and lots of mold.
In our document, there is a description of our Pivot Table in the form of an XML document: which fields it has, which aggregates to calculate, that kind of junk. Still not surprising. Let's look at what is specified as its data source. Normally, it'd be a cell range on some sheet, but in our case, since we've used Power Query, it's some kind of “connection”.
Bedrock. Dark, stuffy, reeks ever so slightly of sulfur.
Connections. They are defined in yet another XML file inside our document, one we have to check yet. In there, we see connections being defined. Maybe all our multi-source tower of Babel is described here? Ha-ha. No. For our connection in question there is only a cryptic heathen spell that briefly mentions “OleDb”, and a relationship attribute pointing to yet another XML file inside our document. Intriguing. Let's follow.
All around, imps crawl the ceiling, in kettles filled with half vitriol and half acid there are sinners, wailing like banshees, only out of their sorry arses since that is the only part that sticks out of the surface.
So, we've found that (yet another) XML file. It only has one and a half nested elements. On the inner one, there are no attributes, no other tags nested, only a mile-long base64 string. Manic laughter booms eerily from the neighboring caverns.
No sinners, no imps, no features of note here, only the Satan's schlong curling around the planet's core 666 hundred times without ever crossing itself — an NC-17 rated exercise in applied topology.
Schlongs aside, do you remember we've found a base64 string? That's right, it encodes a binary structure with several fields. It has some flags and a few integers in Little Endian, which is not at all surprising at this point — it's a binary structure after all. However, among the innocous fields there is one that contains... a blob.
Abyss. Perpetual fall into the black maw with no sign of it ever having an end.
That blob? It's another package. As in, a ZIP file, that has some parts describing content, and some service parts describing what's where. Almost there.
Chaos. Non-existence. There is nothing.
That is, nothing except that in one of the files inside that ZIP file there is a text file that describes taking those multiple sources we mentioned at the start and contorting them into one disgusting chthonic abomination of a table. There it is, our golden McGuffin, with a brownish tinge to it.
Catharsis, dissolution.
For that description our dear colleagues in our decidedly mentally ill industry have created... a separate language. With its own unique grammar and execution model and all the other shit.
Why though? Yes, to load a CSV into a spreadsheet.