Revisiting A Classic

I finished another programming project and I think it’s my strongest yet, thanks to me finally getting serious about testing! I called my app HuffmanRevisited because it implements the classic Huffman coding algorithm from 1951, and also because I had previously tried to program this algorithm a few months ago.

This time round, I coded it in Java, not JavaScript. It probably required about a solid week of work. And unlike my earlier attempt, I made a finished app. It can:

  • load a text file, encode (compress) it and save the encoded version, or else
  • load a previously compressed file, decode it and save the original content back as a text file

(You can skim my code here if you like.)

A few aspects of my approach helped make coding this an enjoyable task: I tried to specify it rigorously, I wrote a lot of tests, and I tackled a clear, well-known problem.

Before Christmas, I read a book – well, no, I skimmed a book after reading the prologue (which took multiple attempts) – called How To Design Programs. It’s an MIT textbook, you can read it here. I recommend it.

My paraphrase of the book’s prologue is: “precisely specifying what a program does is the hardest part of making it.”

Of course, I had encountered this sentiment in my Object Oriented Software Engineering module in the National College of Ireland. But the MIT textbook avoids the familiar paraphernelia of design patterns, verbose conventions and profuse diagrams. Instead, the book’s prologue challenges you to specify a problem by answering two questions: how is the data represented (in terms of data types that the machine can understand, and at all steps of the program)? and what are the functions (in terms of operations on those types)?

I mean, when I write it out, it’s dull and obvious. Yet, time and again I’ve found myself skimping on this specification step. Because, yes, it is the hardest part and one’s brain tries to escape it.

Even after my epiphany from reading the MIT book, I still evaded it. I specified most but not all of my Huffman encoding app before I started, thinking that the remainder was “simple”. But the simple part is never simple and if you haven’t thought it through, the solution will be incoherent even if it seems to work.

I failed to specify all of HuffmanRevisited, but at least I knew that this failure was to blame when I witnessed complexity mushrooming in front of my eyes as I tried to solve new, small problems that kept cropping up.

BTW, I’ll mention a couple of those little problems to see if you spot a pattern that kind of pleased me:

  • accessing individual bits from an array of bytes
  • packing an unpredictable amount of bits into an array of bytes
  • turning Java’s int and char types into a byte representation (not just casting which truncates them)
  • saving a compact representation of a binary tree containing characters in its leaf nodes (the ‘Huffman tree’)

Yeah… the pattern I spotted is that I’m doing low-level stuff and pushing somewhat against the limitations of Java. This is nice because I’ve been looking for an excuse to study a more low-level language!

The other thing that went really well with this project was testing. I’d written tests in JUnit before, but to be honest I was doing it to fulfil obligations in school assignments. Just like with specifying rigorously, I knew that tests are a great idea but was lazy about writing them.

I totally changed my tune once I had the framework up and running. (I used JUnit 5, Maven and NetBeans 11, and I mention this combination because I had no joy with JUnit 4 or with Ant.) I realised I’ve always done a lot of testing, but amateurishly: printing variables to the console all the time. That works okay… until your program starts to have a decent amount of methods on the call stack (i.e., functions that call functions that call functions that call functions…) and you spend your time trying to remember in which method did your gnomic text originate. Plus, all those print statements mess up your code.

Not to sound too much like a new convert, but using a test framework was a delight after all that. It’s much more organised. (And with my let’s say “wide-ranging” mind, I need all the organisation I can get!) It’s just like what I was doing before, except you only see debug text if there’s a problem, you get to choose any range of values you like to test, you can test as small or as large a section of the program as you like (as long as you’ve made the program itself reasonably articulated), and all of this business lives in separate files. Oh, and you get a cheerful screen of green when you pass your tests!

It’s enough to warm anyone’s heart

Okay, so specification and testing are non-negotiable aspects of real-world software development, but the last aspect I want to discuss can be more of a luxury: a clearly defined problem.

Until I get a start in a programming job, I can’t be sure, but my impression is that even communicating what a problem is, never mind completing a rigorous specification, can be hard in a typical business context.

However, I did this project for self-study so I got to choose exactly what to work on.

(I was helped in this by a comp sci book called The Turing Omnibus that a mentor very kindly lent me! It has a chapter on Huffman coding. The hook of the book, I would say, is that it conversationally introduces topics but doesn’t take you through every nuance. For example, unlike the Wikipedia article on Huffman coding, it mentions neither the need to pad bytes with zeros, nor any scheme for storing a b-tree.)

I was so glad I chose such a an old chestnut of an algorithm to implement! When I was refactoring my way out of that mushrooming complexity I mentioned earlier, the clarity of my app’s intention was a godsend.

Even better was the lack of edge cases. I could be certain my program had worked when it took a text file, compressed it into a smaller file, and then decompressed that encoded version into the exact same data I started with!

That’s intrinsically neater than some other areas I’ve boldly attempted, for example digital audio or vector graphics, where you need good control of sampling and rounding.

When I do go back to such complex topics, I’ll have a crucial bit of extra experience with the exact weapons needed to contain the ambiguity. Testing and full specification.

So, I’ll sign off there. My app could easily absorb some more effort. The next thing to work on would be efficiency. Do some profiling, and just comb through it for wastages. I can also think of some cool ways to present it, but no point hyping work I may not get around to.

Anyway, I’m pleased with it already. Jaysusin’ thing works, and the code structure is fairly sensible.

Thanks for reading. Take care in these hard times!

The header image “Massachusetts Institute of Technology Parking Lot, Memorial Drive, Viewed from Graduate Housing” by MIT-Libraries is licensed under CC BY-NC 2.0. I chose it because David A. Huffman might have seen such a view as he wrote the term paper in which he invented his compression technique.

Why Is Python Nice For Learners?

I’ve come back to Python a few times in the past year. Before September 2019 I’d never touched it (or any modern programming language, really). Right now I’m using it to build a 3D game level generator, and very much enjoying myself!

I feel I’m close enough to the start of the Python learning curve to give a personal perspective on why this language suits dabblers and students.

This is not a prospectus of Python’s features or design decisions – I don’t know enough to talk about those. Just a happy acknowledgement of what’s been making my life easier in the past couple of days.

I’ll compare it to Java, the language I know best, which is also probably the closest to a standard for Year 1 of comp sci courses.

  1. The look
    Instead of the curly brackets of C, C++, Java and other much-used languages, Python uses the amount of tabs at the start of a line to determine what “block” that line belongs to (and also the colon at the end of any line that declares a new block). There’s something reassuringly basic and solid about that, to a newbie. You don’t need to learn the little skill of counting brackets (nor do you need to use semicolons) while discounting whitespace. Instead, how it looks in your editor tells you what’s going on directly.
  2. Simple string manipulation
    Python’s elementary text manipulations are simpler than Java’s. When you’re starting out, I think that helps keep your mind on the actual task. For example, getting input from the console in Java generally requires creating a “Scanner” – a powerful bit of software for carving up and selectively serving text from an incoming stream. Those capabilities are irrelevant for basic text input. So students end up depending on something whose quirks and abilities they don’t understand. Including some unintuitive behaviour that I’ve seen even experts admit is confounding.
    Also on this point, Python requires you to convert all number types to strings explicitly, when printing them. Which can be annoying. However, this arguably makes the underlying reality clearer than performing the same conversions invisibly as Java does.
  3. It doesn’t force Object Oriented Programming
    We’re getting close to the topic of many boring flame wars on which paradigm is better, but let’s keep this to an inoffensive point: Python affords procedural, functional or OOP styles of programming. So if you have experience of either, you can start from there. And if you don’t, you’ll be saved the laborious and ritualistic aspects of standard Java style: private member variables, getters and setters, etc. – until you’re ready.
  4. Those data structures
    I first read about Python in Jay Barnson‘s delightful old piece, which I have returned to many times since, “How To Build A Game In A Week From Scratch With No Budget”. In it, he eulogises its list and dictionary data structures. And he’s damn right, they’re great. Instead of wrapping data in near-pointless objects – a practice so widespread in the Java world it has its own name, “anemic objects” – you’ll find yourself creating, stacking and passing around these built-in structures, as I’ve done pretty much throughout my level generator app.
  5. It’s mellowed by age
    This one cuts both ways. You can find articles on Medium arguing that Python is past its peak, and getting less fashionable or even relevant by the year. Yet, the advantages include having plenty of 3rd-party libraries (though TBH I haven’t dipped into these yet) and also a smoothing away of rough edges. For example, Python 3.x (available since 2008) has what’s called “new-style classes” that allow for conveniences such as property annotations, getting info on a class or method at runtime… I actually don’t understand that stuff yet, but the point is, whereas a few versions ago I would’ve had to use a special syntax to get “new-style classes”, now that happens automatically. One less thing to think about.
  6. You don’t have to split files
    Another really small thing that nonetheless will be felt by newbies: Python scripts can have multiple classes, or none, in one file (called a “module”). Whereas even a minor task in Java might push you to make a few classes, and so have to save a few different files.

The end result of all these things together is: even if you’re not very good at it yet, solving problems in Python feels fast. I don’t myself have the depth of understanding to explain this one. But ever since I first tried a maths problem or scripting text templates with it, I was pleased by that feeling of getting things done.

BTW, I wouldn’t call Python a “beginner language” in the same sense I applied to Processing a few posts back. Python doesn’t have every interaction carefully shepherded so as to hide complexity. Nor does it provide a simplifying framework for graphics or what-not. It’s a full-featured language (although with a favoured domain of data analytics, science, scripting, and stuff like that, for sure).

Last thing. For some reason, I pronounce Python as PIE-THON, rather than PIE-thn. Does anyone else out there do this? Let me know I’m not alone.