Libraries, Interfaces, and Dependencies
The code base for my new job is rather larger than the code base at my last job, which means that I have to start thinking about the build process more than I’ve been in the habit of doing recently. I know how to set up a good build system for C-based languages (hint: gcc -MD is your friend), but I’ve never felt as comfortable with the options for Javaish environments, even for relatively straightforward projects that are just building a single program.
And we’re not building a single program: it’s a distributed system, with a collection of programs using some common libraries and some distinct libraries. Which is quite similar to the situation at Kealia, but the build system feels rather different. Part of that is related to the source code layout (we’re putting each module in its own git repository), but part of that is due to differences in the Java/Scala way of compilation and the C-language way of compilation.
At Kealia, our product code was spread across three directories: interface, which had one subdirectory for each library, containing the public header files for that library; lib, which had one subdirectory for each library, containing the .cpp files and private header files for that library; and prog, which had one subdirectory for each program, containing the code specific to that program that we didn’t feel like pulling out to a library. So the build process was in two phases: it built all the library code in parallel, and then all the program code in parallel. (If I’m remembering it correctly, actually, the build system was happy to start working on program code even if some of the libraries weren’t done building, but at any rate dependency chains at the library/program conceptual level were never more than two deep.)
In Javaish languages, though, that doesn’t work, for the simple reason that the language doesn’t have a notion of a header file: if library A depends on library B, then you have to build library B before library A. Java programmers like to talk about how, in C++, header files mean that you have to write your class interfaces twice; that’s a quite valid complaint, but the absence of header files bites you in situations like this. Though the situation isn’t quite as simple as that description makes it sound: in a lot of situations, Java interfaces end up acting a lot like C++ header files, giving you some amount of separation of compilation while also requiring you to write some of your class interfaces twice. Certainly proper use of interfaces is important in reducing long dependency chains; I don’t think we’ll be able to get down to two levels of dependencies, but what is the shortest chain that we can reasonably get to?
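To make that concrete, here’s a minimal sketch (all names made up) of a Java interface playing the header-file role: clients compile against the jar containing only the interface, and the implementation jar can be rebuilt without recompiling them.

```java
// frame-api jar (hypothetical): the only thing clients need at compile time.
package com.example.video;

public interface FrameSource {
    /** Returns the next encoded frame, or null at end of stream. */
    byte[] nextFrame();
}
```

And the implementation, off in its own jar:

```java
// frame-impl jar (hypothetical): can change freely without touching clients,
// as long as the FrameSource interface above stays the same.
package com.example.video.impl;

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

import com.example.video.FrameSource;

public class FileFrameSource implements FrameSource {
    private final DataInputStream in;

    public FileFrameSource(DataInputStream in) {
        this.in = in;
    }

    @Override
    public byte[] nextFrame() {
        try {
            int length = in.readInt();        // toy length-prefixed framing
            byte[] frame = new byte[length];
            in.readFully(frame);
            return frame;
        } catch (EOFException e) {
            return null;                      // end of stream
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```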
One answer might be three: break each library up into an interface part and an implementation part, like the interface and lib directories from my C++ example. Programs should only depend on libraries and interfaces, and library implementations should only depend on other libraries’ interfaces, never on their implementations. That sounds plausible to me, both in terms of being achievable and in terms of being a good idea, but I’m not sure that it’s enough to get us to three levels of dependencies. The problem is that one library’s interface can reasonably depend on another library’s interface: e.g. if you happen to be writing a video server, you might reasonably have some generic MPEG classes that are referred to in the interfaces of multiple libraries, which means that they themselves have to be in a separate library. So that’s four levels of dependency; I can even imagine that it would be hard to avoid introducing a fifth level, because you might have some fundamental data structures (e.g. some sort of widely used numeric abstraction or custom collection) that show up in the interfaces of shared classes that are a bit more domain specific.
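Here’s a sketch of that layering, sticking with made-up video-server names: the shared MPEG type gets its own interface-only library precisely because two unrelated libraries’ interfaces both mention it.

```java
// mpeg-api (shared interface layer, hypothetical): types referred to by
// several other libraries' interfaces, hence a library of their own.
package com.example.mpeg;

public final class MpegFrame {
    private final byte[] payload;

    public MpegFrame(byte[] payload) {
        this.payload = payload;
    }

    public byte[] payload() {
        return payload;
    }
}
```

```java
// streaming-api (one library's interface layer): depends on mpeg-api.
package com.example.streaming;

public interface StreamEncoder {
    com.example.mpeg.MpegFrame encode(byte[] rawVideo);
}
```

```java
// indexing-api (another library's interface layer): also depends on
// mpeg-api, but neither interface library depends on the other.
package com.example.indexing;

public interface FrameIndex {
    void record(long byteOffset, com.example.mpeg.MpegFrame frame);
}
```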
So: five levels of dependency. That doesn’t sound like a very good rallying cry, does it? Still, at least it’s a stake in the ground, and having those five layers be well defined (programs, library implementations, library interfaces, shared interfaces, and a handful of core interfaces) is something.
Is even that achievable, though? C++ header files define a few different sorts of concepts. There are abstract interfaces; these are a lot like Java interfaces. There are ADTs (abstract data types); these belong in my interface layer above, but contain actual implementation as well. And there are methods that you call without having an object instance at hand: static methods, but also constructors.
What do you do about this latter class of functions? In C-family languages, the linker happily resolves them for you; what’s the analogue for that in Java? I guess the answer there is: if you have a static method (including class constructors) associated with a class that’s not plausibly an ADT, then you have to make sure that there’s an extra layer of indirection involved to let you do the wiring at the program level instead of at the cross-library level. In other words, you have to get serious about dependency injection: library A shouldn’t depend on the implementation of library B, even to do the basic setup for that library. Instead, library B should abstract all those static methods out into an interface, library A should be written in terms of that interface, and every program that uses library A should pass an implementation of that interface to library A. (It can get that implementation from a static method in library B, but the point here is that it’s the program layer that’s grabbing the static method from library B, not library A’s implementation layer.)
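A minimal sketch of that wiring in Java, again with made-up names (and reusing the hypothetical StreamEncoder from the earlier sketch): the program layer is the only code that names library B’s concrete classes.

```java
// In library B's interface layer (streaming-api): a factory interface
// standing in for what would otherwise be static methods or constructors.
package com.example.streaming;

public interface EncoderFactory {
    StreamEncoder create(int bitsPerSecond);
}
```

```java
// In library A's implementation layer (recorder-impl): written purely
// against the interface; the factory is injected, never looked up.
package com.example.recorder;

import com.example.streaming.EncoderFactory;
import com.example.streaming.StreamEncoder;

public class Recorder {
    private final StreamEncoder encoder;

    public Recorder(EncoderFactory factory) {
        this.encoder = factory.create(4000000);  // 4 Mbit/s, say
    }

    public com.example.mpeg.MpegFrame recordOne(byte[] rawVideo) {
        return encoder.encode(rawVideo);
    }
}
```

```java
// In the program layer: the one place that touches library B's concrete
// implementation (Mpeg2EncoderFactory is hypothetical).
package com.example.recorderprog;

import com.example.mpeg2.Mpeg2EncoderFactory;
import com.example.recorder.Recorder;

public class RecorderMain {
    public static void main(String[] args) {
        Recorder recorder = new Recorder(Mpeg2EncoderFactory.instance());
        // ... drive the recorder ...
    }
}
```

Note that the dependency chain stays within the layering sketched above: RecorderMain depends on recorder-impl, which depends only on streaming-api, which depends only on mpeg-api.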
At first blush, I think that should work. Actually, it even seems plausible to me that it would lead to better structured programs than you would get if you didn’t worry about rigorously adhering to such standards. Five layers of dependencies: it’s good for your software! (Not the best rallying cry I’ve ever heard…)
Maven
The other aspect of our setup at work that’s new to me is Maven. I’m really not sure what I think about Maven yet: I’m willing to believe it’s better than Ant, but I’m also willing to believe that that is not much of an accomplishment. I like its emphasis on convention, and I like the way it’s good at pulling in known versions of external resources (your Scala compiler, your JUnit version, etc.). But I’m less sold on its insistence on separating every artifact into a separate project: what happened to Recursive Make Considered Harmful? And it boggles my mind that, in this day and age, a build system would be written that only supports parallel builds as an experimental feature in its as-yet-unreleased third major version! We’ve known for most of a decade now that many cores are the way forward, yet I’m using a build system that can’t even keep two cores on my machine busy? That is ridiculous.
Another part of the picture is Hudson: our build system at work is set up so that you’re encouraged to have Hudson build artifacts that you’re not actively touching. (And Hudson is capable of doing those builds in parallel. This perhaps makes it possible to live with the aforementioned Maven limitation, but in no sense excuses it: if I’m starting up a project, I absolutely should not be required to set up a continuous integration server just to get non-braindead compilation performance. I realize that the Java world is not known for its fondness for lightweight systems, but that is… Sigh. I will stop ranting now.) I haven’t yet figured out what the Maven way is here: is Maven perfectly happy for you to have all your build artifacts under your own control, is Maven actively hostile to you having all your build artifacts under your own control, or is Maven agnostic about that question?
This all also bears on the question of source code repositories that I mentioned above. At first, once I got over my surprise, I kind of liked the idea of having each module in its own git repository: let’s take the idea of separate modules seriously. But once we start breaking up one conceptual module into an interface part and a library part, I’m less happy about having them in separate repositories: you’ll want to change an interface and the code that implements that interface in lockstep, which means that changes to them should be part of a single commit. And I’m also not sold on this idea that we should have Hudson in charge of building all these different artifacts for us: maybe I’m conservative, but I still like the idea of checking out a known version of the entire project’s source code, doing a build of it, and getting a reliable, reproducible result.
Lots of interesting questions to think about, lots that I still have to learn. (And that’s without going into my discomfort about dependency-info generation for Javaish languages!) Probably the best thing for me to do would be to start a project at home: I should come up with something to write (probably purely in Scala; I don’t think that mixing Java in will either make me happier or clarify the situation) that I can plausibly break out into multiple libraries, so I can really grapple with these dependency issues.
I’m not a fan of either Ant or Maven at this point, having had to recently interact with both.
The Apache libraries are a good example of the interface/implementation split in Java. But there it often seems like the same class gets written 4 or 5 times, not just twice.
3/28/2011 @ 11:03 pm
The big pluses for me of Maven over Ant were the aforementioned ability to pull down external dependencies and the ability to peg a given build to a particular combination of dependencies, each with a predefined version attached to it. I also didn’t have to redefine these dependencies in project B if it depended on project A, which had already defined said dependencies.
Beyond that, I absolutely think you should be able to build via Maven (or Ant, or your favorite build manager) from the command line and have it work readily, reproducibly, and reliably. If this isn’t the case, something’s wrong with either your code tree (too much cruft? time to clean up your pom.xml files and/or your repository), your build environment (not enough RAM), or the build process itself (performing unnecessary intermediate steps).
Hudson shouldn’t replace this; what it can do is take away a lot of the headaches that come from building the entire tree, especially in a large codebase, and handle other tasks like running the suite of unit tests against the entire code tree (you’ve got these, I sincerely hope), creating and archiving versioned artifacts, and deploying to particular locations when you’re ready to do so.
3/29/2011 @ 12:26 am
Thanks for the Apache pointer, I’ll give it a look.
And thanks for the Maven advice. I guess what I don’t understand about the Maven/Hudson combo is this: should Hudson be publishing versions of the artifacts it generates? If so, how do you control whether your local build gets the Hudson version of the artifacts or the local version? I can imagine times when I’d want either of those outcomes, but if they’re both available, I want to reliably be able to choose between them.
I’ll have to look more closely at our build scripts, because we have scripts that claim to do both of those; I’m just not yet familiar with the details of either how they work or what exactly mvn install does. (In particular, how Maven decides when to use an artifact you’ve installed versus an external artifact.) Also, for better or for worse, our top-level build is controlled via a shell script rather than via Maven; I don’t have a good feel for why we made that decision and what the pros and cons of it are.
3/29/2011 @ 2:51 am
Hudson can publish versions of the artifacts it generates but it doesn’t have to; you can manage that entirely through Maven or some other process if you prefer.
To control what your local build gets, check out Maven’s concept of profiles, where you can define from the command line which preconfigured set of pom/environment variables to use during a Maven run.
‘mvn install’ is mainly used to add artifacts to the local repository. If, for example, you introduced a dependency on an artifact that’s not publicly available, you’d have to manually install it. If the artifact is publicly available but you want to install from a particular source, you can do that as well.
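For example (coordinates made up), a manual install of a local jar looks something like this:

```
mvn install:install-file -Dfile=foo-1.0.jar \
    -DgroupId=com.example -DartifactId=foo \
    -Dversion=1.0 -Dpackaging=jar
```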
If you aren’t already using it, look into running Nexus from Sonatype (or a similar repository manager), which does a couple of nice things. First, it serves as a repository proxy, storing copies of external dependencies so you don’t have to trawl the web for them every time you need them; this significantly speeds up your builds under certain circumstances, particularly the first time (before you’ve downloaded the dependencies). Second, it can store versioned copies of locally built artifacts for general use in your build. E.g., suppose there’s some complicated process to build $foo. Everyone needs to be able to build $bar, which depends on $foo. Not everyone needs to or wants to build $foo. Maven checks Nexus (or Hudson, or a manually defined location, etc., depending on how you configure Maven) for the version of the $foo library it needs (previously uploaded by another developer), happily downloads it, and the developer in question proceeds to build $bar quickly without having to jump through a lot of hoops such as ‘mvn install’.
To break it down again:
* Maven: build manager that does the nitty-gritty operations of actually building your deployable artifacts
* Hudson: continuous integration server that can also manage builds (using Maven or Ant or make or shell scripts or …), store versioned artifacts, and deploy according to instructions
* Nexus: repository manager that serves as a proxy to the outside world and an easy place to store internal or external versioned artifacts for local builds
Finally, a couple of guesses as to why you’re using shell scripts at the top level:
1. Legacy code.
2. There’s something that seemed too complicated to be worth figuring out how to do in Maven (e.g., sometimes within Maven you use a callout to Ant for certain tasks, but it’s not a requirement and there’s often a way to do it in Maven anyway).
3/30/2011 @ 12:58 am
Thanks for the explanation / suggestions, I really appreciate it!
3/30/2011 @ 10:00 am