02/14/13 by Fabian Lange
In this post I want to give you an insight into the architecture of CenterDevice, a document management and collaboration tool for the enterprise hosted in our own cloud datacenter in Germany. CenterDevice is a startup of codecentric, with a few codecentrics working full or part time on it.
Let me start with a screenshot from our AppDynamics monitoring.
The green boxes are all Java 7 based services monitored with AppDynamics. The number “4” within the blue circle indicates that at the moment there are currently 4 instances each running. This number can change when services are scaled up and down. We currently run a mix of KVM virtualized and non-virtualized CentOS instances on high end Dell Machines with a total of about 148 cores, 600GB memory and 150TB disc space.
The lines connecting the boxes indicate call directions, their average response times and how often they were made in that time interval. What is really great about AppDynamics is that the services, and how they talk to each other is automatically detected.
In the middle of the picture is the heart of our architecture “tomcat-rest”. When you talk to
you arrive there (and see a message that tells you you are missing our oAuth authorization – we are planning to publish our API soon). It hosts all of our REST services, which are implemented with Jersey. We chose Jersey, because it is very easy to implement REST Services with it, is a proven piece of software and also easily testable in unit and integration tests as outlined in this excellent blog post by Michael Lex about Writing lightweight REST integration tests with the Jersey Test Framework.
Services accessing any kind of data use our MongoDB backend, visualized by AppDynamics as blue boxes on the top. As a scalable database with easy access and painless schema changes (because there is no schema), it suits our performance requirements, as well as the changeability we require for a young product with frequent releases. If you want to know more about MongoDB I recommend reading Thomas Jaspers MongoDB tutorial. We talk to it via the Java Driver, have implemented replicasets and shard to multiple nodes. The major challenge here is to accept the eventual consistency. Even code which writes data might not yet be able to read it again in the next line.
Uploaded documents are stored on the Gluster backed XFS file system. All data is encrypted using a per user 256 AES key before it is persisted on the file system. Gluster then takes care of replicating the data to all servers.
For clients, we developed two native clients: CenterDevice for iPad and CenterDevice for Android Phone. The reason for native apps is that some emulated or cross compiled platforms have plenty of issues. Mark Zuckerberg also recently had a widely acknowledged talk about the problems HTML5 mobile clients have at the moment. And because iOS dominates the tablet market and Android rules the phone market, that is what we started with.
Note that our screenshots show cute baby animals, instead of boring documents, because they are more adorable.
Besides our native iPad and Android Phone application, and a few third party clients, the main user of the REST Server is our Vaadin 6 based web client. In fact we have two of them
Both are logically hosted on “tomcat-centerdevice”.
Communication between the web application servers and the rest server is only unidirectional, but sometimes we want to send back notifications (like new documents somebody just shared with you) to the webserver.
That is where RabbitMQ comes into play. The application map from AppDynamics shows our 3 usages of messaging:
- Sending Notifications from REST to Web.
- Sending requests to send e-Mails (currently sent from and consumed by the REST Server).
- Sending processing requests from REST server to doc-server.
RabbitMQ is set up using HA queues and worked flawlessly so far. Tobias Trelle wrote an introduction into RabbitMQ with Spring, which provides more background about Rabbit MQ and AMQP.
The document processing done by “doc-server”, which has multiple tasks depending on the type of input document:
- Generating PDF representations
- Generating preview images for different sizes
- Performing fulltext extraction
- Performing OCR
- Getting page count
- Obtaining additional metadata
- Detecting language
There are basically two types of documents that we use as basis: PDF and Images.
If we are getting any other format, we use Libre Office to convert it to PDF, or Imagemagick to convert it to images.
ImageMagick is then also used to generate preview images in various sizes.
Depending on the type of document we can use Apache Tika to get the fulltext from the document. For documents where Tika cannot find a fulltext, we resort to OCR running on “tomcat-ocr”. OCR will be done using tesseract. Further metadata is extracted using Tika and custom detectors.
Search capabilities are provided by Apache Solr 3, running on “tomcat-solr”. Solr is running in Master Slave mode. One neat feature of CenterDevice is that is performs super fast search on everything we can extract from documents or their metadata.
As you can see in the screenshot, the overall performance is quite nice. However it largely depends on the documents uploaded for individual processing requests. AppDynamics automatically learns the normal behavior and alerts us when there is deviation. For uploads and downloads however, we turned this learning off. Depending on the document and the clients connectivity, there is just nothing like a normal time a client takes to upload or download a document. We however gather diagnostic data for extraordinary slow up/downloads to investigate in case of customer complaints. A similar story applies to the document processing. While in most cases also the learned baselines are good, sometimes they do not match. That is why in the screenshot the connection to our external processing services are red. Some heavy document processing was going on at the time the screenshot was taken. When performance degrades, AppDynamics captures important metrics and provides us code level insight into the root cause. So far the most issues were typical like too many queries, too much API calls, inefficient indicies etc.
We always have ideas for improving stuff and are moving fast, so the architecture will change. While currently we do not have a pressing need, these changes will be most likely coming in future:
* Introduce WebSession replication for failover using memcache or redis. (currently we loose web session data on failure (not happened so far), and deployments (happening during nights when no sessions are alive)). The major challenge so far seems to get it integrated into the Servlet 3 async pushing we do.
* Switch from Gluster to Hadoop File System (We were bitten by lots of Gluster problems, like running on ext-4 64bit, which we now changed to xfs)
* Switch from Solr 3 to Elastic Search (the master-slave failover just does not work as nicely and does not scale, plus changes to Solr 3 need downtime)
* Add a reverse proxy / loadbalancer layer to perform green / blue deployments, redirect specific users to certain versions of the software.
Join the team!
I hope you enjoyed the overview on architecture and software we use. If you are interested in helping us build and grow this stack, we have good news for you: CenterDevice and codecentric are hiring! We are especially looking for a dev-opsy Linux and Hardware enthusiast to build out what I described above. Get in touch!