The CenterDevice Cloud Architecture - codecentric AG Blog

Original source (blog.codecentric.de)


The CenterDevice Cloud Architecture

02/14/13


In this post I want to give you an insight into the architecture of CenterDevice, a document management and collaboration tool for the enterprise, hosted in our own cloud datacenter in Germany. CenterDevice is a startup of codecentric, with a few codecentrics working full- or part-time on it.

Let me start with a screenshot from our AppDynamics monitoring.
[Image: AppDynamics application map of the CenterDevice services]

The green boxes are all Java 7 based services monitored with AppDynamics. The number “4” in the blue circle indicates that four instances of each service are currently running. This number changes when services are scaled up or down. We currently run a mix of KVM-virtualized and non-virtualized CentOS instances on high-end Dell machines with a total of about 148 cores, 600 GB of memory, and 150 TB of disk space.
The lines connecting the boxes indicate call directions, average response times, and how often calls were made in that time interval. What is really great about AppDynamics is that the services, and how they talk to each other, are detected automatically.

In the middle of the picture is the heart of our architecture, “tomcat-rest”. When you talk to

api.centerdevice.de/v1

you arrive there (and see a message telling you that you are missing our OAuth authorization – we are planning to publish our API soon). It hosts all of our REST services, which are implemented with Jersey. We chose Jersey because it makes implementing REST services easy, is a proven piece of software, and is easily testable in unit and integration tests, as outlined in Michael Lex's excellent blog post about writing lightweight REST integration tests with the Jersey Test Framework.

Services accessing any kind of data use our MongoDB backend, visualized by AppDynamics as the blue boxes at the top. As a scalable database with easy access and painless schema changes (because there is no fixed schema), it suits our performance requirements as well as the changeability we need for a young product with frequent releases. If you want to know more about MongoDB, I recommend reading Thomas Jaspers' MongoDB tutorial. We talk to it via the Java driver, run replica sets, and shard across multiple nodes. The major challenge here is accepting the eventual consistency: even code that just wrote data might not be able to read it back on the very next line.
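To illustrate that pitfall, here is a toy, in-process model of replication lag. The class and method names are invented purely for illustration; the real system talks to MongoDB via the Java driver, where the same effect shows up when reads are served by a secondary that has not yet caught up.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the read-after-write pitfall with replica sets: writes go
// to the primary, while reads may be served by a secondary that has not
// yet replicated the write.
public class ReplicaLagDemo {

    private final Map<String, String> primary = new HashMap<>();
    private final Map<String, String> secondary = new HashMap<>();

    public void write(String key, String value) {
        primary.put(key, value); // replication to the secondary is asynchronous
    }

    public String readFromSecondary(String key) {
        return secondary.get(key); // may lag behind the primary
    }

    public void replicate() {
        secondary.putAll(primary); // happens eventually, not immediately
    }

    public static void main(String[] args) {
        ReplicaLagDemo db = new ReplicaLagDemo();
        db.write("doc-42", "uploaded");
        // The very next read can miss the write...
        System.out.println(db.readFromSecondary("doc-42")); // null
        db.replicate();
        // ...until replication has caught up.
        System.out.println(db.readFromSecondary("doc-42")); // uploaded
    }
}
```

Code that must read its own writes has to target the primary (or, in later MongoDB versions, use an appropriate read preference).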

Uploaded documents are stored on a Gluster-backed XFS file system. All data is encrypted with a per-user 256-bit AES key before it is persisted on the file system. Gluster then takes care of replicating the data to all servers.
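As a rough sketch of what encrypting a document before persisting could look like with the JDK's built-in crypto: the post only states that a per-user 256-bit AES key is used, so the cipher mode, IV handling, and key management below are assumptions, not CenterDevice's actual implementation.

```java
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

// Minimal sketch of per-user AES-256 encryption before a document hits
// the file system. Key storage and cipher mode are assumptions.
public class DocumentCrypto {

    // Each user would have their own key; here we just generate one.
    static SecretKey newUserKey() throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(256); // 256-bit key, as mentioned in the post
        return kg.generateKey();
    }

    static byte[] encrypt(SecretKey key, byte[] iv, byte[] plain) throws Exception {
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
        return c.doFinal(plain);
    }

    static byte[] decrypt(SecretKey key, byte[] iv, byte[] enc) throws Exception {
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
        return c.doFinal(enc);
    }

    public static void main(String[] args) throws Exception {
        byte[] iv = new byte[16];
        new SecureRandom().nextBytes(iv); // a fresh IV per document
        SecretKey key = newUserKey();
        byte[] doc = "uploaded document bytes".getBytes("UTF-8");
        byte[] stored = encrypt(key, iv, doc); // this is what Gluster replicates
        System.out.println(new String(decrypt(key, iv, stored), "UTF-8"));
    }
}
```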

For clients, we developed two native apps: CenterDevice for iPad and CenterDevice for Android phones. The reason for going native is that emulated or cross-compiled platforms still have plenty of issues; Mark Zuckerberg also recently gave a widely noted talk about the problems HTML5 mobile clients have at the moment. And because iOS dominates the tablet market and Android rules the phone market, that is what we started with.

[Image: CenterDevice mobile app screenshot]

Note that our screenshots show cute baby animals instead of boring documents, because they are more adorable.
Besides our native iPad and Android applications, and a few third-party clients, the main consumer of the REST server is our Vaadin 6 based web client. In fact we have two of them:

app.centerdevice.de

and

public.centerdevice.de

Both are logically hosted on “tomcat-centerdevice”.
Vaadin, as a framework, allows us to implement complex rich web interfaces easily, using SWT-style Java code. It compiles to GWT JavaScript, which is then delivered to the browser. The bulk of the work can be done in Vaadin itself, which we combine with CDI Utils, a plugin that makes it easy to implement the MVP pattern using Weld as the CDI library. Developing complex components, however, sometimes requires writing GWT widgets yourself, which is not that easy. For copy-to-clipboard functionality, Henning Treu wrote a copy-to-clipboard Vaadin add-on, which we open sourced.

[Image: CenterDevice web client screenshot]

Communication between the web application servers and the REST server is unidirectional, but sometimes we want to push notifications (like new documents somebody just shared with you) back to the web server.

That is where RabbitMQ comes into play. The application map from AppDynamics shows our three usages of messaging:

  • Sending Notifications from REST to Web.
  • Sending requests to send e-mails (currently sent from and consumed by the REST server).
  • Sending processing requests from REST server to doc-server.

RabbitMQ is set up with HA queues and has worked flawlessly so far. Tobias Trelle wrote an introduction to RabbitMQ with Spring, which provides more background on RabbitMQ and AMQP.
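The third usage, processing requests, follows the classic work-queue pattern: the REST server publishes a message per uploaded document, and doc-server instances compete to consume them. The sketch below is a self-contained, in-process analogue of that pattern using a `BlockingQueue`; the real system runs it over AMQP with RabbitMQ HA queues, and all class and field names here are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// In-process analogue of the work queue between the REST server and the
// doc-server: producers enqueue processing requests, competing consumers
// take them off the queue one by one.
public class ProcessingQueue {

    public static class ProcessingRequest {
        public final String documentId;
        public ProcessingRequest(String documentId) { this.documentId = documentId; }
    }

    private final BlockingQueue<ProcessingRequest> queue = new LinkedBlockingQueue<>();

    // Called by the producer side (the REST server) after an upload.
    public void publish(String documentId) {
        queue.add(new ProcessingRequest(documentId));
    }

    // Called by a consumer worker loop (a doc-server); blocks until work arrives.
    public ProcessingRequest take() throws InterruptedException {
        return queue.take();
    }

    public static void main(String[] args) throws InterruptedException {
        ProcessingQueue q = new ProcessingQueue();
        q.publish("doc-1");
        q.publish("doc-2");
        List<String> processed = new ArrayList<>();
        processed.add(q.take().documentId);
        processed.add(q.take().documentId);
        System.out.println(processed);
    }
}
```

The point of the broker is exactly this decoupling: the REST server does not need to know how many doc-server workers exist, and workers can be scaled up and down independently.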

The document processing is done by “doc-server”, which has multiple tasks depending on the type of input document:

  • Generating PDF representations
  • Generating preview images for different sizes
  • Performing fulltext extraction
  • Performing OCR
  • Getting page count
  • Obtaining additional metadata
  • Detecting language

There are basically two types of documents that we use as a basis: PDFs and images.
If we get any other format, we use LibreOffice to convert it to PDF, or ImageMagick to convert it to images.
ImageMagick is then also used to generate preview images in various sizes.
Depending on the type of document, we can use Apache Tika to extract the full text. For documents where Tika cannot find any text, we resort to OCR running on “tomcat-ocr”; OCR is done using Tesseract. Further metadata is extracted using Tika and custom detectors.
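A minimal sketch of how the doc-server might pick an external converter per input type: LibreOffice for office formats, ImageMagick for images. The command-line flags shown are typical invocations of those tools, not taken from the post, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Dispatches an uploaded file to the right external conversion tool
// based on its extension. The returned list could be fed to a
// ProcessBuilder; null means the file is already in the target format.
public class ConverterDispatch {

    static List<String> conversionCommand(String filename) {
        String lower = filename.toLowerCase();
        if (lower.endsWith(".pdf")) {
            return null; // already a PDF, nothing to convert
        }
        if (lower.endsWith(".png") || lower.endsWith(".jpg") || lower.endsWith(".tif")) {
            // ImageMagick: render the image into a PDF representation
            return Arrays.asList("convert", filename, filename + ".pdf");
        }
        // Everything else goes through LibreOffice in headless mode
        return Arrays.asList("soffice", "--headless", "--convert-to", "pdf", filename);
    }

    public static void main(String[] args) {
        System.out.println(conversionCommand("report.docx"));
        System.out.println(conversionCommand("scan.png"));
    }
}
```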

Search capabilities are provided by Apache Solr 3, running on “tomcat-solr” in master-slave mode. One neat feature of CenterDevice is that it performs super fast search on everything we can extract from documents or their metadata.

Performance

As you can see in the screenshot, the overall performance is quite nice. However, it largely depends on the documents uploaded for individual processing requests. AppDynamics automatically learns the normal behavior and alerts us when there is a deviation. For uploads and downloads, though, we turned this learning off: depending on the document and the client's connectivity, there is simply no such thing as a normal time for a client to upload or download a document. We do, however, gather diagnostic data for extraordinarily slow transfers so we can investigate in case of customer complaints.

A similar story applies to the document processing. While the learned baselines are good in most cases, sometimes they do not match; that is why the connections to our external processing services are red in the screenshot. Some heavy document processing was going on at the time the screenshot was taken. When performance degrades, AppDynamics captures important metrics and gives us code-level insight into the root cause. So far most issues have been the typical ones: too many queries, too many API calls, inefficient indices, etc.

ToDos

We always have ideas for improving things and are moving fast, so the architecture will change. While we currently have no pressing need, these changes will most likely come in the future:

* Introduce web session replication for failover using memcached or Redis. (Currently we lose web session data on failure, which has not happened so far, and on deployments, which happen during nights when no sessions are alive.) The major challenge so far seems to be integrating it with the Servlet 3 async pushing we do.
* Switch from Gluster to the Hadoop File System. (We were bitten by a lot of Gluster problems, like running on 64-bit ext4, which we have since changed to XFS.)
* Switch from Solr 3 to Elasticsearch. (The master-slave failover just does not work as nicely and does not scale, plus changes to Solr 3 require downtime.)
* Add a reverse proxy / load balancer layer to perform blue/green deployments and redirect specific users to certain versions of the software.

Join the team!

I hope you enjoyed this overview of the architecture and software we use. If you are interested in helping us build and grow this stack, we have good news for you: CenterDevice and codecentric are hiring! We are especially looking for a dev-opsy Linux and hardware enthusiast to build out what I described above. Get in touch!

Comments

  • 16 February 2013 by joschi

    Regarding the issues you’ve had with GlusterFS, have you taken a look at Ceph/CephFS as an alternative distributed filesystem? HDFS seems like the wrong hammer for the problem at hand to me.

    Also, what has been your experience with tesseract so far? How does it compare to other solutions?

  • Hi Joschi,
    HDFS is a hammer, but it would fit other plans we have. Thanks for suggesting Ceph. We did originally evaluate it; I was not involved, so perhaps we need to reconsider it.

    Tesseract works really nicely. We had a minor issue with the training data for rotation, but the rotation feature in particular is really neat. We used Cuneiform before, which could not rotate, and we feel the OCR works better with Tesseract.
    OCR, however, is not really black and white: some tools handle some documents better than others. We also evaluated ABBYY, which also produced nice results.
    The big pro for Tesseract is that it is free, backed by Google, and actively developed.

    Fabian

  • 16 February 2013 by joschi

    Thanks for elaborating!

  • Hey,

    nice article. I've heard about CenterDevice from a friend. A thing that really surprises me is the use of Vaadin.

    I sometimes don’t understand why people use frameworks like these. We’ve had a lot of pain using JSF which essentially does the same. It glues together both back and frontend. The actual developer doesn’t need to write any Javascript etc. Vaadin seems to go one step further and also eliminates the use of writing HTML and CSS.

    But is it really a good idea to write something in a language which gets dynamically translated into another? Wouldn't it be better to write the frontend in plain old JavaScript/HTML/CSS?

    I guess that’s something interesting to discuss about.

    In my previous job, I wrote a complete frontend in JavaScript for a big ECM system:
    http://www.youtube.com/watch?v=HHLM_Ucc6WA

    I’ve tested a few different frameworks and finally I’ve decided to use Sencha and plain old JS which turns out to be one of the best ideas…

    I’ve been able to easily change the behaviour of components and had full control over what happened.. Image (Asset 19/20) alt=

    • Hi Christian,
      unfortunately, you make statements without understanding what you are talking about.

      • No, Vaadin does not do the same as JSF.
      • No, Vaadin does not eliminate writing CSS.
      • No, Vaadin is not dynamically interpreted.

      While I sometimes consider doing a server-less implementation, one will always end up using a framework (like you did when you chose Sencha). The reason is that certain components, like a table, just eat away too much development time.

      When looking at Java frameworks which let you build RIAs/SPAs, you basically have two choices: Vaadin/GWT or Eclipse RAP.
      If you accept a somewhat more page-oriented approach, then Wicket, Tapestry, or Spring MVC might be good options.

      In the end the choice is yours. But if you are only able to say what is (or what you claim to be) bad about others, and are not able to say what is good and bad about your own solution, you will always end up with a suboptimal solution.

      Perhaps you want to rephrase your comment, and I will tell you why I consider the current setup to be good.

      Fabian
