Jay Taylor's notesback to listing index
The CenterDevice Cloud Architecture - codecentric AG Blog[web search]
The CenterDevice Cloud Architecture
02/14/13 by Fabian Lange
In this post I want to give you an insight into the architecture of CenterDevice, a document management and collaboration tool for the enterprise hosted in our own cloud datacenter in Germany. CenterDevice is a startup of codecentric, with a few codecentrics working full or part time on it.
Let me start with a screenshot from our AppDynamics monitoring.
The green boxes are all Java 7 based services monitored with AppDynamics. The number “4” within the blue circle indicates that at the moment there are currently 4 instances each running. This number can change when services are scaled up and down. We currently run a mix of KVM virtualized and non-virtualized CentOS instances on high end Dell Machines with a total of about 148 cores, 600GB memory and 150TB disc space.
The lines connecting the boxes indicate call directions, their average response times and how often they were made in that time interval. What is really great about AppDynamics is that the services, and how they talk to each other is automatically detected.
In the middle of the picture is the heart of our architecture “tomcat-rest”. When you talk to
you arrive there (and see a message that tells you you are missing our oAuth authorization – we are planning to publish our API soon). It hosts all of our REST services, which are implemented with Jersey. We chose Jersey, because it is very easy to implement REST Services with it, is a proven piece of software and also easily testable in unit and integration tests as outlined in this excellent blog post by Michael Lex about Writing lightweight REST integration tests with the Jersey Test Framework.
Services accessing any kind of data use our MongoDB backend, visualized by AppDynamics as blue boxes on the top. As a scalable database with easy access and painless schema changes (because there is no schema), it suits our performance requirements, as well as the changeability we require for a young product with frequent releases. If you want to know more about MongoDB I recommend reading Thomas Jaspers MongoDB tutorial. We talk to it via the Java Driver, have implemented replicasets and shard to multiple nodes. The major challenge here is to accept the eventual consistency. Even code which writes data might not yet be able to read it again in the next line.
Uploaded documents are stored on the Gluster backed XFS file system. All data is encrypted using a per user 256 AES key before it is persisted on the file system. Gluster then takes care of replicating the data to all servers.
For clients, we developed two native clients: CenterDevice for iPad and CenterDevice for Android Phone. The reason for native apps is that some emulated or cross compiled platforms have plenty of issues. Mark Zuckerberg also recently had a widely acknowledged talk about the problems HTML5 mobile clients have at the moment. And because iOS dominates the tablet market and Android rules the phone market, that is what we started with.
Note that our screenshots show cute baby animals, instead of boring documents, because they are more adorable.
Besides our native iPad and Android Phone application, and a few third party clients, the main user of the REST Server is our Vaadin 6 based web client. In fact we have two of them
Both are logically hosted on “tomcat-centerdevice”.
Communication between the web application servers and the rest server is only unidirectional, but sometimes we want to send back notifications (like new documents somebody just shared with you) to the webserver.
That is where RabbitMQ comes into play. The application map from AppDynamics shows our 3 usages of messaging:
- Sending Notifications from REST to Web.
- Sending requests to send e-Mails (currently sent from and consumed by the REST Server).
- Sending processing requests from REST server to doc-server.
RabbitMQ is set up using HA queues and worked flawlessly so far. Tobias Trelle wrote an introduction into RabbitMQ with Spring, which provides more background about Rabbit MQ and AMQP.
The document processing done by “doc-server”, which has multiple tasks depending on the type of input document:
- Generating PDF representations
- Generating preview images for different sizes
- Performing fulltext extraction
- Performing OCR
- Getting page count
- Obtaining additional metadata
- Detecting language
There are basically two types of documents that we use as basis: PDF and Images.
If we are getting any other format, we use Libre Office to convert it to PDF, or Imagemagick to convert it to images.
ImageMagick is then also used to generate preview images in various sizes.
Depending on the type of document we can use Apache Tika to get the fulltext from the document. For documents where Tika cannot find a fulltext, we resort to OCR running on “tomcat-ocr”. OCR will be done using tesseract. Further metadata is extracted using Tika and custom detectors.
Search capabilities are provided by Apache Solr 3, running on “tomcat-solr”. Solr is running in Master Slave mode. One neat feature of CenterDevice is that is performs super fast search on everything we can extract from documents or their metadata.
As you can see in the screenshot, the overall performance is quite nice. However it largely depends on the documents uploaded for individual processing requests. AppDynamics automatically learns the normal behavior and alerts us when there is deviation. For uploads and downloads however, we turned this learning off. Depending on the document and the clients connectivity, there is just nothing like a normal time a client takes to upload or download a document. We however gather diagnostic data for extraordinary slow up/downloads to investigate in case of customer complaints. A similar story applies to the document processing. While in most cases also the learned baselines are good, sometimes they do not match. That is why in the screenshot the connection to our external processing services are red. Some heavy document processing was going on at the time the screenshot was taken. When performance degrades, AppDynamics captures important metrics and provides us code level insight into the root cause. So far the most issues were typical like too many queries, too much API calls, inefficient indicies etc.
We always have ideas for improving stuff and are moving fast, so the architecture will change. While currently we do not have a pressing need, these changes will be most likely coming in future:
* Introduce WebSession replication for failover using memcache or redis. (currently we loose web session data on failure (not happened so far), and deployments (happening during nights when no sessions are alive)). The major challenge so far seems to get it integrated into the Servlet 3 async pushing we do.
* Switch from Gluster to Hadoop File System (We were bitten by lots of Gluster problems, like running on ext-4 64bit, which we now changed to xfs)
* Switch from Solr 3 to Elastic Search (the master-slave failover just does not work as nicely and does not scale, plus changes to Solr 3 need downtime)
* Add a reverse proxy / loadbalancer layer to perform green / blue deployments, redirect specific users to certain versions of the software.
Join the team!
I hope you enjoyed the overview on architecture and software we use. If you are interested in helping us build and grow this stack, we have good news for you: CenterDevice and codecentric are hiring! We are especially looking for a dev-opsy Linux and Hardware enthusiast to build out what I described above. Get in touch!
Post by Fabian Lange
More content about Cloud
16. February 2013 von joschi
Regarding the issues you’ve had with GlusterFS, have you taken a look at Ceph/CephFS as an alternative distributed filesystem? HDFS seems like the wrong hammer for the problem at hand to me.
Also, what has been your experience with tesseract so far? How does it compare to other solutions?
HDFS is a hammer, but it would fit for other plans we have. Thanks for suggesting Ceph. We originally did evaluate it. I was not involved, perhaps we need to reconsider it.
tesseract works really nice. We had a minor issue with the training data for rotation, but especially the rotation feature is really neat. We used cuneiform before and it could not rotate, and we feel that the OCR is working better with tesseract.
OCR however is not really black and white. some handle some documents better than others. We also evaluated Abby, which also produced nice results.
The big pro for tesseract is that it is free, backed by Google and actively developed.
16. February 2013 von joschi
nice article. I’ve heard about centerDevice from a friend. A thing that really suprises me is the use of Vaadin.
I guess that’s something interesting to discuss about.
I’ve tested a few different frameworks and finally I’ve decided to use Sencha and plain old JS which turns out to be one of the best ideas…
I’ve been able to easily change the behaviour of components and had full control over what happened..
unfortunately, you make statements without understanding what you are talking about.
- No, Vaadin does not the same as JSF
- No, Vaadin does not eliminate writing CSS
- No, Vaadin is not dynamically interpreted.
While I consider doing a server-less implementation sometimes, one will always end up using a framework (like you did when you chose Sencha). The reason is that certain components, like a table, just eat away too much development time.
When looking at Java Frameworks which allow you building RIA / SPA, you have basically two choices: Vaadin/GWT or Eclipse RAP.
If you accept a bit more page oriented, then Wicket, Tapestry or Spring MVC might be good.
In the end the choice is yours. But if you are only able to say what is (or what you claim to be) bad about others, and are not able to say what is good and bad about your solution, you will always run a suboptimal solution.
Perhaps you want to rephrase your comment, and I will tell you why I consider the current setup to be good.
Notify me of followup comments via e-mail