Jay Taylor's notes

src-d/engine

Original source (github.com)

Tags: scala git spark source-code-analysis hdfs github.com

Clipped on: 2018-07-30

engine is a library for running scalable data retrieval pipelines that process any number of Git repositories for source code analysis. https://engine.sourced.tech/

README.md

engine

engine is a library for running scalable data retrieval pipelines that process any number of Git repositories for source code analysis.

It is written in Scala and built on top of Apache Spark to enable rapid construction of custom analysis pipelines and the processing of large numbers of Git repositories stored in HDFS in the Siva file format. It is accessible via both the Scala and Python Spark APIs, and is capable of running on large-scale distributed clusters.

The current implementation combines:

src-d/enry to detect the programming language of every file
bblfsh/client-scala to parse every file to UAST
src-d/siva-java for reading Siva files on the JVM
apache/spark to extend the DataFrame API
eclipse/jgit for working with Git .pack files

Quick-start

First, you need to download Apache Spark somewhere on your machine:

$ cd /tmp && wget "https://www.apache.org/dyn/mirrors/mirrors.cgi?action=download&filename=spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz" -O spark-2.2.1-bin-hadoop2.7.tgz

The Apache Software Foundation suggests the best mirror to download Spark from. If you want to check for a better option in your case, you can do it here.

Then you must extract Spark from the downloaded tar file:

$ tar -C ~/ -xvzf spark-2.2.1-bin-hadoop2.7.tgz

Binaries and scripts to run Spark are located in spark-2.2.1-bin-hadoop2.7/bin, so you should set SPARK_HOME to the extracted directory and add its bin directory to PATH. It's advised to add this to your shell profile:

$ export SPARK_HOME=$HOME/spark-2.2.1-bin-hadoop2.7
$ export PATH=$PATH:$SPARK_HOME/bin

Look up the latest engine version and substitute it for [version] in the following commands:

$ spark-shell --packages "tech.sourced:engine:[version]"

# or

$ pyspark --packages "tech.sourced:engine:[version]"

Run the bblfsh daemon. You can start it easily in a container by following its quick start guide.

If you run engine in a UNIX-like environment, you should set the LANG variable properly:

export LANG="en_US.UTF-8"

The rationale is that UNIX file systems do not store the encoding of each file name; names are just plain bytes. The Java file system API therefore relies on the LANG environment variable to decide which encoding to apply.

If the LANG variable is not set to a UTF-8 encoding, or is not set at all (in which case encoding is handled in the C locale), you may get an exception during engine execution similar to java.nio.file.InvalidPathException: Malformed input or input contains unmappable characters.
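
The check described above can be sketched in a few lines of Python, for instance to fail fast in a launcher script before Spark starts. This is only an illustration: lang_is_utf8 is a hypothetical helper, not part of engine, and it inspects LANG alone, whereas on POSIX systems LC_ALL and LC_CTYPE take precedence over LANG when they are set.

```python
import os


def lang_is_utf8(env=None):
    """Return True when LANG names a UTF-8 locale such as 'en_US.UTF-8'.

    Hypothetical helper for illustration; it only looks at LANG.
    """
    env = os.environ if env is None else env
    lang = env.get("LANG", "")
    # The codeset is the part after the dot, optionally followed by '@modifier'.
    codeset = lang.partition(".")[2].partition("@")[0]
    # Accept both spellings that are common in locale names: 'UTF-8' and 'utf8'.
    return codeset.lower().replace("-", "") == "utf8"
```

Calling lang_is_utf8() with no arguments inspects the current process environment; passing a dict makes the check easy to exercise without touching the real environment.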

Pre-requisites

Scala 2.11.x
Apache Spark Installation 2.2.x or 2.3.x
bblfsh >= 2.5.0: Used for UAST extraction

Python pre-requisites:

Python >= 3.4.x (engine is tested with Python 3.4, 3.5 and 3.6 and these are the supported versions, even if it might still work with previous ones)
libxml2-dev installed
python3-dev installed
g++ installed

Examples of engine usage

engine is available on Maven Central. To add it to your project as a dependency:

For projects managed by maven add the following to your pom.xml:

<dependency>
    <groupId>tech.sourced</groupId>
    <artifactId>engine</artifactId>
    <version>[version]</version>
</dependency>

For sbt managed projects add the dependency:

libraryDependencies += "tech.sourced" % "engine" % "[version]"

In both cases, replace [version] with the latest engine version.

Usage in applications as a dependency

The default jar published is a fatjar containing all the dependencies required by the engine. It's meant to be used directly as a jar or through --packages for Spark usage.

If you want to use it in an application and build a fatjar with it, you need to follow these steps to use what we call the "slim" jar:

With maven:

<dependency>
    <groupId>tech.sourced</groupId>
    <artifactId>engine</artifactId>
    <version>[version]</version>
    <classifier>slim</classifier>
</dependency>

Or (for sbt):

libraryDependencies += "tech.sourced" % "engine" % "[version]" % Compile classifier "slim"

If you run into problems with io.netty.versions.properties in sbt, you can add the following snippet to your build to solve it:

assemblyMergeStrategy in assembly := {
  case "META-INF/io.netty.versions.properties" => MergeStrategy.last
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}

pyspark

Local mode

Installing the Python wrappers is necessary to use engine from pyspark:

$ pip install sourced-engine

Then you should provide the engine's Maven coordinates to the pyspark shell:

$ $SPARK_HOME/bin/pyspark --packages "tech.sourced:engine:[version]"

Replace [version] with the latest engine version.

Cluster mode

Install engine wrappers as in local mode:

$ pip install -e sourced-engine

Then you should package the Python wrappers into a zip archive and pass it to pyspark. This is required to distribute the code among the nodes of the cluster.

$ zip -r ./sourced-engine.zip <path-to-installed-package>
$ $SPARK_HOME/bin/pyspark <same-args-as-local-plus> --py-files ./sourced-engine.zip
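
Where the zip command line tool is not available, the packaging step can be sketched with Python's standard zipfile module. This is a minimal sketch, not part of engine: zip_package is a hypothetical helper, and it assumes the archive should expose the package folder at its top level so that it can be imported from the zip on the worker nodes.

```python
import os
import zipfile


def zip_package(package_dir, archive_path):
    """Zip package_dir so that the package folder is the top-level entry.

    Hypothetical helper for illustration; archive_path should live
    outside package_dir so the archive does not include itself.
    """
    package_dir = os.path.abspath(package_dir)
    parent = os.path.dirname(package_dir)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                if name.endswith(".pyc"):
                    continue  # skip compiled bytecode caches
                full_path = os.path.join(root, name)
                # Store paths relative to the parent, e.g. 'pkg/sub/mod.py'.
                archive.write(full_path, os.path.relpath(full_path, parent))
    return archive_path
```

The resulting file can then be passed to --py-files in the same way as the archive produced by the zip command.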

pyspark API usage

Run pyspark as explained before to start using the engine, replacing [version] with the latest engine version:

$ $SPARK_HOME/bin/pyspark --packages "tech.sourced:engine:[version]"
Welcome to

   spark version 2.2.1

Using Python version 3.6.2 (default, Jul 20 2017 03:52:27)
SparkSession available as 'spark'.
>>> from sourced.engine import Engine
>>> engine = Engine(spark, '/path/to/siva/files', 'siva')
>>> engine.repositories.filter('id = "github.com/mingrammer/funmath.git"').references.filter("name = 'refs/heads/HEAD'").show()
+--------------------+---------------+--------------------+
|       repository_id|           name|                hash|
+--------------------+---------------+--------------------+
|github.com/mingra...|refs/heads/HEAD|290440b64a73f5c7e...|
+--------------------+---------------+--------------------+

Scala API usage

You must provide engine as a dependency in the following way, replacing [version] with the latest engine version:

$ spark-shell --packages "tech.sourced:engine:[version]"

To start using engine from the shell you must import everything inside the tech.sourced.engine package (or, if you prefer, just import Engine and EngineDataFrame classes):

scala> import tech.sourced.engine._
import tech.sourced.engine._

Now, you need to create an instance of Engine and give it the spark session and the path of the directory containing the siva files:

scala> val engine = Engine(spark, "/path/to/siva-files", "siva")

Then, you will be able to perform queries over the repositories:

scala> engine.getRepositories.filter('id === "github.com/mawag/faq-xiyoulinux").
     | getReferences.filter('name === "refs/heads/HEAD").
     | getAllReferenceCommits.filter('message.contains("Initial")).
     | select('repository_id, 'hash, 'message).
     | show

+--------------------+--------------------+--------------+
|       repository_id|                hash|       message|
+--------------------+--------------------+--------------+
|github.com/mawag/...|fff7062de8474d10a...|Initial commit|
+--------------------+--------------------+--------------+

Supported repository formats

As you might have seen, you need to provide the repository format you will be reading when you create the Engine instance. Although the documentation always uses the siva format, there are more repository formats available.

These are all the supported formats at the moment:

siva: rooted repositories packed in a single .siva file.
standard: regular git repositories with a .git folder, each in a folder of its own under the given repository path.
bare: git bare repositories, each in a folder of its own under the given repository path.

Processing local repositories with the engine

There are some design decisions that may surprise the user when processing local repositories, instead of siva files. This is the list of things you should take into account when doing so:

All local branches will belong to a repository whose id is file://$REPOSITORY_PATH. So, if you clone https://github.com/foo/bar.git at /home/foo/bar, you will see two repositories file:///home/foo/bar and github.com/foo/bar, even if you only have one.
Remote branches are transformed from refs/remote/$REMOTE_NAME/$BRANCH_NAME to refs/heads/$BRANCH_NAME as they will only belong to the repository id of their corresponding remote. So refs/remote/origin/HEAD becomes refs/heads/HEAD.
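
The two rules above can be illustrated with a small Python sketch. It is not engine's implementation: normalize_reference and its remote_ids argument (a mapping from remote name to repository id) are hypothetical names introduced for this example. The prefix defaults to refs/remote/ as written in this document; plain git stores remote-tracking branches under refs/remotes/, so the prefix is left configurable.

```python
def normalize_reference(ref_name, repository_path, remote_ids,
                        remote_prefix="refs/remote/"):
    """Return (repository_id, reference_name) following the rules above.

    remote_ids maps a remote name to its repository id, for example
    {"origin": "github.com/foo/bar"}. It is a hypothetical input.
    """
    if ref_name.startswith(remote_prefix):
        rest = ref_name[len(remote_prefix):]
        # Split only at the first '/', so branch names may contain slashes.
        remote_name, _, branch = rest.partition("/")
        return remote_ids[remote_name], "refs/heads/" + branch
    # Local references belong to the file:// repository of the checkout.
    return "file://" + repository_path, ref_name
```

For the clone described above, refs/remote/origin/HEAD maps to the pair (github.com/foo/bar, refs/heads/HEAD), while a local refs/heads/master stays under file:///home/foo/bar.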

Playing around with engine on Jupyter

You can launch our Docker container, which contains some example notebooks, by running:

docker run --name engine-jupyter --rm -it -p 8080:8080 -v $(pwd)/path/to/siva-files:/repositories --link bblfshd:bblfshd srcd/engine-jupyter

You must have some siva files available locally to mount into the container; replace the path $(pwd)/path/to/siva-files with their location. You can get some siva-files from the project here.

You should have a bblfsh daemon container running so that the Jupyter container can link to it (see Pre-requisites).

When the engine-jupyter container starts it will show you a URL that you can open in your browser.

Using engine directly from Python

If you are using engine directly from Python and are unable to modify PYSPARK_SUBMIT_ARGS, you can copy the engine jar to the pyspark jars directory to make it available there.

cp engine.jar "$(python -c 'import pyspark; print(pyspark.__path__[0])')/jars"

This way, you can use it in the following way:

import sys

pyspark_path = "/path/to/pyspark/python"
sys.path.append(pyspark_path)

from pyspark.sql import SparkSession
from sourced.engine import Engine

siva_folder = "/path/to/siva-files"
spark = SparkSession.builder.appName("test").master("local[*]").getOrCreate()
engine = Engine(spark, siva_folder, 'siva')
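
For the opposite case, where the environment can be modified, a possible alternative to copying the jar is to request the package through the PYSPARK_SUBMIT_ARGS environment variable that pyspark reads when it launches the JVM. The sketch below only builds the value; packages_submit_args is a hypothetical helper, and [version] remains a placeholder for the latest engine version.

```python
import os


def packages_submit_args(version, extra_args=""):
    """Build a PYSPARK_SUBMIT_ARGS value that pulls engine via --packages.

    Hypothetical helper for illustration, not part of engine.
    """
    parts = ["--packages", "tech.sourced:engine:" + version]
    if extra_args:
        parts.append(extra_args)
    # pyspark expects the value to end with the 'pyspark-shell' resource.
    parts.append("pyspark-shell")
    return " ".join(parts)


# Set it before the first SparkSession (and its JVM) is created, e.g.:
# os.environ["PYSPARK_SUBMIT_ARGS"] = packages_submit_args("[version]")
```

With the variable set before SparkSession.builder...getOrCreate() runs, Spark should resolve the engine package itself, so the jar does not need to be copied by hand.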

Development

Build fatjar

Building the fatjar is needed to build the Docker image that contains the Jupyter server, or to test changes in spark-shell by passing the jar with the --jars flag:

$ make build

It leaves the fatjar in target/scala-2.11/engine-uber.jar.

Build and run docker to get a Jupyter server

To build an image with the latest build of the project:

$ make docker-build

Notebooks under the examples folder will be included in the image.

To run a container with the Jupyter server:

$ make docker-run

Before running the Jupyter container you must run a bblfsh daemon:

$ make docker-bblfsh

If it's the first time you run the bblfsh daemon, you must install the drivers:

$ make docker-bblfsh-install-drivers

To see installed drivers:

$ make docker-bblfsh-list-drivers

To remove the development jupyter image generated:

$ make docker-clean

Run tests

engine uses bblfsh so you need an instance of a bblfsh server running:

$ make docker-bblfsh

To run tests:

$ make test

To run tests for the Python wrapper:

$ cd python
$ make test

Windows support

There is no Windows support in enry-java or bblfsh's client-scala right now, so the language detection and UAST features are not available on the Windows platform.

Code of Conduct

See CODE_OF_CONDUCT.md

License

Apache License Version 2.0, see LICENSE
