
How to Wrap an Arbitrary Java Class in a JRuby Gem

Michał Kulesza

Updated Feb 27, 2024 • 6 min read

Recently, I stumbled upon a Java class that does exactly what I had in mind when I started writing my gem: it extracts text from a PDF while keeping the text structure.

I was a Java developer once, but I wanted my project to still use Ruby.
"Let's wrap it in a JRuby gem!" came to mind. I started googling and found excellent tutorials on the topic. However, each of them covered wrapping a jar package rather than a single class. I dug deeper and found answers scattered across the web, so I decided to collect them in this post.

First, let me introduce the Java class: PDFLayoutTextStripper. It is a fairly standard class by Java standards, with one important omission: it has no package declaration. Packages in Java roughly correspond to modules in Ruby. The tutorials I found assumed that every Java class is namespaced by a package name, and to be honest, I didn't want to change the class signature. I spotted a challenge here :)

Ok, let's start. I mentioned a gem, right? But before we create a gem we need to ensure that we are using JRuby:


❯ ruby -v
jruby 9.1.12.0 (2.3.3) 2017-06-15 33c6439 Java HotSpot(TM) 64-Bit Server VM 9.0.4+11 on 9.0.4+11 +jit [darwin-x86_64]


To create the gem, I followed the standard approach described in the Bundler guide:
❯ bundle gem pdf-textstream # naming things is the second hardest thing in IT, right?
Sadly, because we will be calling Java code, our gem will only be compatible with JRuby. To make sure it is installed and run only on the JVM, modify the pdf-textstream.gemspec file and set the platform parameter:
spec.platform = 'java'
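
For context, here is a minimal sketch of how that line might sit in the gemspec. Everything marked as a placeholder (version, summary, author, file globs) is an assumption for illustration and is not taken from the actual gem:

```ruby
# pdf-textstream.gemspec (abridged sketch; "placeholder" values are assumptions)
Gem::Specification.new do |spec|
  spec.name    = "pdf-textstream"
  spec.version = "0.1.0"                                  # placeholder
  spec.summary = "PDF text extraction via a Java class"   # placeholder
  spec.authors = ["Your Name"]                            # placeholder

  # Restrict the gem to JRuby / the JVM.
  spec.platform = "java"

  # Ship the Ruby wrapper together with the bundled jars and compiled classes.
  spec.files         = Dir["lib/**/*.rb", "jars/*.jar", "classes/**/*.class"]
  spec.require_paths = ["lib"]
end
```

With the platform set this way, RubyGems treats the gem as a Java-platform build, so it should not be picked up when installing under MRI.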

The wrapper code lives in lib/pdf/textstream.rb. Let me walk you through it, line by line.
require "pdf/textstream/version"
require "java"


To use Java classes (including the Java standard library, and even to reference Java code directly), we have to require the java module.

The next step is to require the Java jars the Ruby way:
# load jars
require_relative "../../jars/pdfbox-2.0.6.jar"
require_relative "../../jars/commons-logging-1.2.jar"
require_relative "../../jars/fontbox-2.0.6.jar"

These are the dependencies of the class introduced above. Of course, you have to download them, put them in the `jars` directory, and distribute the compiled jars together with your gem.
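
If the list of jars grows, the explicit requires could be replaced by a small loop. The sketch below is only an illustration: `jar_paths` and `load_jars` are hypothetical helper names, not part of the gem, and `load_jars` is only meaningful when run on JRuby.

```ruby
# Hypothetical helper: list every jar in a directory, in a stable order.
def jar_paths(jars_dir)
  Dir.glob(File.join(jars_dir, "*.jar")).sort
end

# Require each jar so that JRuby adds it to the classpath.
# Pass an absolute directory so the require does not depend on $LOAD_PATH.
def load_jars(jars_dir)
  jar_paths(jars_dir).each { |jar| require jar }
end

# In lib/pdf/textstream.rb this could stand in for the three require_relative lines:
# load_jars(File.expand_path("../../jars", __dir__))
```

The explicit requires in the post have one advantage over this loop: they document exactly which versions the wrapper was written against.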

The next important line is the classpath definition:
$CLASSPATH << "#{File.expand_path(File.dirname(__FILE__))}/../../classes"
module Pdf
module Textstream

The classpath will be familiar to anyone with a Java background: it is the list of locations where the JVM looks for classes and libraries. At this point there is no directory named classes in our project; the Java compiler will create it. But we still don't have the compiler step in place.
It is probably not the best practice, but I included a build file that runs the following command:

javac -d classes -cp .:./jars/pdfbox-2.0.6.jar:./jars/commons-logging-1.2.jar:./jars/fontbox-2.0.6.jar *.java

You have to run this command manually each time you modify the Java class or change its dependencies.
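
As a rough sketch, the same javac invocation could be assembled in Ruby so that the jar list does not have to be maintained by hand. The helper name `javac_command` and its keyword arguments are assumptions made for this example; the actual build file in the gem simply runs the command shown above.

```ruby
# Hypothetical build helper: assembles the javac invocation as an argument list.
def javac_command(jar_dir: "jars", out_dir: "classes", sources: Dir.glob("*.java"))
  jars = Dir.glob(File.join(jar_dir, "*.jar")).sort
  # "." plus every jar, joined with the platform's classpath separator
  # (":" on Unix-like systems, ";" on Windows).
  classpath = ["."].concat(jars).join(File::PATH_SEPARATOR)
  ["javac", "-d", out_dir, "-cp", classpath, *sources]
end

# Example use from the gem's root directory:
# system(*javac_command) || abort("javac failed")
```

Passing the command to `system` as separate arguments avoids shell quoting issues if a path ever contains spaces.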
And finally, the magic bits. First, copy the Java class to the root directory of your gem. Then, by using JRuby as a proxy, we can reference it:
PDFLayoutTextStripper = JavaUtilities.get_proxy_class("PDFLayoutTextStripper")

Next, I shortened the namespaces of the classes I use. Each Java class can be referenced the Ruby way by walking the Java module tree:
# change namespace
PDFParser = Java::OrgApachePdfboxPdfparser::PDFParser
RandomAccessFile = Java::OrgApachePdfboxIo::RandomAccessFile
PDDocument = Java::OrgApachePdfboxPdmodel::PDDocument
PDFTextStripper = Java::OrgApachePdfboxText::PDFTextStripper
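
The pattern visible in these constants is that each segment of the Java package name is capitalized and the segments are joined into one module name under `Java`. A tiny illustrative helper (hypothetical, not part of JRuby or the gem) makes the mapping explicit:

```ruby
# Illustrative helper: derive the JRuby module name for a Java package
# by capitalizing each dot-separated segment and joining them.
def jruby_module_name(java_package)
  java_package.split(".").map { |part| part[0].upcase + part[1..-1] }.join
end

jruby_module_name("org.apache.pdfbox.pdfparser") # => "OrgApachePdfboxPdfparser"
jruby_module_name("java.io")                     # => "JavaIo"
```

So `org.apache.pdfbox.pdfparser.PDFParser` becomes `Java::OrgApachePdfboxPdfparser::PDFParser`, as in the listing above.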

To run the class on a file located at a given path, I created a class-level method:

def self.file_path_to_text(path)
  # TODO: exception handling
  # PDFParser expects a Java-side file handle, not a Ruby File object
  pdf_parser = PDFParser.new(RandomAccessFile.new(Java::JavaIo::File.new(path), "r"))
  pdf_parser.parse
  pd_document = PDDocument.new(pdf_parser.getDocument)
  pdf_text_stripper = PDFLayoutTextStripper.new
  # the extracted text is the method's return value
  pdf_text_stripper.getText(pd_document)
end

It initializes the PDF parser, parses the PDF file, passes the document to our arbitrary class, and returns the string it read.

The trickiest part was that I was trying to pass a Ruby file handle to PDFParser as an argument. Of course, it failed: the PDFParser signature expects a file handle from the Java world. That was new to me, and it is why I had to read the file "the Java way": RandomAccessFile.new(Java::JavaIo::File.new(path), "r")
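
The method leaves exception handling as a TODO. One possible shape for it is sketched below, under clearly stated assumptions: the name `file_path_to_text_checked` is hypothetical, the code must run on JRuby with the jars and the classes directory already loaded, and mapping Java's IOException to Ruby's IOError is just one choice among several.

```ruby
# Sketch only: the same steps as file_path_to_text, plus basic error handling.
# Assumes JRuby with the PDFBox jars and the compiled class already on the classpath.
def file_path_to_text_checked(path)
  # Fail early, on the Ruby side, before any Java object is created.
  raise ArgumentError, "no such file: #{path}" unless File.file?(path)

  document = nil
  begin
    java_file = Java::JavaIo::File.new(path)
    parser = Java::OrgApachePdfboxPdfparser::PDFParser.new(
      Java::OrgApachePdfboxIo::RandomAccessFile.new(java_file, "r")
    )
    parser.parse
    document = Java::OrgApachePdfboxPdmodel::PDDocument.new(parser.getDocument)
    JavaUtilities.get_proxy_class("PDFLayoutTextStripper").new.getText(document)
  rescue Java::JavaIo::IOException => e
    # Translate the Java exception into a Ruby one for callers.
    raise IOError, "could not read PDF #{path}: #{e.message}"
  ensure
    # Release the document's resources even when extraction fails.
    document.close if document
  end
end
```

The early `File.file?` check keeps the most common failure, a wrong path, out of the Java stack trace entirely.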

And… that's it! Your Java class, packed as a gem, is ready to use! You can find the gem in my GitHub repo. Please keep in mind that it was created as a proof of concept and is not ready for production use.

