Pentaho BI: The Tutorial You Were Waiting for!

by James Warner about a month ago in product review


Pentaho is one of the most popular open-source business intelligence suites on the market today. From data integration to report generation and analysis, Pentaho is fast changing the BI landscape. Kettle, the open-source tool behind Pentaho Data Integration, is one of the best available in its category.

The real game-changer, however, has been the drag-and-drop options and easy step-by-step wizards that Pentaho has developed in recent years for almost all tasks. These have helped companies cut training costs by thousands of rupees every month.

Today, we will be discussing Pentaho’s Data Integration (PDI) tool and how to make efficient use of it.

Where to Use PDI in the Real World

PDI sounds great in theory, but where can we actually use it? Listed below are some scenarios where PDI can serve as a savior.

Loading data warehouse or data marts

Loading a data warehouse involves three necessary steps:

1) Extracting - Data must be extracted from CSV files, XML files or other sources before you can work on it.

2) Transforming - This covers filtering out irrelevant data, performing arithmetic operations and changing data types. Transforming is done to make the data fit for analytical use.

3) Loading - The transformed data is either loaded into the database or used to overwrite the existing data.
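To make the three steps concrete, here is a minimal Python sketch of the same extract, transform and load idea. It is not how Kettle works internally, and everything in it is made up for illustration: the sample CSV data, the orders table and the function names. It uses an in-memory SQLite database as a stand-in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical sample input; in practice this would be a CSV file on disk.
RAW_CSV = """order_id,amount,currency
1,10.50,USD
2,not-a-number,USD
3,7.25,EUR
"""


def extract(text):
    """Extract: read CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with non-numeric amounts and convert data types."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # filter out rows whose amount is not a number
        cleaned.append((int(row["order_id"]), amount, row["currency"]))
    return cleaned


def load(conn, records):
    """Load: write the transformed records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount REAL, currency TEXT)"
    )
    # INSERT OR REPLACE overwrites an existing row that has the same order_id.
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


def run_etl(conn, text=RAW_CSV):
    """Run all three steps and return the contents of the target table."""
    load(conn, transform(extract(text)))
    query = "SELECT order_id, amount, currency FROM orders ORDER BY order_id"
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    print(run_etl(sqlite3.connect(":memory:")))
```

Running the ETL a second time against the same connection overwrites the existing rows instead of duplicating them, which mirrors the "overwrite the existing data" case in the loading step.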

Integrating data

Data integration has hundreds of applications. Two companies might merge similar data sets to get a complete view of their business, or two departments might combine data from completely different management applications to compare one department's performance with the other's. These are just a few such scenarios.
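As a rough illustration of what merging two data sets means, the Python sketch below joins records from two hypothetical departments on a shared key. The field names (employee_id, revenue, tickets_closed) and the merge_on_key helper are assumptions made for this example, not part of PDI.

```python
def merge_on_key(left, right, key):
    """Join two lists of dicts on a shared key, keeping only rows found in both."""
    right_by_key = {row[key]: row for row in right}
    merged = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            # Combine the fields of both records into one row.
            merged.append({**row, **match})
    return merged


# Hypothetical data from two departments that use different applications.
sales = [
    {"employee_id": 1, "revenue": 1200},
    {"employee_id": 2, "revenue": 800},
]
support = [
    {"employee_id": 1, "tickets_closed": 35},
    {"employee_id": 3, "tickets_closed": 12},
]

combined = merge_on_key(sales, support, "employee_id")
```

Only employee 1 appears in both lists, so only that record ends up in the combined result. This is the behaviour of an inner join; other join types would keep the unmatched rows too.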

Data cleansing

Kettle helps cleanse data by eliminating records that are inaccurate, that do not meet certain criteria or that are simply irrelevant. The cleaner and more accurate the data, the better the chances of getting an accurate report and business strategy analysis at the end of the process.
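The idea behind cleansing can be sketched in a few lines of Python. This is only an illustration under assumed rules: the cleanse function, the sample customer records and the age check are all hypothetical, and in PDI you would build the equivalent logic with steps in a transformation.

```python
def cleanse(rows, required_fields, is_valid):
    """Split rows into kept and rejected, based on required fields and a rule."""
    kept, rejected = [], []
    for row in rows:
        # A required field counts as missing when it is absent, None or empty.
        has_required = all(row.get(f) not in (None, "") for f in required_fields)
        if has_required and is_valid(row):
            kept.append(row)
        else:
            rejected.append(row)
    return kept, rejected


# Hypothetical customer records: one clean, one incomplete, one inaccurate.
customers = [
    {"customer_id": 1, "email": "a@example.com", "age": 34},
    {"customer_id": 2, "email": "", "age": 41},
    {"customer_id": 3, "email": "c@example.com", "age": -5},
]

kept, rejected = cleanse(
    customers,
    required_fields=["customer_id", "email"],
    is_valid=lambda row: 0 <= row["age"] <= 120,
)
```

Keeping the rejected rows in a separate list, instead of silently dropping them, makes it easier to check later why a record never reached the report.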

Exporting Data

The government might ask for data, other departments of your company might ask for data, or a business report might need data pulled together from several places. For jobs like these, Kettle supports exporting data to external destinations.
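For a sense of what an export involves, here is a small Python sketch that writes selected columns of some records as CSV. The export_csv helper and the sample records are hypothetical, and the sketch writes to an in-memory buffer so it is easy to try; pass an open file instead to write to disk.

```python
import csv
import io


def export_csv(rows, fieldnames, out):
    """Write the chosen columns of each row to a file-like object as CSV."""
    # extrasaction="ignore" leaves out any column not listed in fieldnames.
    writer = csv.DictWriter(out, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical records; internal_note should not leave the company.
records = [
    {"id": 1, "name": "Asha", "internal_note": "VIP"},
    {"id": 2, "name": "Ravi", "internal_note": ""},
]

buffer = io.StringIO()
export_csv(records, ["id", "name"], buffer)
exported_lines = buffer.getvalue().splitlines()
```

Listing the columns explicitly is a deliberate choice here: whoever receives the file gets only the fields you meant to share.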

Step-by-Step Guide to Installing and Using Pentaho

Prerequisites: JRE 6.0 should be preinstalled on the system. If it isn’t, download and install it first.

Operating System: Linux, Mac OS, Windows

Installing PDI:

1. Go to the Pentaho Data Integration download page.

2. Download the latest version that matches your system specification.

3. Unzip the downloaded folder. If you are on Windows, your installation is complete. If you’re on Linux or a similar environment, you must make the scripts executable.

Assuming your folder path is: /home/kettle

Execute: cd /home/kettle

chmod +x *.sh

If you are on a Mac, do not forget to give execute permissions to the JavaApplicationStub file.

Launching Spoon:

Spoon is the desktop designer tool that helps you manipulate data. It is the PDI graphical designer.

1. On a Windows system, run Spoon.bat. To do this, select Run from the Start menu, type ‘cmd’ and press Enter. Once the terminal opens, run Spoon.bat from the installation folder. On a UNIX-like environment, just open a terminal and run the Spoon shell script from the installation folder. If you’re on a Mac, just execute the JavaApplicationStub file.

2. As soon as you see a dialogue box asking for the repository connection, click Cancel.

3. Now, you will see a small window labeled Spoon Tips. You can explore this a bit before closing it.

4. Finally, you arrive at the main screen, which says Welcome!

5. Go to the menu bar and click Tools, then select Options… And now you can go crazy. For example, click the Look & Feel tab to change the appearance of your grid.

Creating a connection to the DI repository

Now it’s time for some serious work – transformations. But before we do that, we need data to work with and that can be found once we connect to a data integration repository.

1. From the Tools menu, select Repository > Connect.

2. Click the Add button to open the Select Repository Type window. Once there, select DI Repository: DI Repository and click OK. In the Repository Configuration window, enter a Name and Description for your repository. If you wish, you can modify the URL. If you do, click Test to make sure the change is correct. If the test fails, check that the DI server is running and that the port number in the URL is correct.

3. Click OK to close the success dialogue box.

4. Close the Repository Configuration window by clicking OK.

5. To connect to the repository, click on the repository name, type in the username and password and enjoy.
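If the Test button in step 2 fails and you are unsure whether the server or the URL is to blame, a quick check from outside Spoon can help. The Python sketch below only tests whether something is listening on the host and port taken from a URL. The port_is_open helper is hypothetical, and the check says nothing about whether the service on that port is actually a DI server or whether your credentials are valid.

```python
import socket
from urllib.parse import urlsplit


def port_is_open(url, timeout=3.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parts = urlsplit(url)
    host = parts.hostname
    if host is None:
        raise ValueError(f"no host found in URL: {url!r}")
    # Fall back to the scheme's default port when the URL does not name one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False


# Usage (assumption: replace the placeholder with your own repository URL):
#     port_is_open("http://your-di-server:PORT/path")
```

A False result points to the server being down, a wrong port or a firewall in between. A True result means the URL at least reaches a listening service, so the next thing to check is the rest of the URL and your login details.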

Now that we have covered the very basics of PDI, you just have to explore and let your experimental side take over to build what you wish to. Reports, transformations, and analysis are clicks away!

This tutorial is merely the beginning, but if PDI is not configured properly, it will yield no results. So take your time when setting it up. Once you are done with the basics, you can go ahead and start carrying out your tasks. If you can use Facebook, Pentaho won’t be a challenge in the least.


James Warner is a Business Analyst / Business Intelligence Analyst and an experienced programmer and software developer at NexSoftSys, with excellent knowledge of Hadoop/big data analysis, testing and deployment of software systems.
