Hadoop | Albatrosa Ltd

If you are familiar with Tableau and would like to learn about big data connectivity, then this post is for you. It is a step-by-step guide to connecting tableau to Impala / Big Data using the familiar superstore data set.

We will cover:

System requirements
Downloading the necessary software, including a Virtual Machine
How to create and install a Hadoop Cloudera Big Data virtual machine
Load the sample Superstore Tableau CSV
Creating your visualisation in Tableau

At the end of this exercise, you should have a fully working Impala database which you can further use to learn more about Hadoop and big data. We will be covering advanced examples in future posts.

1. System Requirements

64 bit operating system, including Windows 7 and 8
At least 6 Gbytes of RAM. 8 Gbytes recommended
4 Gbytes of disk space
If you store the virtual machine on an external drive, then a USB 3.0 connection is recommended. It will work with USB 2.0 but it will be much slower

2. Software

2.1 VMware Player

The best way to explore Hadoop is by using a virtual machine. We will be using a Cloudera image in this blog. You will need VMware Player for Windows and Linux, or VMware Fusion for Mac.

VMware player can be download directly from VMware.

VMware Fusion only works on Intel architectures, so older Macs with PowerPC processors cannot run the QuickStart VM.

2.2 7-Zip

The Cloudera Hadoop virtual machine image is compressed with 7-Zip – an open source file archiver used to compress files. 7-Zip operates with the 7z archive format, but can read and write several other archive formats. 7-Zip has superior compression to Zip and can be dowloaded from the open source repository SourceForge.

2.3 Cloudera VM Image

The easiest way to start testing Hadoop is with Cloudera’s QuickStart Virtual Machine (VM). The QuickStart VMs contain a single-node Apache Hadoop cluster, complete with all the software to configure the system, upload sample data, and set-up the database.

The virtual machine can be download from Cloudera’s web site. Select the latest version (5.13.x at the time of writing) and click Download for VMware

Unzip the downloaded file using 7.zip and store the virtual machine on either your hard disk or an external disk.

Some Virtual machine files are over 500 Mbytes. Avoid using USB storage (even 3,0) as you may experience a very slow response.

2.4 Cloudera ODBC Connector

To connect Tableau to Impala (big data database), you need to install the ODBC connector which can be downloaded directly from Cloudera. Select ODBC Drivers and Connectors and download the driver for the operating system you have Tableau installed on.

2.5 Prepare the Tableau Data

We will use the sample Tableau superstore data to test our big data database. HIVE / Impala do not support a Date format (only Date/Time). Fortunately Tableau can do the date casting from a string.

We have created a new field, ShipDate2, in the format of YYYY-MM-DD to get around this limitation.

HIVE / Impala, can read data from a variety of formats, including comma and tab delimited, but not excel. We have opted for tab delimited as some fields contain commas. To save you the trouble of creating the updated excel file and saving it in tab delimited format, you can download the files from superstore.xlsx and superstore.txt

Finally you need to you need to upload superstore.txt to a location where you can then download it in the virtual machine. The Albatrosa team suggests a cloud based service such as Dropbox, Box, OneDrive, and / or Google Drive.

An alternative is to store the file on a shared folder on your PC / Mac and creating a virtual directory in VM Player. We opted for a cloud drive as it is easier then setting up a shared drive.

3. Install and Configure Hadoop Cloudera Virtual Machine

3.1 Loading the Virtual Machine

Launch VM Player, click on Open Virtual Machine, and go to the location where you have unzipped the file.

You are now ready to launch the virtual machine and your screen should look similar to the following. Click on Play Virtual Machine. The virtual machine takes time to start. This might be a good time to go and get a coffee.

The Albatrosa Consultants suggest closing as many applications as possible before launching the virtual machine as it will consume approximately 4 Gbytes of memory.

3.2 Install and Configure Hadoop Cloudera Virtual Machine

Once you launch the VM, you are automatically logged in as the cloudera user. Firefox will launch automatically and you are presented with the following screen. Click on Hue menu to get started.

If prompted for a username and password then use cloudera for both. From the Quick Start, click Next again.

Click on Hive Editor and Impala Editor to install the sample data sets.

3.3 Test the Virtual Machine

Let’s first check that Impala (database) is running well.Select Impala from the Query Editors Menu menu

Type select count(*) from sample_07; in the editor and press the Execute button.

If all goes well, a count of 823 is returned

4. Load and Test the Sample Data

4.1 Load the Superstore Data

You are now ready to load the Tableau sample superstore data set and configure Impala to process it. In Firefox, open the cloud based storage where you have saved superstore.txt and download the file to the Downloads folder.

To load the file, click on Metastore Tables in Database Browser menu.

Complete the form as shown below.

For the input file, click on . . to load the file from the Downloads folder.

When done, click Create Table to set-up the database. Accept the defaults and go through the next steps.

4.2 Test the database

To test the database, click Browse the data and you should get a listing of the table schema.

5.1 Creating your visualisation in Tableau

5.1 IP Address

To connect Tableau to Impala, you need the virtual machine IP address. Click on terminal and then type /sbin/ifconfig and keep a note of the IP address.

5.2 Launch Tableau

Launch Tableau with blank workbook. Click on Connect to data and select Cloudera Hadoop as connection type. Fill the parameter as per below replacing the IP address with the one you have retrieved earlier.

If you don’t manage to login to Impala without Authentication, then try authenticate by username and password. use cloudera for both.

Under Connect to dataSelect a Schema, press the search button, and click on Connect to datadefault schema. For table name, press the search button, and then select superstore.txt.

The next step is to to change the data type for Shipdate2 String to date.

Almost there …. Drag some of the fields to the workbook. You should have a fully functional superstore sample which you can use in Tableau.

5.3 Sample Workbook

The Albatrosa team have also created a sample workbook which you can download to test run Impala.

6. Conclusion

It’s that simple. We hope you have found this post useful and look forward to your comment.

Albatrosa is a London based consultancy that specialises in big data analytics and visualisation. Our team of consultants help our clients build the processes and dashboards that enable them to analyse business data and make better informed, faster decisions. Contact us to start your Tableau project.

Albatrosa Ltd

Data . Insight . Action

Tag Archives: Hadoop

Step by Step Guide to Building your Tableau Visualisation Big Data Project