Hive – Update/Insert/Delete

Hi Folks,

Today I am going to show you how we can perform ACID operations on Hive. As we all know, Hive is built on top of HDFS, and HDFS does not support updating a file in place.

First, let's understand what ACID semantics means:

1.  Atomicity :- A transaction either completes fully or fails; there is no half-done state.

2. Consistency :- Once an operation completes, its result is visible in the same way to every other operation.

3. Isolation :- An operation should not impact other operations while it is running.

4. Durability :- Once committed, the result of an operation stays intact; it should not change even if the machine/system fails.

Let's see how we can perform the above-mentioned tasks on Hive. Before that, I want you to know that this is not a use case for an OLTP database; it fulfils the smaller requirement of updating limited amounts of data when needed.

Properties we need to set before proceeding to transactional tables (a quick sketch of setting them follows the list):

  1. hive.support.concurrency  = TRUE
  2. hive.enforce.bucketing = TRUE
  3. hive.exec.dynamic.partition.mode = NONSTRICT
  4. hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
  5. hive.compactor.initiator.on = TRUE
  6. hive.compactor.worker.threads = a positive number on at least one instance of the Thrift metastore service
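A minimal sketch of setting these for a Hive session from the shell; note that the compactor properties (hive.compactor.*) normally belong in hive-site.xml on the metastore host rather than in a client session, and the values here are only illustrative:

hive -e "
  SET hive.support.concurrency=true;
  SET hive.enforce.bucketing=true;
  SET hive.exec.dynamic.partition.mode=nonstrict;
  SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
  SET hive.compactor.initiator.on=true;
  SET hive.compactor.worker.threads=1;
  -- the transactional statements (insert/update/delete) would follow in the same session
"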

Limitations

  • Table must be bucketed
  • tblproperties ("transactional"="true") must be set on the table
  • String values should be quoted with single quotes, e.g. 'abc'
  • Only the ORC format is supported
  • Schema-level changes are not currently supported

Compactions

Minor Compaction :- All the delta files (update chunks) are merged into a single larger delta file. This happens automatically, and you can also configure it according to your needs.

Major Compaction :- The delta files are merged with the base file to create a single new base file. This is configurable too, and compaction can also be triggered manually, as shown below.
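A hedged sketch of triggering compaction by hand with the standard ALTER TABLE ... COMPACT statement, run here from the shell (the table name assumes the base_table created below):

hive -e "ALTER TABLE base_table COMPACT 'minor';"
hive -e "ALTER TABLE base_table COMPACT 'major';"
hive -e "SHOW COMPACTIONS;"    # check the compaction queue and status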

We all know how to create a table; below is the command to create a table called base_table:

     create table base_table (rt int, name1 string, name2 string, marks int, age int) clustered by (rt) into 32 buckets stored as orc TBLPROPERTIES ("transactional"="true") ;

Now let's insert some data into it. There are a few ways to do that; let's see what is available.

Insert a single row at a time (one map-reduce job per statement)

  insert into table base_table values (1,'vikas','srivastava',1020,28) ;

Insert multiple rows (one map-reduce job)

insert into table base_table values (1,'vikas','srivastava',1020,28) , (2,'nikhil','srivastava',1231,24),(3,'preetika','srivastava',1212,25);

Insert from different table

insert into table base_table select * from old_table

The data is now inserted, but remember that each way of inserting data has its own merits.

Now let's try to update some of the records pushed into base_table.

update base_table set name2="sinha" where rt=3 and name1="preetika";

Now if you look at the contents of the table, you will see the update:

Select * from base_table

  • 1    vikas    srivastava   1020  28
  • 2    nikhil    srivastava   1231  24
  • 3    preetika    sinha   1212  25

There are some conditions you need to remember for UPDATE:

  • Sub queries are not supported.
  • Only rows that match the WHERE clause will be updated.
  • Partitioning columns cannot be updated.
  • Bucketing columns cannot be updated.

Now let's see how we can delete records.

delete from base_table where rt=1;

Running the above command will delete the first record. Now if you scan the table you will find only two records.

Select * from base_table

  • 2    nikhil    srivastava   1231  24
  • 3    preetika    sinha   1212  25

Note:- It is not a good idea to run frequent updates/deletes, because every such operation spawns a map-reduce job. So use it only when required.

Hadoop File Recovery (after -skipTrash)

Hi All, today I am going to explain how we can recover a file that was deleted from the cluster by mistake.

We have a three-node HDP cluster running all the services. We will go step by step to see how we can get the file back.

Creating a file in the cluster

[Screenshot: creating important.txt on the local filesystem]

Put this file into HDFS, in the home directory of the hdfs user:

[hdfs@sandbox-hdp ~]$ hadoop fs -put important.txt /user/hdfs 
[hdfs@sandbox-hdp ~]$ hadoop fs -ls /user/hdfs/ 
Found 1 items
-rw-r--r--   1 hdfs hdfs         49 2018-03-06 14:37 /user/hdfs/important.txt

Now that we have the file in HDFS, I will delete it with the -skipTrash option.
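The screenshot below shows the deletion; the command would have been along these lines (same path as above):

hadoop fs -rm -skipTrash /user/hdfs/important.txt
hadoop fs -ls /user/hdfs/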

[Screenshot: deleting /user/hdfs/important.txt with the -skipTrash option]

As you can see, I have deleted the file and there is nothing left in the home folder.

Now you need to stop the HDFS services in your cluster.

[Screenshot: stopping the HDFS services from Ambari]

Go to the current directory of your namenode.

[hdfs@sandbox-hdp ~]$ cd /hadoop/hdfs/namenode/current/
[hdfs@sandbox-hdp current]$ pwd
/hadoop/hdfs/namenode/current
[hdfs@sandbox-hdp current]$ ll
total 7036
-rw-r--r-- 1 hdfs hadoop    3280 Mar  5 15:17 edits_0000000000000018851-0000000000000018874
-rw-r--r-- 1 hdfs hadoop 1048576 Mar  5 15:52 edits_0000000000000018875-0000000000000019517
-rw-r--r-- 1 hdfs hadoop    3706 Mar  5 15:56 edits_0000000000000019518-0000000000000019544
-rw-r--r-- 1 hdfs hadoop  899265 Mar  6 12:55 edits_0000000000000019545-0000000000000025898
-rw-r--r-- 1 hdfs hadoop 1048576 Mar  6 14:41 edits_inprogress_0000000000000025899
-rw-r--r-- 1 hdfs hadoop   88701 Mar  5 15:56 fsimage_0000000000000019544
-rw-r--r-- 1 hdfs hadoop      62 Mar  5 15:56 fsimage_0000000000000019544.md5
-rw-r--r-- 1 hdfs hadoop   88525 Mar  6 12:55 fsimage_0000000000000025898
-rw-r--r-- 1 hdfs hadoop      62 Mar  6 12:55 fsimage_0000000000000025898.md5
-rw-r--r-- 1 hdfs hadoop       6 Mar  6 12:55 seen_txid
-rw-r--r-- 1 hdfs hadoop     201 Mar  5 15:12 VERSION

Here you can see a file named edits_inprogress_0000000000000025899. It contains the current (in-progress) operations performed on the Hadoop cluster, so let's check its contents and see what has been done.

Convert the file into XML to read its contents:

hdfs oev -i edits_inprogress_0000000000000025899 -o edits_inprogress_0000000000000025899.xml

Once the file is converted into XML, you need to find the relevant record inside it, like below.

[hdfs@sandbox-hdp current]$ hdfs oev -i edits_inprogress_0000000000000025899 -o edits_inprogress_0000000000000025899.xml
[hdfs@sandbox-hdp current]$ ls *xml
edits_inprogress_0000000000000025899.xml
[hdfs@sandbox-hdp current]$ 

Let's find the record inside the file; here you can see the delete operation performed on the file, along with its transaction id.
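A quick way to locate that record (the exact fields vary a little between Hadoop versions, so treat this as a sketch):

grep -n 'important.txt' edits_inprogress_0000000000000025899.xml
grep -n -B 2 -A 8 'OP_DELETE' edits_inprogress_0000000000000025899.xml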

[Screenshot: the delete record for /user/hdfs/important.txt inside the XML edit log]

Once you find the record, delete that entire record from the XML file and save it.

The next step is to convert it back into binary:

[hdfs@sandbox-hdp current]$ hdfs oev -i edits_inprogress_0000000000000025899.xml -o edits_inprogress_0000000000000025899 -p binary
[hdfs@sandbox-hdp current]$ ll edits_inprogress_0000000000000025899                                                                       
-rw-r--r-- 1 hdfs hadoop 1048576 Mar  6 14:56 edits_inprogress_0000000000000025899              
[hdfs@sandbox-hdp current]$                                                                  

Once it is converted back into binary format, it can be reloaded by the NameNode. You might get a warning while converting, but don't worry, it will not impact anything.

Start the NameNode in recovery mode:

[hdfs@sandbox-hdp current]$ hadoop namenode -recover 
DEPRECATED: Use of this script to execute hdfs command is deprecated. 
Instead use the hdfs command for it. 
18/03/06 15:00:47 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************ 
STARTUP_MSG: Starting NameNode 
STARTUP_MSG: user = hdfs 
STARTUP_MSG: host = sandbox-hdp.hortonworks.com/172.17.0.2 
STARTUP_MSG: args = [-recover] 
STARTUP_MSG: version = 2.7.3.2.6.4.0-91
....
Syncs: 0 Number of syncs: 3 SyncTimes(ms): 8 
18/03/06 15:00:53 INFO namenode.FileJournalManager: Finalizing edits file /hadoop/hdfs/namenode/current/edits_inprogress_00000000000000278
09 -> /hadoop/hdfs/namenode/current/edits_0000000000000027809-0000000000000027810 
18/03/06 15:00:53 INFO namenode.FSNamesystem: Stopping services started for standby state 
18/03/06 15:00:53 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************ 
SHUTDOWN_MSG: Shutting down NameNode at sandbox-hdp.hortonworks.com/172.17.0.2 
************************************************************/ 
[hdfs@sandbox-hdp current]$ 

While running the above command you need to answer Y to start in recovery mode, and c (continue) if prompted.

Start the HDFS services in Ambari.

[Screenshot: starting the HDFS services from Ambari]

Once the services are started, let's check the file and its contents in the HDFS home directory of the hdfs user.

[Screenshot: important.txt back under /user/hdfs with its original contents]

Voila !! You can see the file is back 🙂

PS: Just remember to take a backup of the edits_inprogress file and the other metadata before doing this.

Hope that will save you some day 🙂

Nifi, Solr and Banana – twitter streaming

Today I will carry forward my last blog, which was about data visualization using ELK, i.e. Elasticsearch, Logstash and Kibana. Those three are older and widely used tools. Now we will do the same thing with the help of newer open source tools available in the market.

Our architecture would be like this

[Diagram: architecture – Twitter → NiFi → Solr → Banana]

This might look a little dull, but it is very powerful. Let's see the requirements for building the above project.

Apache NiFi :- A very powerful web-based ETL tool; we can perform various transformations and connect it to multiple sources and destinations. It is very easy to use and can drive an end-to-end data pipeline.

Download link:-https://www.apache.org/dyn/closer.lua?path=/nifi/0.4.1/nifi-0.4.1-bin.tar.gz

Solr:- Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world’s largest internet sites

Download link:- http://www.eu.apache.org/dist/lucene/solr/5.4.1/solr-5.4.1.tgz

Banana :- Banana is a tool to create dashboards to visualize data stored in Solr. It is commonly used with Logstash for log data and is a fork of Kibana.

Download link:- https://github.com/LucidWorks/banana/archive/release.zip

Installation of tools

Apache NiFi :- After downloading NiFi, all you need to do is untar the archive and move it to /usr/local/nifi.

tar -xzvf nifi-0.4.1-bin.tar.gz

mv nifi-0.4.1   /usr/local/nifi

Now update its configuration with the web port and any other optional settings:

vi /usr/local/nifi/conf/nifi.properties

nifi.web.http.port=8089

Once you have updated the configuration you can start the service, which is pretty simple:

./bin/nifi.sh start

Solr :- After downloading Solr, extract the bundled installation script from the archive and use it to install Solr under /usr/local/solr.

tar xzf solr-5.2.1.tgz solr-5.2.1/bin/install_solr_service.sh --strip-components=2

./install_solr_service.sh solr-5.2.1.tgz -i /usr/local/solr -u $user

// where -i is the installation directory and $user is the user you are running as.

After you perform the above steps, Solr will be installed under /usr/local/solr/ and you can start the service:

mv /usr/local/solr/solr-5.2.1/*  /usr/local/solr/

./bin/solr start -c -z localhost:2181    // if ZooKeeper is running; otherwise omit the -z option

Once Solr has started you can open the web UI at http://hostname:8983/solr

Banana :- After downloading Banana, all you need to do is unzip the archive and build it.

unzip release.zip

cd release

ant

After the build you will get build/banana-0.war, which you need to copy into Solr:

cp release/build/banana-0.war  /usr/local/solr/server/solr-webapp/webapp/banana.war

cp release/jetty-contexts/banana-context.xml /usr/local/solr/server/contexts/

// append the line below to the banana-context.xml above

<Set name="extractWAR">true</Set>

Now restart the solr

./bin/solr stop

./bin/solr start -c -z localhost:2181

Once it has started, go to /usr/local/solr/server/banana-webapp/webapp/app/dashboards and replace the default.json:

// banana.war is extracted (because of the extractWAR setting) into /usr/local/solr/server/banana-webapp/webapp

cd /usr/local/solr/server/banana-webapp/webapp/app/dashboards

mv default.json default.json.orig

// given by hortonworks

wget https://raw.githubusercontent.com/abajwa-hw/ambari-nifi-service/master/demofiles/default.json

Update solrconfig.xml so that Solr recognizes the timestamp format of tweets:

vi /usr/local/solr/server/solr/configsets/data_driven_schema_configs/conf/solrconfig.xml

<processor class="solr.ParseDateFieldUpdateProcessorFactory">
<arr name="format">
<str>EEE MMM d HH:mm:ss Z yyyy</str>
<str>yyyy-MM-dd'T'HH:mm:ss.SSSZ</str>
</arr>
</processor>

Once you are done, bounce Solr and create the tweets collection:

solr create -c tweets  -d data_driven_schema_configs -s 1  -rf 1

// the collection uses the data-driven (schemaless) configs, so the incoming tweet JSON can be indexed as-is.
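Once data starts flowing you can sanity-check the collection over Solr's HTTP API (replace hostname with the machine running Solr):

curl "http://hostname:8983/solr/tweets/select?q=*:*&rows=1&wt=json"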

Now you need to create the NiFi data pipeline. Download the sample Twitter workflow template from here; it is provided by Hortonworks.

Now you need to upload that template into NiFi and instantiate it.

[Screenshot: uploading the template into NiFi]

After uploading, drag the template icon onto the canvas; a drop-down will appear, from which you select the Twitter template. It will look like this:

[Screenshot: the Twitter template placed on the NiFi canvas]

Now double-click on the Twitter Dashboard process group to enter the larger flow, where you will see multiple processors. You need to define the paths for the PutHDFS and PutFile processors.

[Screenshot: the processors inside the Twitter Dashboard flow]

Right-click on PutHDFS, click Configure, go to the Properties tab and update the properties below:

Hadoop Configuration Resources :- /usr/local/hadoop/etc/hadoop/core-site.xml
Directory :-  /user/hduser/tweets_staging
// the first points to your Hadoop configuration file, the second is the HDFS path where tweets land.
Now update the PutFile processor as well:
Directory :-  /home/hduser/tweets

Once you are done with these, update your Twitter API keys in the first processor, "Grab Garden Hose" (NiFi's GetTwitter processor). Right-click it, select Configure and go to the Properties tab.

[Screenshot: Twitter API key properties in the processor configuration]

After updating these properties you can start the workflow by clicking the green play button in the toolbar.

You will see data flowing between the processors. Now it's time to see the view in Banana.

Open the web url :- http://hostname:8983/banana/#/dashboard

And you will see a view like the one below.

[Screenshot: the Banana dashboard showing live tweet data]

Thanks all, hope you like it.

Twitter Streaming using ELK

Hi folks, today I will be writing about streaming Twitter data for analysis, visualization and trend-spotting. Graphs and pictures are always a good way to present analysis. We will search for a few keywords and see what is happening around the world with those words.

Architecture would be like this

[Diagram: architecture – Twitter → Logstash → Elasticsearch → Kibana]

Requirements:- Elasticsearch , Logstash and Kibana.

Elasticsearch:- It is a search server based on Lucene. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is developed in Java and is released as open source under the terms of the Apache License. Elasticsearch is the most popular enterprise search engine followed by Apache Solr, also based on Lucene.

Download link:- https://www.elastic.co/downloads/elasticsearch

Logstash :- It is a log collector, as its name suggests. We will use it to connect to the Twitter API, receive the tweets as a stream and pass them to Elasticsearch so that an index can be built on them.

Download link:- https://www.elastic.co/downloads/logstash

Kibana :- Kibana is an open source data visualization plugin for Elasticsearch. It provides visualization capabilities on top of the content indexed on an Elasticsearch cluster. Users can create bar, line and scatter plots, or pie charts and maps on top of large volumes of data.

Download link:- https://www.elastic.co/downloads/kibana

Java 7 :- It is required by Logstash, which does the log aggregation and pushes the data into Elasticsearch.

Installation of tools

Elasticsearch :- After downloading Elasticsearch, all you need to do is untar the archive and move it to /usr/local/elasticsearch.

tar -xzvf elasticsearch-2.1.1.tar.gz

mv elasticsearch-2.1.1   /usr/local/elasticsearch

Now update its configuration with the hostname and other optional settings:

vi /usr/local/elasticsearch/config/elasticsearch.yml

network.host: hadoop

http.port: 9200

Once you have updated the configuration you can start the service, which is pretty simple:

./bin/elasticsearch
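A quick sanity check once Elasticsearch is up (hostname and port as configured above):

curl http://hadoop:9200
curl http://hadoop:9200/_cluster/health?pretty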

Logstash :- After downloading Logstash, untar the archive and move it to /usr/local/logstash.

tar -xzvf logstash-2.1.1.tar.gz

mv logstash-2.1.1   /usr/local/logstash

Now create a new configuration file with the Twitter settings:

vi /usr/local/logstash/twitter.conf

input {
twitter {
consumer_key => "N8pQIcG8AL8EQAIlM6gp5FuPM"
consumer_secret => "80YiYhQqAZ8QJugta9rBWLwS1RGfSzL5"
oauth_token => "106324-I3gt8IiRYfUK6CTurdWprr60XUaqRLth8fd"
oauth_token_secret => "nMEy2T0w0zKsaLdOhgKvbJJUBDvXpTn9dE1F"
keywords => [ "cricket" ]
full_tweet => true
}
}

filter {
}

output {
stdout {}
elasticsearch {
hosts => "hadoop"
index => "twitter"
document_type => "tweet"
template => "template.json"
template_name => "twitter"
}
}

Also create template.json, which is the Elasticsearch index template used for the twitter index:

vi /usr/local/logstash/template.json

{
"template": "twitter",
"order":    1,
"settings": {
"number_of_shards": 1
},
"mappings": {
"tweet": {
"_all": {
"enabled": false
},
"dynamic_templates" : [ {
"message_field" : {
"match" : "message",
"match_mapping_type" : "string",
"mapping" : {
"type" : "string", "index" : "analyzed", "omit_norms" : true
}
}
}, {
"string_fields" : {
"match" : "*",
"match_mapping_type" : "string",
"mapping" : {
"type" : "string", "index" : "analyzed", "omit_norms" : true,
"fields" : {
"raw" : {"type": "string", "index" : "not_analyzed", "ignore_above" : 256}
}
}
}
} ],
"properties": {
"text": {
"type": "string"
},
"coordinates": {
"properties": {
"coordinates": {
"type": "geo_point"
},
"type": {
"type": "string"
}
}
}
}
}
}
}

Once you have updated the configuration you can start the service, which is pretty simple:

./bin/logstash -f twitter.conf

After starting Logstash and Elasticsearch, you will see Logstash emitting messages on stdout and Elasticsearch indexing the tweets. Now let's move on to Kibana for visualization.
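To confirm that tweets are actually landing in the index, you can query Elasticsearch directly (index name as configured above):

curl "http://hadoop:9200/twitter/_count?pretty"
curl "http://hadoop:9200/twitter/_search?q=cricket&size=1&pretty"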

Kibana :- After downloading Kibana, do the same as for Logstash and Elasticsearch: untar it and update the configuration.

tar -xzvf kibana-4.3.1-linux-x64.tar.gz

mv kibana-4.3.1   /usr/local/kibana

Update the configuration file kibana.yml:

vi /usr/local/kibana/config/kibana.yml

server.host: "hadoop"                                     // hostname

elasticsearch.url: "http://hadoop:9200"        // elasticsearch hostname and port

Once you have updated these, just start the process:

./bin/kibana

After starting the process, open the web URL http://hadoop:5601 (5601 is the default Kibana port) and you will see a page like the one below.

[Screenshot: Kibana index pattern configuration page]

In the highlighted (yellow) field you need to provide the index used for indexing; use the same name we set in the Logstash configuration file, "twitter". Once you enter it, another field called Time-field name appears, where you should select "@timestamp", as below.

[Screenshot: selecting @timestamp as the Time-field name]

Now click Create. You can also make it the default index by clicking the * button on the same page.

Now, on the same Settings page, click the Objects tab and import the sample Twitter dashboard I have created.

[Screenshot: the Objects tab used to import saved dashboards]

Now upload the JSON below as the dashboard file Twitter-templete.json.

[
{
"_id": "Twitter-Dashboard",
"_type": "dashboard",
"_source": {
"title": "Sample Twitter Dashboard",
"hits": 0,
"description": "",
"panelsJSON": "[{\"col\":1,\"id\":\"Total-Tweets\",\"row\":4,\"size_x\":4,\"size_y\":2,\"type\":\"visualization\"},{\"col\":1,\"id\":\"Twitter-Dashboard\",\"row\":1,\"size_x\":4,\"size_y\":3,\"type\":\"visualization\"},{\"col\":5,\"id\":\"Tweets-vs.-time\",\"row\":1,\"size_x\":8,\"size_y\":5,\"type\":\"visualization\"},{\"col\":1,\"id\":\"Top-10-hashtags\",\"row\":6,\"size_x\":6,\"size_y\":4,\"type\":\"visualization\"},{\"col\":7,\"id\":\"Top-10-influencers-(by-retweet-volume)\",\"row\":6,\"size_x\":6,\"size_y\":4,\"type\":\"visualization\"}]",
"version": 1,
"timeRestore": true,
"timeTo": "now",
"timeFrom": "now-15m",
"kibanaSavedObjectMeta": {
"searchSourceJSON": "{\"filter\":[{\"query\":{\"query_string\":{\"analyze_wildcard\":true,\"query\":\"*\"}}}]}"
}
}
},
{
"_id": "Tweets-vs.-time",
"_type": "visualization",
"_source": {
"title": "Tweets vs. time",
"visState": "{\"type\":\"histogram\",\"params\":{\"shareYAxis\":true,\"addTooltip\":true,\"addLegend\":true,\"scale\":\"linear\",\"mode\":\"stacked\",\"times\":[],\"addTimeMarker\":false,\"defaultYExtents\":false,\"setYExtents\":false,\"yAxis\":{}},\"aggs\":[{\"id\":\"1\",\"type\":\"count\",\"schema\":\"metric\",\"params\":{}},{\"id\":\"2\",\"type\":\"date_histogram\",\"schema\":\"segment\",\"params\":{\"field\":\"@timestamp\",\"interval\":\"auto\",\"customInterval\":\"2h\",\"min_doc_count\":1,\"extended_bounds\":{}}}],\"listeners\":{}}",
"description": "",
"version": 1,
"kibanaSavedObjectMeta": {
"searchSourceJSON": "{\"index\":\"twitter\",\"query\":{\"query_string\":{\"query\":\"*\",\"analyze_wildcard\":true}},\"filter\":[]}"
}
}
},
{
"_id": "Total-Tweets",
"_type": "visualization",
"_source": {
"title": "Total Tweets",
"visState": "{\"aggs\":[{\"id\":\"1\",\"params\":{},\"schema\":\"metric\",\"type\":\"count\"}],\"listeners\":{},\"params\":{\"fontSize\":60},\"type\":\"metric\"}",
"description": "",
"version": 1,
"kibanaSavedObjectMeta": {
"searchSourceJSON": "{\"index\":\"twitter\",\"query\":{\"query_string\":{\"query\":\"*\",\"analyze_wildcard\":true}},\"filter\":[]}"
}
}
},
{
"_id": "Top-10-hashtags",
"_type": "visualization",
"_source": {
"title": "Top 10 hashtags",
"visState": "{\"type\":\"histogram\",\"params\":{\"shareYAxis\":true,\"addTooltip\":true,\"addLegend\":true,\"scale\":\"linear\",\"mode\":\"stacked\",\"times\":[],\"addTimeMarker\":false,\"defaultYExtents\":false,\"setYExtents\":false,\"yAxis\":{}},\"aggs\":[{\"id\":\"1\",\"type\":\"count\",\"schema\":\"metric\",\"params\":{}},{\"id\":\"2\",\"type\":\"terms\",\"schema\":\"segment\",\"params\":{\"field\":\"entities.hashtags.text\",\"size\":10,\"order\":\"desc\",\"orderBy\":\"1\"}}],\"listeners\":{}}",
"description": "",
"version": 1,
"kibanaSavedObjectMeta": {
"searchSourceJSON": "{\"index\":\"twitter\",\"query\":{\"query_string\":{\"query\":\"*\",\"analyze_wildcard\":true}},\"filter\":[]}"
}
}
},
{
"_id": "Top-10-influencers-(by-retweet-volume)",
"_type": "visualization",
"_source": {
"title": "Top 10 influencers (by retweet volume)",
"visState": "{\"type\":\"table\",\"params\":{\"perPage\":10,\"showMeticsAtAllLevels\":false,\"showPartialRows\":false},\"aggs\":[{\"id\":\"1\",\"type\":\"max\",\"schema\":\"metric\",\"params\":{\"field\":\"retweeted_status.favorite_count\"}},{\"id\":\"2\",\"type\":\"terms\",\"schema\":\"bucket\",\"params\":{\"field\":\"retweeted_status.user.screen_name\",\"size\":100,\"order\":\"desc\",\"orderBy\":\"1\"}}],\"listeners\":{}}",
"description": "",
"version": 1,
"kibanaSavedObjectMeta": {
"searchSourceJSON": "{\"index\":\"twitter\",\"query\":{\"query_string\":{\"analyze_wildcard\":true,\"query\":\"*\"}},\"filter\":[]}"
}
}
},
{
"_id": "Twitter-Dashboard",
"_type": "visualization",
"_source": {
"title": "Twitter Dashboard",
"visState": "{\"type\":\"markdown\",\"params\":{\"markdown\":\"### Sample Dashboard for Twitter stream\\nSimple dashboard for exploring & visualizing tweets tracking a topics of interest. Create new visualizations, customize the dashboard, find new insights.\\n\\n**Happy exploration!!!**\"},\"aggs\":[],\"listeners\":{}}",
"description": "",
"version": 1,
"kibanaSavedObjectMeta": {
"searchSourceJSON": "{\"query\":{\"query_string\":{\"query\":\"*\",\"analyze_wildcard\":true}},\"filter\":[]}"
}
}
}
]

Once you save it on your desktop and import it into Kibana objects, you will see the image below. Click the Open icon, and from the drop-down select the Sample Twitter Dashboard.

[Screenshot: opening the imported Sample Twitter Dashboard]

After opening the Twitter Dashboard, you will see graphs like the ones below.

[Screenshot: the Twitter dashboard with tweet counts, top hashtags and influencers]

You can create your own dashboards and add them to it. Hope you like this blog.

Realtime Data persistence into Cassandra Using STORM

Hi Folks,

Today I will be doing some real-time data generation and persisting that data into NoSQL (Cassandra) using Storm. We will create the keyspace and schema where the data will be stored, and see how Storm helps us store data into Cassandra on a real-time basis.

Let's see how we can do that. We will generate some sample data in Storm, process it, and then push it into Cassandra in real time.

Java files

CassandraSpout.java :- A spout is the entry point of processing in Storm; here we use it to generate the data (or read it from a source) and pass it to the bolt for further processing.

CassandraBolt.java :- This is the bolt that receives data from the spout and transforms it. After transforming, it connects to Cassandra and pushes the stream of tuples into it.

CassandraTopology.java :- This is the main topology class, which wires the spout and bolt together and drives the flow of tuples into Cassandra.

CassandraConnector.java :- This class takes care of creating the Cassandra schema and inserting data into it; the bolt uses this class to write each tuple into Cassandra (a rough sketch of such a schema follows these descriptions).

CassandraConstant.java :- An interface where all the constants used in this project are defined.

Connection.java :- The main connection class, used to connect to the Cassandra cluster and fetch cluster information.
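As a rough illustration only (the real keyspace, table and column names live in the code on GitHub and will differ), a keyspace and table for such a topology could be created from the shell with cqlsh like this:

cqlsh -e "
CREATE KEYSPACE IF NOT EXISTS storm_demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
CREATE TABLE IF NOT EXISTS storm_demo.events (
  id    int PRIMARY KEY,
  name  text,
  value text
);"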

Coding Part

It would be very lengthy to put all the code here, so you can download it from GitHub.

https://github.com/onlyricks/Storm-Cass.git

Thanks

Vikas Srivastava

 

 

Ldap Integration with Hadoop Cluster – Part 1

Hi Folks,

Some days ago I saw some warnings in the NameNode logs, like the ones below:

2015-09-05 06:06:35,317 WARN org.apache.hadoop.security.UserGroupInformation: No groups available for user vikas
2015-09-05 06:06:35,323 WARN org.apache.hadoop.security.UserGroupInformation: No groups available for user vikas
2015-09-05 06:06:35,334 WARN org.apache.hadoop.security.UserGroupInformation: No groups available for user vikas

These warnings appear because the NameNode cannot find a group for the user vikas; I never created this user on the NameNode host. There are two things we can do to remove this issue:

  1. Create, on the NameNode, every user who submits jobs from the edge node to the cluster.
  2. Set up LDAP on the NameNode so that it can resolve users from an LDAP server.

Hadoop shells out to the id command (roughly "id -Gn $user") to resolve a user's groups on the cluster. So let's set up LDAP so that we don't need to create all the users on the NameNode; instead of checking local users, it can fetch them from LDAP and authenticate.
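You can reproduce the lookup yourself on the NameNode host (a sketch; the exact flags Hadoop uses depend on the version and the configured group mapping):

id -gn vikas       # primary group
id -Gn vikas       # all groups
getent passwd vikas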

Let's understand what LDAP is and how it can solve our problem.

LDAP:-  Lightweight Directory Access Protocol, is an Internet protocol that email and other programs use to look up information from a server. LDAP is mostly used by medium-to-large organizations. If you belong to one that has an LDAP server, you can use it to look up contact info and other services on a network, and provide “single signon” where one password for a user is shared between many services. LDAP is appropriate for any kind of directory-like information, where fast lookups and less-frequent updates are the norm.

LDAP has two very common attribute types:

  1. cn (common name) :- The leaf entries of the organization, such as users and groups.
  2. dc (domain component) :- The components that define your domain name; for example hadoop.com becomes dc=hadoop,dc=com.

Requirement:- 

  • Disable SELinux (setenforce 0)
  • iptables should be stopped (service iptables stop)

Setup LDAP Server on any Node :- You need to install OpenLDAP on a node that will be treated as the master node, where the database of LDAP users and other entries will be stored.

 yum install -y openldap openldap-servers openldap-clients

The above command installs the LDAP server on your LDAP master. Now set up the configuration: first clear the /var/lib/ldap/ folder, then copy slapd.conf into /etc/openldap/:

cp /usr/share/openldap-servers/slapd.conf.obsolete /etc/openldap/slapd.conf

cp /usr/share/openldap-servers/DB_CONFIG.example /var/lib/ldap/DB_CONFIG 

Now generate a password for the LDAP admin user, which is Manager in my case:

slappasswd

It will ask you to type the password and gives you a hashed password, something like {SSHA}WDGprkqwN5ht7prIb8GoUXRRypDEHx83; copy this for future use. Now let's open slapd.conf:

vi /etc/openldap/slapd.conf

// change below values 

#TLSCACertificateFile /etc/pki/tls/certs/ca-bundle.crt
#TLSCertificateFile /etc/pki/tls/certs/slapd.pem
#TLSCertificateKeyFile /etc/pki/tls/certs/slapd.pem

access to *
by dn.exact="gidNumber=0+uidNumber=0,cn=peercred,cn=external,cn=auth" read
by dn.exact="cn=Manager,dc=hadoop,dc=com" read
by * none

database bdb
suffix "dc=hadoop,dc=com"
checkpoint 1024 15
rootdn "cn=Manager,dc=hadoop,dc=com"
rootpw {SSHA}WDGprkqwN5ht7prIb8GoUXRRypDEHx83

Now change the ownership of the files and folders to the ldap user:

chown -R ldap. /var/lib/ldap

// Below command will create configuration files setup in /etc/openldap/slapd.d 

slaptest -f  /etc/openldap/slapd.conf -F /etc/openldap/slapd.d 

// this might throw some warnings, but not to worry, they do not cause any issue; now change the ownership of this folder to ldap as well

chown -R ldap. /etc/openldap/slapd.d

// now you can restart the process 

service slapd restart

Now everything is set up on the LDAP master node, but no users or other objects have been created yet, so let's see how you can add objects to the LDAP server.

cat base.ldif

dn: dc=hadoop,dc=com
dc: hadoop
objectClass: dcObject
objectClass: organization
o: hadoop.com

dn: ou=users,dc=hadoop,dc=com
ou: users
objectClass: organizationalUnit
objectClass: top

dn: ou=groups,dc=hadoop,dc=com
ou: groups
objectClass: organizationalUnit
objectClass: top

You can see that above we have defined:

  1. o: the organization, which is hadoop.com
  2. ou: the organizationalUnit entries, which are users and groups

We still haven't defined any cn (common name) entries; those we will define later. Let's add this base structure to the LDAP server with the command below:

ldapadd -x -W -D "cn=Manager,dc=hadoop,dc=com" -f base.ldif

Enter LDAP Password:
adding new entry "dc=hadoop,dc=com"
adding new entry "ou=users,dc=hadoop,dc=com"
adding new entry "ou=groups,dc=hadoop,dc=com"

Now that you have added the above objects into LDAP, you can try searching for them with the ldapsearch command, like below:

ldapsearch -x -W -D "cn=Manager,dc=hadoop,dc=com" -b "dc=hadoop,dc=com"

Now let's add users into the users organizationalUnit:

vi users.ldif

dn: uid=vikas,ou=users,dc=hadoop,dc=com
objectClass: top
objectClass: account
objectClass: posixAccount
objectClass: shadowAccount
cn: vikas
uid: vikas
uidNumber: 16859
gidNumber: 100
homeDirectory: /home/vikas
loginShell: /bin/bash
gecos: vikas
userPassword: vikas

Add it with the same command we used earlier for the other objects. After that, you have a user added in LDAP, even though that user does not exist on the local server.

ldapadd -x -W -D "cn=Manager,dc=hadoop,dc=com" -f users.ldif

Now your server configuration is done. Next you need to set up the LDAP client on this server and on the other servers (NameNode and DataNodes), so that they can resolve users from the LDAP server.

yum -y install openldap openldap-clients nss-pam-ldapd pam_ldap nscd autofs rpcbind nfs-utils

After installing the clients on the nodes, you need to configure them to use LDAP as the authentication system. Let's see which files you need to edit for that.

vi /etc/openldap/ldap.conf

BASE dc=hadoop,dc=com
URI ldap://horse.hadoop.com:389/

Now you need to enable LDAP authentication for all users; let's set this up with the auth configuration tool:

authconfig-tui

[Screenshots: authconfig-tui screens enabling LDAP for user information and authentication]

authconfig --enablemkhomedir --update

Now you can see that the user added in LDAP can be resolved here as well:

getent passwd vikas

vikas:*:16859:100:vikas:/home/vikas:/bin/bash

Now your LDAP setup is complete on all the nodes. Check whether you can resolve this LDAP user from all of them; if yes, we can go ahead and plug this into the Hadoop cluster configuration for authentication. I will explain that in the next part of this blog.

Storm Overview and Implementation

Storm :- Storm is a distributed real-time computation system. A distinctive feature is that once a topology is started it runs forever until someone kills it; unlike batch tools, Storm operates over unbounded streams of tuples. It was originally developed at Twitter and later handed over to Apache to mature further. Here I am listing some of its features:

  • Fault tolerant
  • fail-fast / stateless
  • guaranteed data processing
  • scalable
  • Stability
  • distributed

Use-cases:- real-time Analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node.

The general use of Storm is to integrate it with an online source or data pipeline that produces a series of tuples which are ingested into Storm. Let's see a pictorial view of it.

[Diagram: Storm architecture – nimbus, zookeeper and supervisors on one side, and a topology of spouts and bolts on the other]

picture:- Internet

In the above diagram we can see components like nimbus, workers, spouts and bolts. Let's start with the first half of the picture, which shows the architecture of a Storm cluster including its processes and tasks, and understand what they are and what purpose they serve.

Nimbus:- Nimbus is the main process, similar to the JobTracker in Hadoop; for those not familiar with Hadoop, the JobTracker owns a job from start to finish. Nimbus is responsible for distributing tasks among the worker nodes and tracking whether they succeed or fail.

Supervisor:- The supervisor is the daemon that runs on each worker node and receives tasks from nimbus. It spawns worker processes to complete the tasks; how many depends on what nimbus assigns to it. It reports task status back to nimbus whenever required.

Zookeeper:- ZooKeeper is the coordination service between nimbus and the supervisors. Neither nimbus nor the supervisors keep their state themselves; it is maintained in ZooKeeper and on local disk, so they can be killed and restarted without loss and pick their state back up from ZooKeeper.

Topology:- A topology is the packaged application code, analogous to a map-reduce job, that defines how the data is processed. A topology consists of spouts and bolts, which are loosely comparable to map and reduce in Hadoop.

Spout:- A spout is the source of an unbounded stream of tuples. It reads tuples from the stream source and emits them to bolts for further processing.

Bolts:- Bolts receive streams of tuples and apply whatever logic we want; if needed, a bolt can emit a new stream of tuples to further bolts.

Transactional Topologies :- Storm has a feature called transactional topologies that let you achieve exactly-once messaging semantics for most computations.

Distributed RPC:- The idea behind distributed RPC (DRPC) is to parallelize the computation of really intense functions on the fly using Storm. The Storm topology takes in as input a stream of function arguments, and it emits an output stream of the results for each of those function calls.

Let's see how we can implement it on a cluster and test it with a dry run.

Requirement:- We need Java 1.6 or above, ZooKeeper, jzmq and zeromq. Let's see the commands to follow: create a storm user and export Java in its .profile.

// creating a user and added into hadoop group , update the password as well
$ useradd -u 506 -g 506 Storm -G hadoop
$ passwd Storm

// exporting java home in .profile of storm user
export STORM_HOME="/home/storm/storm-0.9.4"
export JAVA_HOME="/usr"
export PATH=$STORM_HOME/bin:$JAVA_HOME/bin:$PATH
export ZOOKEEPER_HOME="/home/storm/zookeeper-3.4.6"
export PATH=$ZOOKEEPER_HOME/bin:$PATH

Download Storm from here and execute the following commands to untar it and update the configuration.

$  wget http://mirrors.advancedhosters.com/apache/storm/apache-storm-0.9.4/apache-storm-0.9.4.tar.gz

$ tar -xvzf apache-storm-0.9.4.tar.gz 
$ mv apache-storm-0.9.4 storm-0.9.4
$ ll storm-0.9.4
[storm@kafka ~]$ ll storm-0.9.4/
total 240
drwxrwxr-x. 2 storm storm 4096 Jul 23 23:24 bin
-rw-r--r--. 1 storm storm 41375 Mar 18 11:52 CHANGELOG.md
drwxrwxr-x. 2 storm storm 4096 Jul 23 23:25 conf
drwxrwxr-x. 4 storm storm 4096 Jul 23 23:37 data
-rw-r--r--. 1 storm storm 538 Mar 18 11:51 DISCLAIMER
drwxr-xr-x. 3 storm storm 4096 Mar 18 11:51 examples
drwxrwxr-x. 5 storm storm 4096 Jul 23 22:53 external
drwxrwxr-x. 2 storm storm 4096 Jul 23 22:53 lib
-rw-r--r--. 1 storm storm 23004 Mar 18 11:51 LICENSE
drwxrwxr-x. 2 storm storm 4096 Jul 23 23:10 logback
drwxrwxr-x. 2 storm storm 4096 Jul 24 00:29 logs
-rw-------. 1 storm storm 109974 Jul 23 23:13 nohup.out
-rw-r--r--. 1 storm storm 981 Mar 18 11:51 NOTICE
drwxrwxr-x. 6 storm storm 4096 Jul 23 22:53 public
-rw-r--r--. 1 storm storm 10987 Mar 18 11:52 README.markdown
-rw-r--r--. 1 storm storm 6 Mar 18 12:02 RELEASE
-rw-r--r--. 1 storm storm 3581 Mar 18 11:52 SECURITY.md

Now we have to change Storm's configuration file conf/storm.yaml and update the nimbus and supervisor node information:

  • nimbus.host (host where nimbus is running)
  • storm.local.dir (where storm can store its /tmp/data)
  • supervisor.slots.ports (these are the slots which supervisor will use to run worker thread)
  • nimbus.thrift.port (this is used to connect to nimbus)
$ cat conf/storm.yaml
########### These MUST be filled in for a storm configuration
 storm.zookeeper.servers:
 - "storm1"

 storm.zookeeper.port: 2181
 nimbus.host: "storm"
 nimbus.childopts: "-Xmx1024m -Djava.net.preferIPv4Stack=true"
 ui.childopts: "-Xmx768m -Djava.net.preferIPv4Stack=true"
 supervisor.childopts: "-Djava.net.preferIPv4Stack=true"
 worker.childopts: "-Xmx768m -Djava.net.preferIPv4Stack=true"
 nimbus.thrift.port: 8627
 ui.port: 8772
 storm.local.dir: "/home/storm/storm-0.9.4/data"
 java.library.path: "/usr/lib/jvm/jre-1.7.0-openjdk.x86_64/"
 supervisor.slots.ports:
 - 6700
 - 6701
 - 6702
 - 6703
 storm.log.dir: "/home/storm/storm-0.9.4/logs/"

Now we are done with Storm itself. There are a few more things required to run the Storm cluster properly, namely zeromq, jzmq and ZooKeeper; you can download them directly by clicking on their names.

// we need to install some dependencies before build the zeromq and jzmq.
$ sudo yum install libuuid* uuid-* gcc-* git libtool*

// installation of zeromq 
$ wget http://download.zeromq.org/zeromq-2.1.7.zip
$ unzip zeromq-2.1.7.zip
$ cd zeromq-2.1.7
$ ./configure && make 

// installation of jzmq 

$ git clone https://github.com/nathanmarz/jzmq.git
$ cd jzmq
$ sed -i 's/classdist_noinst.stamp/classnoinst.stamp/g' src/Makefile.am
$ ./autogen.sh
$ ./configure && make install  

// Download zookeeper and configured it on nimbus node

$ wget http://www.webhostingreviewjam.com/mirror/apache/zookeeper/stable/zookeeper-3.4.6.tar.gz  
$ tar -xzf zookeeper-3.4.6.tar.gz 
$ mkdir zookeeper-3.4.6/data

// update the zookeeper configuration file conf/zoo.cfg

dataDir=/home/storm/zookeeper-3.4.6/data
# the port at which the clients will connect
clientPort=2181 
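If you prefer, start from the sample configuration that ships with ZooKeeper and then adjust the two settings above (paths assume the layout used in this post):

$ cd /home/storm/zookeeper-3.4.6
$ cp conf/zoo_sample.cfg conf/zoo.cfg
$ vi conf/zoo.cfg     # set dataDir and clientPort as shown above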

Now we have almost completed all the steps. We need to perform the same steps on all the worker nodes as well, so that each one has an identical configuration.

Now we need to start the services in the order below and check that everything is running fine on the nodes.

// on Nimbus node
$ zkServer.sh start 
$ nohup storm nimbus &
$ nohup storm supervisor &
$ nohup storm ui &  

// on all the worker nodes
$ nohup storm supervisor &

You can check the services running on a node with the jps command, like below:

[Storm@storm1 ~]$ jps
3354 core
3247 nimbus
3440 Jps
3332 supervisor
3083 QuorumPeerMain

Now you can view the web UI at http://storm1:8772
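For a quick dry run you can submit one of the storm-starter examples bundled with the 0.9.x binary distribution; the jar name below is illustrative, so list the directory first and adjust the path to whatever your release contains:

$ ls examples/storm-starter/
$ storm jar examples/storm-starter/storm-starter-topologies-0.9.4.jar storm.starter.WordCountTopology wordcount
$ storm list            # shows running topologies
$ storm kill wordcount  # stop it when done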

We will run sample code in the next blog on Storm.

Hope you find this blog useful

Apache Flink Overview and Implementation

Apache Flink is optimized for cyclic or iterative processes by using iterative transformations on collections. It is a platform for efficient, distributed, general-purpose data processing. This is achieved by an optimization of join algorithms, operator chaining and reuse of partitioning and sorting. However, Flink is also a strong tool for batch processing. Flink streaming processes data streams as true streams, i.e., data elements are immediately “pipelined” through a streaming program as soon as they arrive. This allows flexible window operations on streams.

Just like Spark, Flink is able to digest both batch and streaming data, but it also goes a step further: it is capable of analyzing streaming data while leveraging in-memory processing to enhance its overall processing speed. Apache Flink can help deliver results at the speed of today’s business.

Important processes

1. Job Manager :- The client submits the job graph to the Job Manager; the graph consists of operators (JobVertex) and intermediate datasets. Each operator has properties, like its parallelism and the code it executes. The Job Manager is responsible for distributing the execution of the job among all the Task Managers and seeing it through to completion. It creates an execution DAG, replicates it into multiple parallel instances and sends them to the Task Managers to execute in parallel.

2. Task Manager:- Each TaskManager will have one or more task slots, each of which can run one pipeline of parallel tasks. A pipeline consists of multiple successive tasks, such as the n-th parallel instance of a MapFunction together with the n-th parallel instance of a ReduceFunction. Note that Flink often executes successive tasks concurrently: for streaming programs that happens in any case, but for batch programs it also happens frequently. Here is a picture to make it simpler to understand.

[Diagram: Flink Job Manager distributing work to Task Managers with task slots and pipelines]

picture:- apache flink website

Let's see how we can set up a cluster to make it work.

Download Apache Flink from here. Choose the version you want to install according to your requirements; builds compatible with different Hadoop/HBase versions are available.

There are a few things to keep in mind before setting up the Flink cluster: use the same directory and username on each worker and master node; you can set up passwordless SSH from master to workers so that services can be started from the master itself (which I don't particularly recommend); and you need to create a user with a proper uid and gid, added to the hadoop group.

$ useradd -u 1000 -g 500 flink -G hadoop
$ passwd flink

It requires Java 1.6 or newer, which is easy enough to have. You need to install it and export it in your profile file, like below.

export JAVA_HOME=/usr
export PATH=$JAVA_HOME/bin:$PATH

Now you can untar it wherever you want and then edit its configuration file conf/flink-conf.yaml. Let's see the next steps.

tar -xzvf flink-0.9.0-bin-hadoop1.tgz
mv  flink-0.9.0-bin-hadoop1 flink-0.9.0

[flink@kafka flink-0.9.0]$ ll
total 88
drwxr-xr-x. 2 flink flink 4096 Jun 18 03:34 bin
drwxr-xr-x. 2 flink flink 4096 Jul 17 12:33 conf
drwxr-xr-x. 2 flink flink 4096 Jun 18 03:34 examples
drwxr-xr-x. 2 flink flink 4096 Jun 18 03:34 lib
-rw-r--r--. 1 flink flink 31086 Jun 18 03:34 LICENSE
drwxr-xr-x. 2 flink flink 4096 Jul 20 01:59 log
-rw-r--r--. 1 flink flink 17312 Jun 18 03:34 NOTICE
-rw-r--r--. 1 flink flink 1308 Jun 18 03:34 README.txt
drwxr-xr-x. 3 flink flink 4096 Jun 18 03:34 resources
-rw-rw-r--. 1 flink flink 6 Jul 17 12:29 slaves
drwxr-xr-x. 5 flink flink 4096 Jun 18 03:34 tools

Now you need to change the configuration file conf/flink-conf.yaml and set:

  • the amount of available memory per TaskManager (taskmanager.heap.mb),
  • the number of available CPUs per machine (taskmanager.numberOfTaskSlots),
  • the total number of CPUs in the cluster (parallelism.default) and
  • the temporary directories (taskmanager.tmp.dirs)
jobmanager.rpc.address: $hostname
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 256
taskmanager.heap.mb: 512
taskmanager.numberOfTaskSlots: 4
parallelism.default: 3

jobmanager.web.port: 8081
webclient.port: 8080
state.backend: jobmanager

# directories for tmp files used by each TaskManager; multiple paths can be given separated by ':',
# e.g. taskmanager.tmp.dirs: /data1/tmp:/data2/tmp:/data3/tmp

 taskmanager.tmp.dirs: /tmp

Now you need to add all the slaves/worker nodes in conf/slaves file

flink1
flink2
flink3

Configuration is done; now we are ready to start the services and run a sample job.

$ bin/start-cluster.sh
$ Starting Job Manager
$ Starting task manager on host flink1

You can check the services running on node by jps command.
$ jps
2733 TaskManager
2611 JobManager
2764 Jps

If you want to run the cluster in streaming mode, you just need to start the streaming services with the command below.

$ bin/start-cluster-streaming.sh

Let's run a sample job to test that it works. Create a file containing a few paragraphs of text, save it as words.txt, and run the command below with its input and output paths.

$ bin/flink run ./examples/flink-java-examples-0.9.0-WordCount.jar file://`pwd`/../words.txt file://`pwd`/../wordcount-result3.txt

07/27/2015 00:51:37 Job execution switched to status RUNNING.
07/27/2015 00:51:37 CHAIN DataSource (at getTextDataSet(WordCount.java:142) (org.apache.flink.api.java.io.TextInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)(1/1) switched to SCHEDULED
07/27/2015 00:51:38 CHAIN DataSource (at getTextDataSet(WordCount.java:142) (org.apache.flink.api.java.io.TextInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)(1/1) switched to DEPLOYING
07/27/2015 00:51:40 CHAIN DataSource (at getTextDataSet(WordCount.java:142) (org.apache.flink.api.java.io.TextInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)(1/1) switched to RUNNING
07/27/2015 00:51:43 Reduce (SUM(1), at main(WordCount.java:72)(1/1) switched to SCHEDULED
07/27/2015 00:51:43 Reduce (SUM(1), at main(WordCount.java:72)(1/1) switched to DEPLOYING
07/27/2015 00:51:43 Reduce (SUM(1), at main(WordCount.java:72)(1/1) switched to RUNNING
07/27/2015 00:51:43 CHAIN DataSource (at getTextDataSet(WordCount.java:142) (org.apache.flink.api.java.io.TextInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)(1/1) switched to FINISHED
07/27/2015 00:51:44 DataSink (CsvOutputFormat (path: file:/home/flink/wordcount-result3.txt, delimiter: ))(1/1) switched to SCHEDULED
07/27/2015 00:51:44 DataSink (CsvOutputFormat (path: file:/home/flink/wordcount-result3.txt, delimiter: ))(1/1) switched to DEPLOYING
07/27/2015 00:51:44 Reduce (SUM(1), at main(WordCount.java:72)(1/1) switched to FINISHED
07/27/2015 00:51:44 DataSink (CsvOutputFormat (path: file:/home/flink/wordcount-result3.txt, delimiter: ))(1/1) switched to RUNNING
07/27/2015 00:51:45 DataSink (CsvOutputFormat (path: file:/home/flink/wordcount-result3.txt, delimiter: ))(1/1) switched to FINISHED
07/27/2015 00:51:45 Job execution switched to status FINISHED.

You can see the result in the output file just by cat-ing it.

$ cat /home/flink/wordcount-result3.txt | head
000 3
08 1
1 21
10 2
100 1
12 1
1604 1
1787 3
1971 1
1990 13
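The same example can also read from and write to HDFS, provided you downloaded a Hadoop-compatible Flink build and your HDFS is reachable; a sketch with illustrative paths:

$ hadoop fs -put words.txt /user/flink/words.txt
$ bin/flink run ./examples/flink-java-examples-0.9.0-WordCount.jar hdfs:///user/flink/words.txt hdfs:///user/flink/wordcount-result.txt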

Flink is pretty simple to set up and run, and it is also very fast compared to the alternatives; hopefully you find it useful. Please comment your reviews if I have missed something.

Thank you

Big Data Tools Overview

I am writing this because I have seen many people looking for big data tools from different perspectives. Many know the names but are not really aware of their usability, or of which tool should be used when. So let's have a look at the tools available in the different streams.

Database – Data Warehousing :- Data warehousing is used for reporting and data analysis. Big data, however, requires a different kind of data warehouse than the traditional ones used in past years. There are multiple open source data warehouses available for different purposes.

  • InfoBright :- It offers a scalable data warehouse that can store up to 50 TB of data, with data compression of up to 40:1 for better performance. Next to the open source edition they also offer commercial products based on the same technology. It is especially designed to analyse large amounts of machine-generated data, and the latest edition has near-real-time analysis capabilities.
  • Cassandra :- It is a NoSQL database that was initially created by Facebook to serve its messaging, and is now managed by Apache. It is mainly used by large organizations that have massive, active databases. Companies such as Twitter, Cisco and Netflix use it to optimize their databases. Commercial support and services are also available for Cassandra.
  • Apache HBase :- Also a product of the Apache Foundation, it offers linear and modular scalability. It is the non-relational data store for Hadoop. HBase is used by companies that need random, real-time read/write access to big data. Its objective is to host very large tables, with billions of rows and millions of columns, on commodity hardware.
  • Riak :- It is an open source distributed database that is scalable and fault tolerant. It is specifically architected to replicate and retrieve data intelligently so that read and write operations succeed even when individual nodes fail. Users can even lose access to nodes without losing data. Riak's customers include, among others, the Danish government, Boeing and Kipp.me.
  • Apache Hive :- This is Hadoop's data warehouse and it uses a SQL-like language called HQL. It promises easy data summarization, ad-hoc queries and other analysis of big data. It uses a mechanism to project structure onto data, while allowing Map/Reduce programmers to plug in custom mappers and reducers. Hive is an open source volunteer project under the Apache Software Foundation.

There are various others in the market, but the above are some of the most widely used databases in the big data basket.

In-Memory Open Source Tools :- With the increasing amount of data that needs to be processed in real time, in-memory processing is gaining traction. Here are a few in-memory open source tools.

  • SAP HANA :- SAP HANA is an in-memory, column-oriented, relational database management system developed and marketed by SAP SE. HANA’s architecture is designed to handle both high transaction rates and complex query processing on the same platform. Sap Hana was previously called SAP High-Performance Analytic Appliance.
  • MemSQL:- MemSQL is a distributed, relational database for transactions and analytics at scale. Querying is done through standard SQL drivers and syntax, leveraging a broad ecosystem of drivers and applications.  MemSQL has a two-tiered, clustered architecture. Each instance of the MemSQL program is called a “node”, and runs identical software. The only difference is the role the nodes are configured to play.

Big Data Analytics Platforms :- There are tools in the market that serve well as data analytics platforms; some of them are listed below.

  • Storm :- Earlier owned by Twitter and now under the Apache Foundation, it is a real-time distributed computation system. Much as Hadoop does for batch, it provides general primitives for performing real-time analysis. Storm is easy to use, works with any language, and is very scalable and fault tolerant.
  • HPCC :- It stands for High Performance Computing Cluster and was developed by LexisNexis Risk Solutions. It is similar to Hadoop, but it claims to offer superior performance. There are free and paid versions available. It works with structured and unstructured data, scales from one to thousands of nodes, and offers high-performance, parallel big data processing.
  • Spark:- Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and spark Streaming.
  • Hadoop :-You simply can’t talk about big data without mentioning Hadoop. The Apache distributed data processing software is so pervasive that often the terms “Hadoop” and “big data” are used synonymously. The Apache Foundation also sponsors a number of related projects that extend the capabilities of Hadoop, and many of them are mentioned below. In addition, numerous vendors offer supported versions of Hadoop and related technologies. Operating System: Windows, Linux, OS X.
  • Flink :- Flink is an open-source Big Data system that fuses processing and analysis of both batch and streaming data. The data-processing engine, which offers APIs in Java and Scala as well as specialized APIs for graph processing, is presented as an alternative to Hadoop’s MapReduce component with its own runtime. Yet the system still provides access to Hadoop’s distributed file system and YARN resource manager.

Business Intelligence :- BI has always been an important part of data warehousing. Here are some of the BI tools available specifically for big data.

  • Talend :- Talend makes a number of different business intelligence and data warehouse products, including Talend Open Studio for Big Data, which is a set of data integration tools that support Hadoop, HDFS, Hive, Hbase and Pig. The company also sells an enterprise edition and other commercial products and services.
  • Jaspersoft :- Jaspersoft boasts that it makes “the most flexible, cost effective and widely deployed business intelligence software in the world.” The link above primarily discusses the commercial versions of its applications, but you can find the open source versions as well, including the Big Data Reporting Tool.
  • Birt :- Short for “Business Intelligence and Reporting Tools,” BIRT is an Eclipse-based tool that adds reporting features to Java applications. Actuate is a company that co-founded BIRT and offers a variety of software based on the open source technology.
  • Pentaho :- Used by more than 10,000 companies, Pentaho offers business and big data analytics tools with data mining, reporting and dashboard capabilities. See the Pentaho Community Wiki for easy access to the open source downloads.

Data Mining Tools :- Data mining, an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data sets using methods at the intersection of artificial intelligence, machine learning and statistics.

  • RapidMiner :- RapidMiner claims to be “the world-leading open-source system for data and text mining.” Rapid Analytics is a server version of that product. In addition to the open source versions of each, enterprise versions and paid support are also available from the same site.
  • Mahout:- This Apache project offers algorithms for clustering, classification and batch-based collaborative filtering that run on top of Hadoop. The project’s goal is to build scalable machine learning libraries.
  • Rattle:- Rattle, the “R Analytical Tool To Learn Easily,” makes it easier for non-programmers to use the R language by providing a graphical interface for data mining. It can create data summaries (both visual and statistical), build models, draw graphs, score datasets and more.
  • KEEL :- KEEL stands for “Knowledge Extraction based on Evolutionary Learning,” and it aims to help users assess evolutionary algorithms for data mining problems like regression, classification, clustering and pattern mining. It includes a large collection of existing algorithms that can be used to compare against new algorithms.

Big Data Search Tools :- Implementing search over big data is obviously a big task; here we have some of the best tools for it.

  • Lucene:- The self-proclaimed “de facto standard for search libraries,” Lucene offers very fast indexing and searching for very large datasets. In fact, it can index over 95GB/hour when using modern hardware.
  • Solr:- Solr is an enterprise search platform based on the Lucene tools. It powers the search capabilities for many large sites, including Netflix, AOL, CNET and Zappos.

Big Data Aggregation and Transfer :- Tools used to pull logs and transfer data from different sources into the big data platform.

  • Sqoop:- Sqoop transfers data between Hadoop and RDBMSes and data warehouses. As of March of this year, it is now a top-level Apache project.
  • Flume:- Another Apache project, Flume collects, aggregates and transfers log data from applications to HDFS. It’s Java-based, robust and fault-tolerant.
  • Chukwa:- Built on top of HDFS and MapReduce, Chukwa collects data from large distributed systems. It also includes tools for displaying and analyzing the data it collects

Messaging Systems :- These produce messages continuously and provide them to real-time systems for processing.

  • Kafka:- Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of co-ordinated consumers
  • RabbitMq :- Messaging enables software applications to connect and scale. Applications can connect to each other, as components of a larger application, or to user devices and data. Messaging is asynchronous, decoupling applications by separating sending and receiving data.

I hope you will find this useful. Thanks!