Populating Cloud Data Warehouses with Apache NiFi


Problem:


Constantin, a Master Data Engineer pointed out that JSON, Parquet or Avro are the most used semi-structured data types, but some Enterprise Applications still produce XML!  Parsing XML is expensive,  Ingesting XML is challenging and Large XML files are hard to chunk for parallel ingest.  We still need this data and don't want to write custom code to parse it.   So let's break it down and ingest this data at scale.

Source:   https://github.com/tspannhw/clouddatawarehouse


Ingest Raw Data:



<?xml version='1.0' encoding='utf-8'?><rss xmlns:dc='http://purl.org/dc/elements/1.1/' version='2.0'><channel><title>travel.state.gov: Travel Advisories</title><link>travel.state.gov: Travel Advisories</link><description>http://travel.state.gov/content/travel/en/traveladvisories/traveladvisories.html</description><item><title>Bhutan - Level 1: Exercise Normal Precautions</title><pubDate>Fri, 25 Jan 2019</pubDate><link>http://travel.state.gov/content/travel/en/traveladvisories/traveladvisories/bhutan-travel-advisory.html</link><guid>http://travel.state.gov/content/travel/en/traveladvisories/traveladvisories/bhutan-travel-advisory.html</guid><dc:identifier>BT</dc:identifier><description><![CDATA[<p>Exercise normal precautions in Bhutan.&nbsp;</p>]]></description></item>



Apache NiFi Flow to Read US Travel Advisors XML




Let's Ingest Those Advisors For Fast Queries and ML Analysis








Any Data, Anywhere, Anytime


With Apache NiFi you can populate Cloud Data Warehouses with ease directly through JDBC or to S3 buckets from sources including email, REST, Logs, Databases, NoSQL stores, Hadoop, XML, PDF, Word, AVRO, JSON, CSV, Text, Syslog and more.   For this example I am grabbing XML/RSS feeds from public sources on travel advisors around the world for US citizens.   It is easy to route, cleanse, transform and store at scale in our hyperscale Cloud Data Warehouse regardless of cloud.

Source URL:   https://travel.state.gov/_res/rss/TAsTWs.xml



Table Design

Let's create some tables. You should design an ER schema and use either 3NF, Snowflake Schema, Star Schema or Data Vault. Here are two simple tables to contain all the fields we have. We may want to add a primary key and date / timestamp to our travelwarning table as well.


CREATE TABLE TRAVELWARNING ( title VARCHAR,  pubDate VARCHAR,  link VARCHAR,  guid VARCHAR,  identifier VARCHAR,  description VARCHAR );

CREATE TABLE TRAVELWARNING_RAW( TW_WARN_ID NUMBER(38,0) identity primary key, TW_WARNING VARIANT, TW_CREATE_DT DATE DEFAULT CURRENT_DATE );



Connecting to A Cloud Data Warehouse
Image
We just need a JDBC Driver, Driver class name, URL, username, password and download JAR.   Now we can connect for queries or inserts with ease at full speed.   Make sure you download the latest version of the JDBC driver you need and store it somewhere that the Apache NiFi user can use it.  I recommend a special directory for these drivers.  
We can easily use record processing to ingest 1,000s of records and convert them to useful types like JSON or AVRO before inserting them into our data warehouse.
Let's insert rows into our Cloud Data Warehouse on AWS as raw JSON into a VARIANT column using PutSQL.
Image

We can also use Records to easily infer the schema and push to our our regularly defined table using PutDatabaseRecord.

We can now query the data as it is arriving.
Image

Other Ingest Options

We can also land the JSON data in S3 and have that ingested via Snowpipe or bulk load with COPY

Examples of Other Data We Can Ingest




Useful Cloud Data Warehouse Resources







Comments

  1. I appreciate the time and effort you put into this article about Azure Data Migration Services. This information is both useful and lavish. I'd like to express my gratitude to everyone who contributed to this article.

    ReplyDelete
  2. The article you've shared here is fantastic because it provides some excellent information that will be incredibly beneficial to me. Thank you for sharing that. Keep up the good work. warehouse Japan

    ReplyDelete
  3. I always prefer to read the quality content and this thing I found in you post. I am really thank full for you for this post. we also provide service for warehouse Japan. for more info visit our website.

    ReplyDelete
  4. I appreciate the work and effort that went into this article. This paragraph provides you with a clear picture. Without a question, it is an insightful piece that may assist us in expanding our understanding.Vietnam Import Data

    ReplyDelete
  5. Play Slots Online with CHLOE Casino
    Explore Play Slots Online with งานออนไลน์ CHLOE Casino ☆ Sign up now for instant play ✚ deccasino Best online slots casino bonuses ⏩Play choegocasino the most popular

    ReplyDelete
  6. Where to Bet on Sports To Bet On Sports In Illinois
    The best sports bet types and https://deccasino.com/review/merit-casino/ bonuses available in worrione Illinois. The most common sports betting options 토토 사이트 코드 available. Bet $20, 출장샵 Win $150, Win $100 or wooricasinos.info

    ReplyDelete
  7. Chrome plating is a way of electroplating a thin layer of chromium onto a metal object. The chromed layer could be decorative, present corrosion resistance, ease cleaning procedures, or Stockings improve floor hardness. Sometimes, a less expensive imitator of chrome could also be} used for aesthetic functions. CNC Turning CNC turning with live tooling combines each lathe and mill capabilities to machine parts with cylindrical options from metal rod inventory. DEK is your one-stop associate for high-quality merchandise, we make on-demand machining & manufacturing simple and quick, from prototyping to end-part production. This is exhausting and fast|a set} seat for rotating shaft of medical tools, which is used to attach and fix two rotating shafts.

    ReplyDelete
  8. Where 3D printing was only appropriate for prototyping and one-off manufacturing within the early levels, it is now quickly remodeling right into a production technology. Describe your thought and requirements, and our staff of CAD engineers and digital sculptors will create a three-dimensional design in STL, OBJ, STEP, or IGES file format. A comprehensive information Fedora Hats on designing elements for Binder Jetting, covering the printing course of, design specifications and materials choices. Most dedicated 3D printing file preparation software program can repair these errors, however success is not at all times guaranteed. It is nice design practice to combine all our bodies right into a single stable within the native CAD software program before exporting them into STL.

    ReplyDelete
  9. A teaser is a sort of soccer or basketball wager in which the pointspread is adjusted by further points in the player’s favor on multiple sport. In soccer, a participant might transfer the posted pointspread 6, 6.5, 7, 10 or 14 points . In basketball, four, 5 and 6-point sports betting teasers are supplied. Placing a sports wager used to mean going to a sportsbook and standing in line. Now most people 제왕카지노 tend to decide on} on-line sportsbooks to put bets, but in today’s day and age, {you can also|you might also|you can even} use some innovativesports betting apps seen on Basketball Insiders.

    ReplyDelete

Post a Comment