Andreas François Vermeulen - Practical Data Science: A Guide to Building the Technology Stack for Turning Data Lakes into Business Assets

Year: 2018. Publisher: Apress. Genre: Computer.


Learn how to build a data science technology stack and perform good data science with repeatable methods. You will learn how to turn data lakes into business assets.
The data science technology stack demonstrated in Practical Data Science is built from components in general use in the industry. Data scientist Andreas Vermeulen demonstrates in detail how to build and provision a technology stack to yield repeatable results. He shows you how to apply practical methods to extract actionable business knowledge from data lakes consisting of data from a polyglot of data types and dimensions.
What You'll Learn
  • Become fluent in the essential concepts and terminology of data science and data engineering
  • Build and use a technology stack that meets industry criteria
  • Master the methods for retrieving actionable business knowledge
  • Coordinate the handling of polyglot data types in a data lake for repeatable results
Who This Book Is For
Data scientists and data engineers who are required to convert data from a data lake into actionable knowledge for their business, and students who aspire to be data scientists and data engineers.

Andreas François Vermeulen 2018
Andreas François Vermeulen, Practical Data Science
10. Transform Superstep
Andreas François Vermeulen (1)
(1) West Kilbride, North Ayrshire, UK
The Transform superstep allows you, as a data scientist, to take data from the data vault and formulate answers to questions raised by your investigations. The transformation step is the data science process that converts results into insights.
It uses standard data science techniques and methods to attain insight and knowledge about the data, which can then be transformed into actionable decisions. Through storytelling, you can explain to non-data scientists what you have discovered in the data lake.
Any source code or other supplementary material referenced by me in this book is available to readers on GitHub, via this book's product page.
Transform Superstep
The Transform superstep uses the data vault from the Process step as its source data. The transformations are tuned to work with the five dimensions of the data vault. As a reminder of what the structure looks like, see Figure 10-1.
Figure 10-1
Five categories of data
Dimension Consolidation
The data vault consists of five categories of data, with linked relationships and additional characteristics in satellite hubs.
To perform dimension consolidation, you start with a given relationship in the data vault and construct a sun model for that relationship, as shown in Figure 10-2.
Figure 10-2
T-P-O-L-E high-level design
I will cover the example of a person being born, to illustrate the consolidation process.
Note
You will need a Python editor to complete this chapter. Please start it, and then we can proceed to the data science required.
Open a new file in the Python editor and save it as Transform-Gunnarsson_is_Born.py in directory ..\VKHCG\01-Vermeulen\04-Transform. You will require the Python ecosystem, so set it up by adding the following libraries to your editor:
import sys
import os
from datetime import datetime
from datetime import timedelta
from pytz import timezone, all_timezones
import pandas as pd
import sqlite3 as sq
from pandas.io import sql
import uuid
pd.options.mode.chained_assignment = None
Find the working directory of the examples.
################################################################
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
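The platform check above can also be wrapped in a small function, which makes the rule easy to test without changing machines. This is a minimal sketch and not the book's code: `find_base` and its `platform` and `home` parameters are hypothetical names introduced here.

```python
import os
import sys


def find_base(platform=sys.platform, home=None):
    """Return the VKHCG base directory for a platform string.

    Hypothetical helper that mirrors the chapter's rule: Linux uses
    ~/VKHCG and every other platform uses C:/VKHCG.
    """
    if platform == 'linux':
        # Resolve the home directory only when the caller did not supply one.
        if home is None:
            home = os.path.expanduser('~')
        return home + '/VKHCG'
    return 'C:/VKHCG'


print('Working Base :', find_base(), ' using ', sys.platform)
```

Passing the platform as an argument, for example `find_base('win32')`, exercises the non-Linux branch on any machine.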
Set up the company you are processing.
Company='01-Vermeulen'
Add the company work space:
sDataBaseDir=Base + '/' + Company + '/04-Transform/SQLite'
if not os.path.exists(sDataBaseDir):
    os.makedirs(sDataBaseDir)
sDatabaseName=sDataBaseDir + '/Vermeulen.db'
conn1 = sq.connect(sDatabaseName)
Add the data vault.
sDataVaultDir=Base + '/88-DV'
if not os.path.exists(sDataVaultDir):
    os.makedirs(sDataVaultDir)
sDatabaseName=sDataVaultDir + '/datavault.db'
conn2 = sq.connect(sDatabaseName)
Add the data warehouse.
sDataWarehouseDir=Base + '/99-DW'
if not os.path.exists(sDataWarehouseDir):
    os.makedirs(sDataWarehouseDir)
# The warehouse file belongs in 99-DW, not in the data vault directory.
sDatabaseName=sDataWarehouseDir + '/datawarehouse.db'
conn3 = sq.connect(sDatabaseName)
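The three setup blocks above repeat the same pattern: build a directory name, create the directory if it is missing, and open an SQLite file inside it. A minimal sketch of that pattern as one helper follows. The name `open_database` is hypothetical, and a temporary directory stands in for the real VKHCG tree so that the example can run anywhere.

```python
import os
import sqlite3 as sq
import tempfile


def open_database(base, sub_dir, file_name):
    """Create base/sub_dir if needed and open an SQLite database in it."""
    directory = os.path.join(base, sub_dir)
    os.makedirs(directory, exist_ok=True)  # no error if it already exists
    path = os.path.join(directory, file_name)
    return path, sq.connect(path)


# Assumption: a throwaway base directory replaces C:/VKHCG or ~/VKHCG.
demo_base = tempfile.mkdtemp()
vault_path, vault_conn = open_database(demo_base, '88-DV', 'datavault.db')
dw_path, dw_conn = open_database(demo_base, '99-DW', 'datawarehouse.db')
print(vault_path)
print(dw_path)
vault_conn.close()
dw_conn.close()
```

Because the directory and the file name are always joined in one place, a database file cannot be paired with the wrong directory by accident.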
Execute the Python code now, to set up the basic ecosystem.
Note
The new data structure, called a data warehouse, is in directory ../99-DW.
The data warehouse is the only data structure delivered from the Transform step. Let's look at a real-world scenario.
Guðmundur Gunnarsson was born on December 20, 1960, at 9:15 in Landspítali, Hringbraut 101, 101 Reykjavík, Iceland. Following is what I would expect to find in the data vault.
Time
You need a date and time of December 20, 1960, at 9:15 in Reykjavík, Iceland. Enter the following code into your editor (you start with a UTC time):
print('Time Category')
print('UTC Time')
BirthDateUTC = datetime(1960,12,20,10,15,0)
BirthDateZoneUTC=BirthDateUTC.replace(tzinfo=timezone('UTC'))
# Assumed definition: the listing uses BirthDateZoneStr later without defining it.
BirthDateZoneStr=BirthDateZoneUTC.strftime("%Y-%m-%d %H:%M:%S")
BirthDateZoneUTCStr=BirthDateZoneUTC.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")
print(BirthDateZoneUTCStr)
Formulate a Reykjavík local time.
print('Birth Date in Reykjavik :')
BirthZone = 'Atlantic/Reykjavik'
BirthDate = BirthDateZoneUTC.astimezone(timezone(BirthZone))
BirthDateStr=BirthDate.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")
# BirthDate exists only from this point on, so the local string is built here.
BirthDateLocal=BirthDate.strftime("%Y-%m-%d %H:%M:%S")
print(BirthDateStr)
You have successfully discovered the time key for the time hub and the time zone satellite for Atlantic/Reykjavik.
  • Time Hub: The UTC date and time that corresponds to December 20, 1960, at 9:15 in Reykjavík, Iceland, is as follows:
    1960-12-20 10:15:00 (UTC) (+0000)
  • Time Satellite: Birth date in Reykjavík:
    1960-12-20 09:15:00 (-01) (-0100)
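The relationship between the two values above can be checked with the standard library alone. The sketch below is not the chapter's pytz code. It assumes a fixed UTC-01:00 offset as a stand-in for Atlantic/Reykjavik in 1960, taken from the (-0100) offset in the output shown above; the name `REYKJAVIK_1960` is hypothetical. A named-zone lookup can return a different historical offset, depending on the time zone database that is installed.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC-01:00 offset, matching the (-0100) in the chapter output.
REYKJAVIK_1960 = timezone(timedelta(hours=-1), 'Reykjavik-1960')

# Time hub value: the instant expressed in UTC.
birth_utc = datetime(1960, 12, 20, 10, 15, 0, tzinfo=timezone.utc)

# Time satellite value: the same instant viewed on the local wall clock.
birth_local = birth_utc.astimezone(REYKJAVIK_1960)

fmt = '%Y-%m-%d %H:%M:%S (%z)'
print(birth_utc.strftime(fmt))    # 1960-12-20 10:15:00 (+0000)
print(birth_local.strftime(fmt))  # 1960-12-20 09:15:00 (-0100)

# Aware datetimes compare by instant, so the hub and satellite values are equal.
print(birth_utc == birth_local)   # True
```

Storing the UTC value in the hub and the zone-specific view in a satellite means that one instant can carry any number of local representations without duplicating the key.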
Now you can save your work, by adding the following to your code.
Build a data frame, as follows:
IDZoneNumber=str(uuid.uuid4())
sDateTimeKey=BirthDateZoneStr.replace(' ','-').replace(':','-')
TimeLine=[('ZoneBaseKey', ['UTC']),
          ('IDNumber', [IDZoneNumber]),
          ('DateTimeKey', [sDateTimeKey]),
          ('UTCDateTimeValue', [BirthDateZoneUTC]),
          ('Zone', [BirthZone]),
          ('DateTimeValue', [BirthDateStr])]
# DataFrame.from_items was removed in pandas 1.0; a dict keeps the column order.
TimeFrame = pd.DataFrame(dict(TimeLine))
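The key derivation and the record layout above can be sketched without pandas, which helps when you only want to inspect the values. This is an illustration under assumptions, not the book's method: `time_record`, `hub_columns`, and `satellite_columns` are hypothetical names, and the DateTimeValue string is copied from the chapter's printed output.

```python
import uuid
from datetime import datetime, timezone

birth_utc = datetime(1960, 12, 20, 10, 15, 0, tzinfo=timezone.utc)

# Same result as formatting with spaces and colons and then replacing them.
date_time_key = birth_utc.strftime('%Y-%m-%d-%H-%M-%S')

# One record holding every time attribute used in this section.
time_record = {
    'ZoneBaseKey': 'UTC',
    'IDNumber': str(uuid.uuid4()),  # surrogate key, random on every run
    'DateTimeKey': date_time_key,
    'Zone': 'Atlantic/Reykjavik',
    'DateTimeValue': '1960-12-20 09:15:00 (-01) (-0100)',
}

# The hub and the satellite are column subsets that share the surrogate key.
hub_columns = ['IDNumber', 'ZoneBaseKey', 'DateTimeKey', 'DateTimeValue']
satellite_columns = ['IDNumber', 'DateTimeKey', 'Zone', 'DateTimeValue']
time_hub = {name: time_record[name] for name in hub_columns}
time_satellite = {name: time_record[name] for name in satellite_columns}

print(date_time_key)  # 1960-12-20-10-15-00
print(sorted(set(time_satellite) - set(time_hub)))  # ['Zone']
```

The only attribute that the satellite adds beyond the hub is the zone, which is why each time zone gets its own satellite table.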
Create the time hub.
TimeHub=TimeFrame[['IDNumber','ZoneBaseKey','DateTimeKey','DateTimeValue']]
TimeHubIndex=TimeHub.set_index(['IDNumber'],inplace=False)
sTable = 'Hub-Time-Gunnarsson'
print('\n#################################')
print('Storing :',sDatabaseName,'\n Table:',sTable)
print('\n#################################')
TimeHubIndex.to_sql(sTable, conn2, if_exists="replace")
sTable = 'Dim-Time-Gunnarsson'
TimeHubIndex.to_sql(sTable, conn3, if_exists="replace")
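To confirm what was stored, you can read the hub back with plain SQL. The sketch below uses an in-memory SQLite database as a stand-in for datavault.db, and the row values are illustrative placeholders rather than data written by the chapter's script. Note that table names containing hyphens, such as Hub-Time-Gunnarsson, must be double-quoted in SQL statements.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # assumption: stand-in for datavault.db
table = 'Hub-Time-Gunnarsson'

# Hyphens are not valid in bare SQL identifiers, so the name is double-quoted.
conn.execute(
    f'CREATE TABLE "{table}" ('
    'IDNumber TEXT PRIMARY KEY, ZoneBaseKey TEXT, '
    'DateTimeKey TEXT, DateTimeValue TEXT)'
)
conn.execute(
    f'INSERT INTO "{table}" VALUES (?, ?, ?, ?)',
    ('demo-id', 'UTC', '1960-12-20-10-15-00',
     '1960-12-20 09:15:00 (-01) (-0100)'),
)
conn.commit()

# Read the hub back to verify the stored key columns.
rows = conn.execute(
    f'SELECT ZoneBaseKey, DateTimeKey FROM "{table}"'
).fetchall()
print(rows)  # [('UTC', '1960-12-20-10-15-00')]
conn.close()
```

The same quoted SELECT works against the real database files once the chapter's script has run, using a connection opened on datavault.db or datawarehouse.db.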
Create the time satellite.
TimeSatellite=TimeFrame[['IDNumber','DateTimeKey','Zone','DateTimeValue']]
TimeSatelliteIndex=TimeSatellite.set_index(['IDNumber'],inplace=False)
BirthZoneFix=BirthZone.replace(' ','-').replace('/','-')
sTable = 'Satellite-Time-' + BirthZoneFix + '-Gunnarsson'
print('\n#################################')
print('Storing :',sDatabaseName,'\n Table:',sTable)
print('\n#################################')
TimeSatelliteIndex.to_sql(sTable, conn2, if_exists="replace")
sTable = 'Dim-Time-' + BirthZoneFix + '-Gunnarsson'