Data scientist questionnaire

An understanding of modern data software and tools, from cloud computing to the many existing commercial products used for data analytics

As a programmer and a data scientist, I have come across numerous data tools. Although several business analysts complain that there is very little use for pre-packaged, generally inflexible commercial tools, I have myself found value in the quality of a commercial solution such as Matlab, which offers real-time signal processing capabilities that I cannot find in open tools.

When I have done analysis work outside of academia, I have been interested in ten user-friendly free tools that help visualize data and solve data analysis problems:

Tableau Public; OpenRefine (formerly Google Refine); KNIME; RapidMiner; Google Fusion Tables; NodeXL; Import.io; Google Search Operators; Solver; and WolframAlpha. These are mostly well suited to business users. However, for those who need the quality, support, and infrastructure of a commercial solution, I recommend SAS or SPSS.

For programming work, if my project is not a real-time system or a hardware design, I have found R and Python libraries to be valuable resources. R has great graphics capabilities, and hence users who are involved in analytics may find it better suited. Similarly, Python can be a great tool, especially if you need a web-based solution for your users.

 

A knowledge of modern data-centric programming languages, including one of Python or R, and development languages, including one of C++ or Java *

In the past ten years, I have worked with Python and R for machine learning in several research collaborations, but most of my research tasks could be completed within Matlab, which is well supported in academia and a must-have when working with real-time signal processing systems. When my research projects move down to the firmware and hardware design level, C++ is used most of the time.

For example, while I carried out experiments at the Higgins laboratory, which studies "from brains to robots, and everything in between" with Prof. Charles M. Higgins, we used an FPGA as a special-purpose chip on my robot platform to help interpret neural signals from a living dragonfly. At the algorithm design level, I used Matlab scripts; I then converted the algorithm into firmware for the robot using C++, which was transferred further down to the FPGA.

During my research at the University of Sydney, my work has involved a great deal of data science with medical data. Hence, apart from Matlab, I incorporate other analysis tools such as R and Python. Using several languages at the same time helps me broaden a project's scope and quickly find the right approach to new research problems.

Experience in the use of complex data sets 

Dealing with medical data, I understand that it is very important to appreciate both the current complexity of the data and its potential complexity in the future. The main reason is that complex data require additional work to prepare and model before they are “ripe” for analysis and visualization.

Some signs of complexity I have noted are: size (petabyte-scale data may require shifting compressed data into memory, and very tall or wide data may require special tools); dispersion (data stored in multiple locations are more difficult to gather in a timely and effective manner, and once gathered will typically require some ‘cleaning’ or standardization before the data sets can be cross-referenced and analyzed); and growth rate (fast-growing data can strain hardware and software resources).

In my experience, understanding the dataset is the first step towards finding an appropriate solution. Data-mining techniques I often use include prediction, clustering, outlier detection, local pattern discovery, and matrix decomposition.
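
As an illustration only, the short Python sketch below applies three of these techniques (clustering, outlier detection and matrix decomposition) to a synthetic feature matrix; the data and parameter choices are placeholders rather than values from any specific project.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                # stand-in for a real feature matrix

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)   # clustering
flags = IsolationForest(random_state=0).fit_predict(X)                    # -1 marks likely outliers
view2d = PCA(n_components=2).fit_transform(X)                             # matrix decomposition for a 2-D view

print(labels[:10], int((flags == -1).sum()), view2d.shape)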

Excellent interpersonal, verbal and written communication skills with sound negotiating and conflict resolution skills and a demonstrated level of tact and discretion in dealing with day to day operational matters 

Throughout my career spanning more than ten years, I have developed high-level oral and written communication skills, demonstrated over the past seven years in academia and previously over four years in technical manager roles.

In industry, I communicated with internal and external stakeholders, including staff, business partners, boards of managers, and government and community organisations, both verbally and in writing on a daily basis. My persuasive discussions with our business partners brought the company fruitful agreements. I have the ability to influence without the use of direct authority, demonstrated in my role as chief representative officer, in which I helped my team handle call center projects smoothly so that we grew from a small team into a branch of the company.

In my recent academic roles, my verbal and written skills have advanced to the formal and technical register required for publication in prestigious venues. My communication skills are now applied to sharpening ideas during weekly laboratory group meetings and in multi-field collaborations with other scientists (e.g., Westmead Hospital and the Woolcock Institute). I have delivered top-tier conference talks and published posters and journal papers through my Master of Science program at The University of Arizona in the U.S. and my Ph.D. work at The University of Sydney.

Ability to work both independently and as part of a team, taking initiative and exercising sound judgement in resolving matters that may arise as part of normal daily work 

Encouraging productive working relationships, I have shown myself to be an active team member with a proven track record of cultivating relationships and removing obstacles to build trust and enhance productivity, improving workplace communication and team relationships. I have worked without direct supervision, arranging my day and work tasks independently while remaining flexible and adaptable. I have built excellent relationships with managers and team members at both Petrolimex IT JSC and HTD Telecoms, and with several government officers with whom I remain in regular contact. I often participate in after-hours events with co-workers and maintain links with senior and junior colleagues.

During my postgraduate research, I have gained a thorough understanding of situations where deadlines need to be strictly observed, developing advanced skills in delegating and in planning work tasks by importance, along with the ability to work productively in a high-pressure, fast-paced environment. Sound judgement and common sense have been applied with professional expertise in my work in both industry and academia. I have found that sound judgement is often a key factor in publishing successful scientific work.

Demonstrated ability to organise and prioritise work, sometimes under pressure, with competing priorities and sometimes limited supervision *

In my industrial employment history, my work mostly dealt with pressure and tight timelines during periods of seasonal customer demand. For example, when we launched the call center service for the year's high school certificate exam, all of the technical systems and staff, as well as the marketing and customer service strategies, were under high pressure due to time constraints. As a technical manager, I effectively utilised key individuals and delegated portions of the tasks, ultimately quality-assuring all the work as it was delivered back to me. This approach enabled me to deliver the work on time, and we never had a technical interruption during this season. Furthermore, these seasonal high workloads often return most of the year's profit.

In academia, paper submission and grant proposal deadlines have been typical examples of high-pressure workloads. I have revised and edited manuscripts through numerous iterations, especially when more than four co-authors are involved, usually within a short time frame. I keep the work on my side for no more than two or three days per iteration. With this rule, I have never missed a deadline during my postgraduate research, either in the U.S. or in Australia.

Tertiary education in a computational area of study (computer science, computational statistics, engineering, physics, or biology, for example)

Throughout my postgraduate studies at The University of Arizona (USA) and The University of Sydney, I have developed strong analytical, research, leadership and conceptual skills, acquiring a wealth of knowledge of biomedical data and its statistical analysis.

Before I started my higher degree journey, I had gained in-depth experience with electronics for information and telecommunication network systems. In 2010, I started my MSc at the Higgins laboratory, which studies "from brains to robots, and everything in between" with Prof. Charles M. Higgins. There I built a solid background in neuroscience for a hybrid bio-robot that incorporates a living insect into a robot. My work ranged from hardware design for the robot to software design for the hybrid interface. This software recently became a conference report at EMBC in Chicago and is included as a main contribution in my MS thesis. The real-time signal processing of the neural data channel recorded from the dragonfly was the key point that helped me succeed in my oral thesis defense.

In 2013, I joined The University of Sydney for my Ph.D. with a prestigious scholarship from the Australian Prime Minister via Australia Awards (around $300 thousand). I have developed methods for medical data, such as human movement monitoring to detect freezing of gait in patients with Parkinson's disease, respiratory artefact removal in forced oscillation technique lung function tests, and automatic spike sorting for electrophysiological recordings (nEMG, ECG). I have also participated in data analysis with the NICU department of Westmead Hospital to find a way to detect sepsis in infants. Most of this work belongs to the data science area, integrating data mining, machine learning, engineering and statistics. I contributed a systematic feature selection scheme to improve subject-independent systems by taking advantage of new engineering and machine learning techniques. This work has recently been published in top-tier journals.

Desirable: An appreciation of modern data science methods and familiarity with statistical or machine learning techniques

I have studied several data science approaches in order to propose automated classification for biomedical data using modern machine learning techniques, targeting subject-independent experimental and deployment settings. By introducing feature selection techniques that use the mutual information between a feature and the target class and/or the discrimination level of features (clusterability), I have found the most salient features for most classification tasks.
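
As a minimal sketch of the mutual-information ranking step only, on synthetic data (keeping the top ten features is an arbitrary illustrative cut-off, not the actual scheme from my publications):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic stand-in for a labelled biomedical feature matrix.
X, y = make_classification(n_samples=300, n_features=40, n_informative=8, random_state=0)

# Estimate mutual information between each feature and the target class,
# then rank the features and keep the highest-scoring ones as salient candidates.
mi = mutual_info_classif(X, y, random_state=0)
top_features = np.argsort(mi)[::-1][:10]
print(top_features)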

In a point anomaly detection project, I proposed binary anomaly scores using thresholding information to indicate abnormal instances during human movement monitoring. Then, in collective anomaly detection projects, the anomaly scores incorporated statistical information such as quartiles and the interquartile range. Within- and between-session coefficients of variation have been typical examples of domain metrics used to assess the performance of my methods. Supervised and unsupervised classification methods, such as support vector machines and neural networks, have been evaluated across the numerous projects in which I have collaborated with hospitals and institutes to analyze data.
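
A minimal sketch of the quartile/IQR-based scoring idea on a synthetic signal; the 1.5 × IQR fence is the standard Tukey rule and is used here purely as an illustrative threshold:

import numpy as np

rng = np.random.default_rng(0)
signal = rng.normal(size=1000)
signal[::97] += 6.0                            # inject a few obvious anomalies

q1, q3 = np.percentile(signal, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Binary anomaly score: 1 where a sample falls outside the IQR fence.
anomaly_score = ((signal < lower) | (signal > upper)).astype(int)
print(anomaly_score.sum(), "samples flagged")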

--------------------------------------------------------------------------------------

Java is still top dog, and developers don’t seem to completely hate it, though it’s sure not their favorite. C and C++ remain popular, both with developers and employers. Python and C# also provide a good trade-off in terms of popularity with developers and employer demand.

Developers may love Lua and Lisp-y languages, but they don’t appear to have much mainstream use. Scala may be a better bet among functional/multiparadigm languages, at least in terms of employment opportunities.

Meanwhile, JavaScript remains wildly popular both in actual use and developer enthusiasm.

--------------------------------------------------------------------------------------

7k: The two concepts are somewhat orthogonal. A data-centric application is one where the database plays a key role: properties in the database may influence the code paths running in your application, the code is more generic, and all or most business logic is defined through database relations and constraints. OOP can be used to create a data-centric application.

Some of the large multi-tier architectures which people think of when they say OOP architecture implement business logic in code and just store the data in the database. However, it would be wrong to think Object Oriented design necessarily has to be a large business logic ridden system.

Say you have to implement message passing between two systems. One way (although a bad way) is to have each of the systems write the messages to the database and the other system read from the database every so often to pick up messages. This would be a data centric approach as there is very little code needed other than reading and writing data.

The same system could be implemented by having the systems open a socket connection to each other and send messages directly. In this way there is more code and less database access. This is the non-datacentric approach. Either of these could be implemented using OOP concepts.
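
A minimal Python sketch of the first, database-mediated approach (the in-memory SQLite database, the table layout and the function names are stand-ins chosen for illustration; a real deployment would poll a shared database on a timer):

import sqlite3

# Shared database acting as the message channel.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT, consumed INTEGER DEFAULT 0)")

def send(body):
    # System A 'sends' a message by writing a row.
    db.execute("INSERT INTO messages (body) VALUES (?)", (body,))
    db.commit()

def poll():
    # System B picks up unread messages on its next polling cycle.
    rows = db.execute("SELECT id, body FROM messages WHERE consumed = 0").fetchall()
    db.execute("UPDATE messages SET consumed = 1 WHERE consumed = 0")
    db.commit()
    return [body for _, body in rows]

send("spaceship fired")
print(poll())   # ['spaceship fired']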

Another example from my work: we implement servers for games. One type of server handles multi-player gameplay, so a user presses a button and a spaceship fires a missile at another player. This server is not data-centric; it is event-based. Another server stores the users' high scores, friend lists, etc.; this server is a thin wrapper over the database, which stores the scores and lists.

A data centric design is one where the application behavior is encapsulated by data. A simple example. Consider the following OOP class:

class Car {
public:
    void move(int x, int y);
private:
    int x, y;
};

This is an OOP representation of a car. Invoking the 'move' method will trigger the car to start moving. In other words, any side effects are triggered by invoking the class methods.

Here's the same class, but data centric:

class Car {
public:
    int x, y;
};

In order to get this car moving, I would "simply" change the values of x and y. In most programming languages changing members won't allow for the execution of logic, which is why data centricity often requires a framework.

In such a framework, logic is run upon the C, U and D of CRUD. Such a framework will provide the appropriate facilities to enable code insertion at any of these events; a rough sketch of what that can look like follows.
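
As a purely illustrative Python sketch (the class, the hook registry and the printed behaviour are all invented here, not any particular framework's API), a minimal data-centric "framework" can run registered hooks on the create, update and delete of a record, so that simply changing x and y triggers the move logic:

class Record:
    # Hooks registered against the Create / Update / Delete events.
    hooks = {"create": [], "update": [], "delete": []}

    def __init__(self, **fields):
        object.__setattr__(self, "fields", dict(fields))
        for hook in self.hooks["create"]:
            hook(self)

    def __setattr__(self, name, value):        # Update: changing data runs logic
        self.fields[name] = value
        for hook in self.hooks["update"]:
            hook(self, name, value)

    def delete(self):
        for hook in self.hooks["delete"]:
            hook(self)

# The data-centric Car: behaviour is attached to data changes, not to methods.
Record.hooks["update"].append(
    lambda rec, name, value: print("car moved:", name, "->", value)
)

car = Record(x=0, y=0)
car.x = 10    # prints "car moved: x -> 10" -- no move() method needed
car.y = 5     # prints "car moved: y -> 5"

This is the inversion described above: side effects come from changing the data rather than from calling methods on an object.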

Data centric design has many implications. For example, since an application state is effectively represented by its data, you can automatically persist the application. A well-written data centric application can be stored, stopped and restored from a database, and continue like it was never gone.

Data centric designs are a good match for traditional 3 tier web architectures. Web applications are typically driven by the contents of the backend database. That is why, when you close and reopen a dynamic webpage, it still looks the same (provided the data didn't change).

--------------------------------------------------------------------------------------------------------

What commercial software should a data scientist purchase?

If you are a data scientist, then there is very little use for pre-packaged, generally inflexible commercial tools.

Part of the reason OSS is so prevalent and useful in data science is that you will often need to combine and/or modify a procedure to fit the needs at hand -- and then deploy it without a bunch of lawyers and sales reps getting involved at every step.

Since data scientists are expected to be proficient programmers, you should be comfortable digging into the source code and adding functionality or making it more user friendly.

I've come close to recommending the purchase of non-free (as in GPL) software a couple of times, only to find that some industrious person has set up a project on Git that provides most if not all of the functionality of the commercial software. In the cases where it doesn't, it at least addresses the core issue and I can modify and extend from it. It's much easier to modify a prototype than to start from scratch.

Bottom line: be wary of commercial software for data science unless you've done your due diligence in the OSS space and can honestly say that you could not find any projects that could be modified to suit your needs. Commercial software is not only less flexible, but you're effectively in a business partnership with these folks, and that means your fates are somewhat intertwined (at least for the projects that depend on this software).

------------------

If you are looking for a technical package I would definitely go with Mathematica. It has functions for a very wide range of disciplines that come standard with it (no extra packages to buy). This allows you to expand into techniques as you grow without having to change platforms or buy extra packages. It also has a great interactive format in CDF with both a free and professional reader. The documentation is interactive as well which really helps a lot.

However, you have to be open to a little bit of structure, typing & chaining functions together, and things like this. Once you get started it comes very quickly.

-----------------------------------

A cool open source CAS is Sage...you might like it if you're into Mathematica

no, I need the quality, support, and infrastructure of a commercial solution. Mathematica is it

----------------------------------------

The question is what you expect the tool to serve. Maybe SAS or SPSS is the way to go, or even MATLAB. Each tool brings something to the table. Depending on what you need, one or more may fit the bill. Also look at your org's tech road map. If open source is a big push, your list may be different.

Tableau is a good tool (and easy to learn) for business users or people who are less technical. Even though it is costly ($2000 for a single-user annual license), it has a good free reader, and once a report is generated, users can leverage the reader and update it without depending on the license holder. The benefit holds as long as the reports are stable and standard in nature.

R has great graphic capabilities, and hence users who get involved in analytics may find it better suited. Similarly, Python can be a great tool, especially if you need a web-based solution for your users.

So think about your needs and strategy.

------------

I agree! Our company has been abandoning Tableau because of their ridiculous fees...clients would rather pay us to develop customized software that is more integrated with their data...and then they actually own the software too!

--------------------------------------------------------------------------------------------------------

Ten free, easy-to-use, and powerful tools to help you analyze and visualize data, analyze social networks, do optimization, search more efficiently, and solve your data analysis problems.

The top 10 data analysis tools for business. I picked these because of their free availability (for personal use) and ease of use (no coding required and intuitively designed).

Tableau Public: Tableau democratizes visualization in an elegantly simple and intuitive tool. It is exceptionally powerful in business because it communicates insights through data visualization. Although great alternatives exist, Tableau Public's million row limit provides a great playground for personal use and the free trial is more than long enough to get you hooked. In the analytics process, Tableau's visuals allow you to quickly investigate a hypothesis, sanity check your gut, and just go explore the data before embarking on a treacherous statistical journey.

OpenRefine: Formerly GoogleRefine, OpenRefine is a data cleaning software that allows you to get everything ready for analysis. What do I mean by that? Well, let's look at an example. Recently, I was cleaning up a database that included chemical names and noticed that rows had different spellings, capitalization, spaces, etc that made it very difficult for a computer to process.

Fortunately, OpenRefine contains a number of clustering algorithms (groups together similar entries) and makes quick work of an otherwise messy problem.

**Tip- Increase Java Heap Space to run large files (Google the tip for exact instructions!)

KNIME: KNIME allows you to manipulate, analyze, and model data in an incredibly intuitive way through visual programming. Essentially, rather than writing blocks of code, you drop nodes onto a canvas and drag connection points between activities. More importantly, KNIME can be extended to run R, Python, text mining, chemistry data, etc., which gives you the option to dabble in more advanced, code-driven analysis.

**TIP- Use "File Reader" instead of CSV reader for CSV files. Strange quirk of the software.

RapidMiner: Much like KNIME, RapidMiner operates through visual programming and is capable of manipulating, analyzing and modeling data. Most recently, RapidMiner won the KDnuggets software poll, demonstrating that data science does not need to be a counter-intuitive coding endeavor.

Google Fusion Tables: Meet Google Spreadsheets' cooler, larger, and much nerdier cousin. Google Fusion Tables is an incredible tool for data analysis, large-dataset visualization, and mapping. Not surprisingly, Google's incredible mapping software plays a big role in pushing this tool onto the list. Take, for instance, this map, which I made to look at oil production platforms in the Gulf of Mexico.

With just a quick upload, Google Fusion tables recognized the latitude and longitude data and got to work.

NodeXL: NodeXL is visualization and analysis software for networks and relationships. Think of the giant friendship maps you see that represent LinkedIn or Facebook connections. NodeXL takes that a step further by providing exact calculations. If you're looking for something a little less advanced, check out the node graph on Google Fusion Tables, or for a little more visualization try out Gephi.

Import.io: Web scraping and pulling information off of websites used to be something reserved for the nerds. Now with Import.io, everyone can harvest data from websites and forums. Simply highlight what you want and in a matter of minutes Import.io walks you through and "learns" what you are looking for. From there, Import.io will dig, scrape, and pull data for you to analyze or export.

Google Search Operators: Google is an undeniably powerful resource and search operators just take it a step up. Operators essentially allow you to quickly filter Google results to get to the most useful and relevant information. For instance, say you're looking for a Data science report published this year from ABC Consulting. If we presume that the report will be in PDF we can search

"Date Science Report" site:ABCConsulting.com Filetype:PDF

then underneath the search bar, use the "Search Tools" to limit the results to the past year. The operators can be even more useful for discovering new information or market research.

Solver: Solver is an optimization and linear programming tool in Excel that allows you to set constraints (don't spend more than this many dollars, be completed in that many days, etc.). Although advanced optimization may be better suited to another program (such as R's optim package), Solver will make quick work of a wide range of problems.
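
For anyone working outside Excel, the same kind of constrained problem can be sketched in Python; the cost coefficients and the "demand" and "time budget" constraints below are invented purely for illustration:

from scipy.optimize import linprog

# Minimize cost = 3*x1 + 5*x2 subject to:
#   x1 + 2*x2 >= 8    (meet a demand target)
#   x1 +   x2 <= 10   (stay within a time budget)
#   x1, x2 >= 0
result = linprog(
    c=[3, 5],
    A_ub=[[-1, -2], [1, 1]],    # ">=" rows are negated to fit linprog's "<=" form
    b_ub=[-8, 10],
    bounds=[(0, None), (0, None)],
)
print(result.x, result.fun)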

WolframAlpha: Wolfram Alpha's search engine is one of the web's hidden gems and helps to power Apple's Siri. Beyond snarky remarks, Wolfram Alpha is the nerdy Google: it provides detailed responses to technical searches and makes quick work of calculus homework. For business users, it presents information in charts and graphs, and is excellent for high-level pricing history, commodity information, and topic overviews.