Sources of data
The connection of data sources to analytical tools must meet two basic criteria:
- Stable data transfer from the place of origin: an interface that can connect to different information systems and web services (sometimes data must be extracted from web pages), identify the basic structure of the data, and store it.
- Processing of the data and its preparation in a format suitable for further analysis: identifying the data type, format, encoding, and other parameters of individual items, and converting them into a machine-processable form suitable for analysis and machine learning.
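The second criterion can be illustrated with a minimal Python sketch. It assumes that items arrive as raw bytes, that a short list of candidate encodings is acceptable, and that only a few date formats occur; the function names `decode_bytes` and `infer_value` are hypothetical and are not part of the interface described here.

```python
from datetime import datetime

# Candidate encodings and date formats are assumptions for illustration only.
CANDIDATE_ENCODINGS = ("utf-8", "cp1250", "latin-1")
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%Y-%m-%d %H:%M:%S")


def decode_bytes(raw: bytes, encodings=CANDIDATE_ENCODINGS) -> str:
    """Try each candidate encoding in order and return the first clean decode."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: keep the item, marking undecodable bytes.
    return raw.decode("utf-8", errors="replace")


def infer_value(text: str):
    """Convert a raw string to int, float, or datetime; otherwise return it stripped."""
    text = text.strip()
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            pass
    return text


# Hypothetical raw record as it might arrive from a source system.
raw_record = {"city": "Průhonice".encode("cp1250"), "orders": b"42", "since": b"2015-03-01"}
record = {key: infer_value(decode_bytes(value)) for key, value in raw_record.items()}
```

Trying encodings in a fixed order is a simplification: a single-byte encoding such as `latin-1` accepts any byte sequence, so it only makes sense at the end of the list.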
Our interface can be connected to existing internal and external data sources and includes configurable modules that enable scraping (extracting data) from the data sources listed below in multiple markup languages.
- Internal sources - information systems (CRM, storage system, accounting software, project management systems, DMS), server logs, communication content (e-mail, SMS), attendance records, and so on.
- External sources - visitor statistics (Google Analytics), advertising systems (Google AdWords, Sklik, etc.), public databases (databases of companies) and websites.
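As an illustration of scraping from a website, the sketch below pulls table rows out of an HTML page using only the Python standard library. It is a simplified example, not the configurable modules mentioned above; the class `TableCellParser`, the function `scrape_rows`, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows, each a list of cell strings
        self._row = None       # row currently being read, or None
        self._in_cell = False  # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def scrape_rows(html: str) -> list:
    """Return the table rows found in an HTML string as lists of cell texts."""
    parser = TableCellParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical page fragment from a public company database.
sample = (
    "<table>"
    "<tr><td>Acme s.r.o.</td><td>Brno</td></tr>"
    "<tr><td>Beta a.s.</td><td>Praha</td></tr>"
    "</table>"
)
rows = scrape_rows(sample)
```

Real pages are rarely this regular, so a production scraper would also need to handle nested tables, header cells and malformed markup.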
Types of data obtained
- Structured data are stored in precisely defined and described data fields. When people talk about databases, they mostly have structured data in mind. A typical example is a customer database in which each record consists of a name, address, account number, and so on. Structured data have a clear model and a description, and can therefore be stored, processed and analyzed easily.
- Unstructured data, by contrast, do not have a precisely defined structure. This category includes all data without a stable, firmly defined structure, such as images, videos, websites, or the content of e-mails and other communications. Unstructured data make up the vast majority of generated data, and the term "big data" refers primarily to them.
- Semi-structured data fall between the two categories above: they have some structure but no strict model. An example is a complete e-mail message, which consists of unstructured data (text content and attachments) and structured content (the mail header, with precisely defined fields such as sender, recipient, and date and time of sending).
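The e-mail example can be sketched with Python's standard `email` package, which separates the structured header fields from the unstructured body. The message text, the addresses and the function name `split_email` are invented for illustration, and the sketch assumes a simple single-part message without attachments.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical single-part message; the addresses are placeholders.
RAW_MESSAGE = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Quarterly report\n"
    "Date: Mon, 02 Mar 2015 09:30:00 +0100\n"
    "\n"
    "Hi Bob, the quarterly report is ready for review.\n"
)


def split_email(raw: str) -> dict:
    """Split a raw message into structured header fields and unstructured body text."""
    msg = message_from_string(raw)
    return {
        # Structured part: header fields with a fixed, well-defined meaning.
        "sender": msg["From"],
        "recipient": msg["To"],
        "subject": msg["Subject"],
        "sent_at": parsedate_to_datetime(msg["Date"]),
        # Unstructured part: free text (a plain string for single-part messages).
        "body": msg.get_payload(),
    }


parsed = split_email(RAW_MESSAGE)
```

The header fields can go straight into database columns, while the body would still need text analysis before it is useful for further processing.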