Abstract
A transactional SQL database may be vertically scaled and either shared-everything or shared-disk, while an OLAP DBMS should be shared-nothing. Both should be more tightly coupled than a NoSQL database.
References and sources are specific to each section.
Basic things
Chip multiprocessor with data-driven multithreading, rather than superscalar. In the case of multiple-issue, have static placement like VLIW but dynamic issue like superscalar.
Direct execution of operating system code
Stored in flash memory. At least the kernel should be stored in NOR flash to allow for direct code execution (execute-in-place).
Data would still be stored in hard disk storage and loaded into RAM.
This could be Harvard architecture, see “Asymmetric multiprocessing” below.
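A minimal startup sketch of the execute-in-place idea, assuming hypothetical linker symbols: the kernel's code stays in NOR flash, and only the writable data is set up in RAM at boot.

```c
/* Hedged sketch: typical execute-in-place (XIP) startup, assuming hypothetical
 * linker symbols. Kernel code runs directly from NOR flash; only .data is
 * copied to RAM and .bss is zeroed, as in many embedded boot sequences. */
#include <stdint.h>
#include <string.h>

extern uint8_t _data_load_start[];  /* .data image in NOR flash (load address) */
extern uint8_t _data_start[];       /* .data run address in RAM                */
extern uint8_t _data_end[];
extern uint8_t _bss_start[];
extern uint8_t _bss_end[];

void kernel_main(void);             /* kernel entry, defined elsewhere         */

/* Reset entry point, itself executed in place from NOR flash. */
void _start(void)
{
    memcpy(_data_start, _data_load_start, (size_t)(_data_end - _data_start));
    memset(_bss_start, 0, (size_t)(_bss_end - _bss_start));
    kernel_main();                  /* kernel text stays in flash (XIP)        */
}
```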
Isolated I/O
Multi-path, multi-stage interconnection network to connect processors to memory, and a bus to connect processors to I/O. The former would consist of semiconductor crosspoint matrix switches.
Channel I/O
A backplane bus, to which are connected at least one system bus and four I/O expansion buses. One or two CPU sockets and multiple memory module slots per system bus. One I/O processor (IOP) connects the backplane bus to each I/O bus. While the CPU and its respective system bus may be on the same board as the backplane bus, each I/O processor and its respective I/O bus is on its own expansion card, so there are at least four expansion cards. May mix and match expansion cards for storage and networking.
Have DMA controllers for synchronous and isochronous devices and I/O processors for asynchronous devices. For the latter, longer-running I/O and endpoint devices with flexible latency requirements would go directly to memory, whereas shorter-running I/O and endpoint devices with more stringent latency requirements would go straight to the CPU; at least some of the I/O controllers, namely those used for terminals, would have built-in multiplexing. Use DMA controllers for FireWire cameras, recorders, and scanners and for ATA optical drives, and I/O channels for SCSI RAID and printers. May mix and match networking and storage processors: the former may be application-specific integrated circuits (ASICs) or reconfigurable arrays that offload the network stack, while the latter would be I/O processors (IOPs).
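A small sketch of the device-to-mechanism mapping just described, in C; the table entries and names are illustrative, not a real driver framework.

```c
/* Hedged sketch: which transfer mechanism each device class would use. */
#include <stdio.h>

enum io_mechanism { IO_DMA, IO_CHANNEL };   /* DMA controller vs. I/O processor (channel) */

struct device_class {
    const char *name;
    enum io_mechanism mech;
};

static const struct device_class device_map[] = {
    { "FireWire camera",   IO_DMA     },    /* isochronous: DMA controller        */
    { "FireWire recorder", IO_DMA     },
    { "FireWire scanner",  IO_DMA     },
    { "ATA optical drive", IO_DMA     },    /* synchronous block transfers        */
    { "SCSI RAID array",   IO_CHANNEL },    /* long-running, asynchronous         */
    { "SCSI printer",      IO_CHANNEL },
    { "Terminal mux",      IO_CHANNEL },    /* channel with built-in multiplexing */
};

int main(void)
{
    for (size_t i = 0; i < sizeof device_map / sizeof device_map[0]; i++)
        printf("%-18s -> %s\n", device_map[i].name,
               device_map[i].mech == IO_DMA ? "DMA controller" : "I/O channel (IOP)");
    return 0;
}
```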
Another idea is to make a mainframe from commodity hardware, which will be discussed in a later section.
Asymmetric multiprocessing
The host CPU accesses I/O and runs the operating system, which is stored in flash memory and written in Assembly, C, and/or Modula-3. Therefore, a modified Harvard architecture or super Harvard architecture might be good. Have one read-only data bus and one write-only data bus for I/O and one read/write data bus for memory access.
Additional processors share memory with the host CPU. These additional processors run applications and would be called the application processors.
Also have vector floating-point accelerators.
Could have separate I/O processors for networking and storage, but do have one host CPU for memory management, task scheduling, and process management. Maybe a coarse-grained reconfigurable array for protocol stack offload and a mainframe-style I/O processor (IOP) for I/O operations such as storage and printers. Save the rest for Alternative and Progressive computing.
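A hedged sketch of the host-to-application-processor hand-off over the shared memory, assuming a hypothetical single-producer/single-consumer task ring; the names and sizes are illustrative only.

```c
/* Hedged sketch: the host CPU schedules tasks, application processors run them. */
#include <stdatomic.h>
#include <stdint.h>

#define RING_SLOTS 64                       /* power of two for cheap wrap-around */

struct task {
    void (*entry)(void *);                  /* application function to run        */
    void *arg;
};

struct task_ring {
    _Atomic uint32_t head;                  /* written by the host CPU (producer) */
    _Atomic uint32_t tail;                  /* written by the application CPU     */
    struct task slot[RING_SLOTS];
};

/* Host CPU side: schedule a task onto an application processor. */
static int host_submit(struct task_ring *r, struct task t)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SLOTS)
        return -1;                          /* ring full                          */
    r->slot[head % RING_SLOTS] = t;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 0;
}

/* Application processor side: pull and run the next task. */
static void app_run_next(struct task_ring *r)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    if (tail == atomic_load_explicit(&r->head, memory_order_acquire))
        return;                             /* nothing pending                    */
    struct task t = r->slot[tail % RING_SLOTS];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    t.entry(t.arg);
}
```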
Composing a mainframe from commodity hardware
Switched fabric like InfiniBand connecting compute nodes together and to Fibre Channel storage, NVMe storage-class memory, and PCI expansion. Also Ethernet gateways, which may be load balancers. Also Myrinet clustering, with a role equivalent to the IBM Integrated Coupling Facility. The compute nodes would have host channel adapters (HCAs) and the storage targets would have target channel adapters (TCAs). The HCAs would be field-programmable gate arrays (FPGAs) or, better yet, coarse-grained reconfigurable arrays (CGRAs); the expansion TCAs would be application-specific integrated circuits; and the storage TCAs would be Systems-on-Chip. Compute nodes should be in modular chassis, while the storage servers would be 4U. The storage TCA Systems-on-Chip would be a Data Center Infrastructure Processing Unit (DCPU, DCIPU, or just DPU) and have a role equivalent to the I/O processors mentioned earlier (under “Channel I/O” and “Asymmetric multiprocessing”). If printers are used, they would be associated with the storage TCAs, which would be equivalent to I/O processors. File servers would also run on the storage target DCPUs, and a clustered filesystem would be implemented on them.
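As a small illustration, a compute node could enumerate its HCAs through the standard libibverbs API; this is only a sketch with abbreviated error handling.

```c
/* Minimal sketch: list the InfiniBand HCAs visible on a compute node.
 * Link with -libverbs. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int n = 0;
    struct ibv_device **devs = ibv_get_device_list(&n);
    if (!devs) { perror("ibv_get_device_list"); return 1; }

    for (int i = 0; i < n; i++)
        printf("HCA %d: %s\n", i, ibv_get_device_name(devs[i]));

    ibv_free_device_list(devs);
    return 0;
}
```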
A point-to-point interconnect like HyperTransport would be great for the compute nodes. Also have I/O virtualization like 3Leaf Aqua.
Unisys ClearPath is a great example. 3Leaf Dynamic Data Center was near-perfect.
Accelerated processing
Scalar fixed-precision central processing unit (CPU) issuing instructions straight to vector floating-point coprocessors, but both CPU and coprocessors load and store data straight from and to memory. Von Neumann architecture for CPU, Harvard architecture for coprocessors.
Also attached dataflow processor which may be a coarse-grained reconfigurable array.
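A hedged sketch of that issue path, assuming a hypothetical coprocessor register layout and MMIO address: the scalar CPU writes a command descriptor and rings a doorbell, while the vector coprocessor loads its operands directly from memory.

```c
/* Hedged sketch: CPU issues a vector command; both sides access memory directly. */
#include <stdint.h>

struct vec_cmd {
    uint8_t  opcode;              /* e.g. VADD, VMUL (illustrative)            */
    uint32_t length;              /* number of double-precision elements       */
    uint64_t src_a, src_b;        /* addresses of the input vectors            */
    uint64_t dst;                 /* address of the result vector              */
};

struct vec_regs {
    volatile uint64_t cmd_addr;   /* where the descriptor lives in memory      */
    volatile uint32_t doorbell;   /* write 1 to start                          */
    volatile uint32_t status;     /* 0 = busy, 1 = done                        */
};

#define VEC_COP ((struct vec_regs *)(uintptr_t)0x40001000u)  /* hypothetical MMIO base */

static void issue_vector_op(struct vec_cmd *cmd)
{
    VEC_COP->cmd_addr = (uint64_t)(uintptr_t)cmd;
    VEC_COP->doorbell = 1;        /* the CPU keeps executing scalar code here ...      */
}

static void wait_vector_done(void)
{
    while (VEC_COP->status == 0)
        ;                         /* ... and only synchronizes when the result is needed */
}
```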
Aggregated network
Two hubs connected to a bridge. Or multiple switches connected to a router, possibly with hubs connected to each switch.
Edge switches and core routers.
Bus for hubs, semiconductor matrix for switches.
Virtualized operating system
Multi-server microkernel, second- or third-generation, hosting modular monolithic kernels like GNU/Linux. Separate VMM for each virtual machine. Inspired by NOVA hypervisor and its derivatives like Bedrock and Cyberus Hedron SVP.
Another idea is Ubuntu for Xen Dom0 and Debian for DomU. The display server and window manager should be on Dom0. Desire offload of the Builder to DomB, which should run NanOS. Also offload XenStore to MirageOS, or alternatively OSv. Use Xen PVH for most virtual machines, for which prefer paravirtualization for disk and network, paravirtualization or better yet hardware acceleration for interrupts and timers, software virtualization for the emulated motherboard, and hardware virtualization for privileged instructions and page tables.
Hybrid dataflow and Von Neumann processor array
Dynamic dataflow scheduling between hyperblocks or epochs, control flow scheduling within hyperblocks or epochs and between basic blocks or clusters, and dataflow scheduling within basic blocks or clusters. Therefore, three levels, rather than the two levels of MT.Monsoon, data-driven multithreading, scheduled dataflow, Task Superscalar[2], TRIPS[0], or DySER[1].
Each hyperblock or epoch may have dataflow accelerators like DySER[1] rather than unified dataflow like TRIPS[0], as mentioned in Chapter 2.3.2 of [2].
That would be something like Tartan, Conservation Cores, or DySER[1] nested inside Data-Driven Multithreading or Task Superscalar.
This could be used as a data center processing unit (DPU) for infrastructure offload.
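A hedged sketch of the outer, dataflow level, with illustrative names: a hyperblock fires only once all of its inputs have arrived, in the style of data-driven multithreading, while execution inside the block follows ordinary control flow.

```c
/* Hedged sketch of the dataflow firing rule at hyperblock granularity. */
#include <stdatomic.h>
#include <stddef.h>

struct hyperblock {
    _Atomic int pending_inputs;            /* dataflow firing rule: run at zero  */
    void (*body)(void *frame);             /* control-flow code of the block     */
    void *frame;                           /* operand frame filled by tokens     */
    struct hyperblock **consumers;         /* blocks fed by this block's outputs */
    size_t n_consumers;
};

extern void ready_enqueue(struct hyperblock *hb);   /* scheduler queue (assumed) */

/* A producer delivers a token (an operand) to a consumer hyperblock. */
static void deliver_token(struct hyperblock *hb)
{
    if (atomic_fetch_sub(&hb->pending_inputs, 1) == 1)
        ready_enqueue(hb);                 /* last input arrived: block may fire  */
}

/* Executed by a worker when the scheduler dequeues a ready hyperblock. */
static void fire(struct hyperblock *hb)
{
    hb->body(hb->frame);                   /* von Neumann execution within block  */
    for (size_t i = 0; i < hb->n_consumers; i++)
        deliver_token(hb->consumers[i]);   /* outputs become tokens downstream    */
}
```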
[0]https://www.cs.utexas.edu/~cart/publications/dissertations/ramdas.pdf
[1]https://ieeexplore.ieee.org/document/6235947/
[2]https://www.tdx.cat/bitstream/handle/10803/277376/TYF1de1.pdf
MultiProcessor System-on-Chip
Inspired by HyperProcessor, MLCA, and Task Superscalar[0,1,2]. But it would be a massively parallel processor array.
The host processor issues instructions to the worker processors in the array. The workers pass data to one another as messages. The worker processors may have dataflow accelerators like Tartan or Conservation Cores.
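A hedged sketch of that command path, assuming a hypothetical per-worker mailbox layout: the host posts a task descriptor to a worker, and the worker forwards its result to the next worker as a message.

```c
/* Hedged sketch: host-to-worker mailboxes on the processor array. */
#include <stdint.h>

#define N_WORKERS 16

struct mailbox {
    volatile uint32_t full;        /* 1 while the worker has not consumed it    */
    volatile uint32_t opcode;      /* which kernel the worker should run        */
    volatile uint64_t src, dst;    /* message buffers in the array's address space */
    volatile uint32_t next_worker; /* where to send the result message          */
};

#define MBOX(w) ((struct mailbox *)(uintptr_t)(0x50000000u + (w) * 0x100u))  /* hypothetical */

/* Host processor: issue one stage of a pipeline to worker `w`. */
static void host_issue(unsigned w, uint32_t opcode,
                       uint64_t src, uint64_t dst, unsigned next_worker)
{
    struct mailbox *m = MBOX(w);
    while (m->full)
        ;                          /* wait for the worker to drain its mailbox  */
    m->opcode = opcode;
    m->src = src;
    m->dst = dst;
    m->next_worker = next_worker;  /* worker passes its output on as a message  */
    m->full = 1;
}
```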
[0]http://www.eecg.toronto.edu/~tsa/papers/wasp03.pdf
[1]https://tspace.library.utoronto.ca/bitstream/1807/11134/1/Capalija_Davor_200806_MASc_thesis.pdf
[2]https://www.tdx.cat/bitstream/handle/10803/277376/TYF1de1.pdf
Distributed timesharing system
One or more workstations, each with an X server and/or a Wayland compositor. Multiple remote devices, each for different application software. The command-line terminal certainly should be on the display-server workstations.
A multi-server microkernel would be great, ideally distributed more like Amoeba than Sprite. Desire a combination of object-based and page-based distributed shared memory.
The terminals or workstations may have an instant-on operating system stored in flash memory.
LightDM would be better than SDDM (Simple Desktop Display Manager) or LXDM, either of which would be better than XDM (X Display Manager). The distributed shared memory's coherence should be directory-based rather than snooping.
Have two or more additional devices for file servers.
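A hedged sketch of the directory-based, page-granularity coherence mentioned above; the message-sending helpers are assumed, not real.

```c
/* Hedged sketch: one directory entry per shared page, so coherence traffic
 * goes point-to-point to the nodes listed in the entry instead of being
 * broadcast and snooped. */
#include <stdint.h>

enum page_state { PAGE_INVALID, PAGE_SHARED, PAGE_EXCLUSIVE };

struct dir_entry {
    enum page_state state;
    uint16_t owner;               /* node holding the writable copy            */
    uint64_t sharers;             /* bitmask of nodes with read-only copies    */
};

extern void send_invalidate(uint16_t node, uint64_t page);   /* assumed helper */
extern void send_copy(uint16_t node, uint64_t page);         /* assumed helper */

/* Home-node handler for a write fault on `page` raised by node `writer`. */
static void handle_write_fault(struct dir_entry *d, uint64_t page, uint16_t writer)
{
    for (uint16_t n = 0; n < 64; n++)
        if (((d->sharers >> n) & 1) && n != writer)
            send_invalidate(n, page);       /* point-to-point, no broadcast     */
    send_copy(writer, page);
    d->state = PAGE_EXCLUSIVE;
    d->owner = writer;
    d->sharers = 1ull << writer;
}
```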
MIMD distributed parallel computer
Shared-disk: SCSI HDD arrays would be shared while NVMe flash would be local.
Message passing between nodes. Each node would have a couple or several processor sockets sharing memory. Desire interleaved memory in a dancehall configuration for each node. Desire hybrid CUDA/OpenMP/MPI, funneled, with full load balancing. A RISC processor like DLX/MIPS/Gullwing would be desired; otherwise AMD Opteron or EPYC would be great, or alternatively an ARM processor. Desire interleaved multithreading to avoid stalls.
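A minimal sketch of the funneled hybrid style mentioned above, using standard MPI and OpenMP: MPI between nodes, OpenMP threads within a node, and only the main thread making MPI calls.

```c
/* Hedged sketch: funneled MPI + OpenMP. Compile with e.g. `mpicc -fopenmp`. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local = 0.0;
    #pragma omp parallel for reduction(+:local)   /* threads within the node   */
    for (int i = rank; i < 1000000; i += nranks)
        local += 1.0 / (1.0 + i);

    double total = 0.0;                           /* only the main thread calls MPI */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum = %f across %d ranks\n", total, nranks);

    MPI_Finalize();
    return 0;
}
```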
Desire byte-addressable NVMe flash and block-addressable SCSI HDDs. Prefer RAID4, RAID5, RAID6, and RAID10.
Dedicated servers for distributed task scheduling, issuing tasks to the compute nodes. Compute nodes pass data to one another as messages. Have one server to deploy and manage system resources and infrastructure services, partition them into domains, grant users and groups access to them, and delegate them to particular applications.
Could have a separate control network of star, snowflake, Clos, or fat-tree topology for task scheduling, while the data network, a structured partial mesh, would be for passing data.
Transaction processing, business intelligence, and Big Data analytics
Two tiers: the upper one an OLTP or HTAP database for processing refined, structured data, backed by a lower tier which would be a data lake or lakehouse for big data analytics of raw, unstructured, and semi-structured data.
Desire a database like LeanXcale; another idea is Greenplum. Hadoop and NoSQL for the backing store. Desire a dual-model design with a key-value engine written in Go for transaction processing and a relational columnar engine written in C, C++, Modula-3, or Rust for analytics.
Massively parallel processing (MPP) for data processing, ideally with a coarse-grained reconfigurable array for storage offload. Better yet, elastic parallel processing like Snowflake. Desire hybrid transactional/analytical processing (HTAP).
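A hedged sketch of the dual-model idea, written in C for illustration (the note above suggests Go for the key-value side): the same logical table kept as keyed rows for transactional point access and as column arrays for analytical scans.

```c
/* Hedged sketch: row layout for OLTP, columnar layout for OLAP, same data. */
#include <stdint.h>
#include <stddef.h>

/* Transactional side: one record per key, as a key-value engine would store it. */
struct order_row {
    uint64_t order_id;            /* key */
    uint32_t customer_id;
    uint32_t item_id;
    double   amount;
};

/* Analytical side: the same data as parallel column arrays, friendly to
 * scans, vectorized filters, and aggregates over one attribute at a time. */
struct order_columns {
    size_t    n;
    uint64_t *order_id;
    uint32_t *customer_id;
    uint32_t *item_id;
    double   *amount;
};

/* Example analytical query: total revenue, touching only the amount column. */
static double total_revenue(const struct order_columns *c)
{
    double sum = 0.0;
    for (size_t i = 0; i < c->n; i++)
        sum += c->amount[i];
    return sum;
}
```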
Only an OLTP engine like VoltDB or CockroachDB should run in a Docker or Kubernetes container. OLAP should be native and bare-metal.
Rather than HBase on HDFS, have a Java wide-column store on top of a C/C++ distributed filesystem.
Snowflake schema would be better than star schema. Perhaps better yet would be a fact constellation.
Ideals
Interrupt-driven, port-mapped I/O for USB mouse and keyboard. Direct memory access for FireWire cameras and recorders. Printers and scanners may use either programmed or interrupt-driven I/O, in which case they would be USB; direct memory access, in which case they would be FireWire; or channel I/O, in which case they would be SCSI.
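A hedged sketch of the underlying access primitives on x86: port-mapped I/O goes through the separate I/O address space with in/out instructions, while memory-mapped I/O is an ordinary volatile load or store. A real driver would get its port numbers and MMIO addresses from bus enumeration.

```c
/* Hedged sketch: port-mapped vs. memory-mapped register access on x86. */
#include <stdint.h>

/* Port-mapped: reads and writes in the dedicated I/O address space. */
static inline uint8_t inb(uint16_t port)
{
    uint8_t v;
    __asm__ volatile ("inb %1, %0" : "=a"(v) : "Nd"(port));
    return v;
}

static inline void outb(uint16_t port, uint8_t v)
{
    __asm__ volatile ("outb %0, %1" : : "a"(v), "Nd"(port));
}

/* Memory-mapped: a device register that is simply a volatile memory access. */
static inline uint32_t mmio_read32(uintptr_t addr)
{
    return *(volatile uint32_t *)addr;
}

static inline void mmio_write32(uintptr_t addr, uint32_t v)
{
    *(volatile uint32_t *)addr = v;
}
```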
Stack machines may have a dedicated return stack in addition to the data stack, and should have top-of-stack and next-on-stack registers, a couple of 3-port register banks each with two read ports and one write port, and register-style load/store instructions. Stack machines should have a power-of-two number of stacks, i.e. two, four, or (less likely) eight.
Only 0- and 1-operand machines should use stacks for expression evaluation. Even general-purpose register machines certainly should use stacks to store return addresses and possibly subroutine parameters and local variables. No general-purpose register machine, load/store or register-memory, should use stacks for expression evaluation.
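A hedged sketch of a tiny 0-operand stack machine with those features: separate data and return stacks, and the top of stack (TOS) and next on stack (NOS) held in registers rather than in memory. The instruction set is illustrative.

```c
/* Hedged sketch: a minimal 0-operand stack machine interpreter. */
#include <stdint.h>

enum op { OP_LIT, OP_ADD, OP_CALL, OP_RET, OP_HALT };

struct insn { enum op op; int32_t operand; };

static int32_t run(const struct insn *code)
{
    int32_t tos = 0, nos = 0;          /* top-of-stack and next-on-stack registers */
    int32_t dstack[64]; int dsp = 0;   /* rest of the data stack, in memory        */
    int32_t rstack[64]; int rsp = 0;   /* dedicated return stack                   */
    int32_t pc = 0;

    for (;;) {
        struct insn i = code[pc++];
        switch (i.op) {
        case OP_LIT:  dstack[dsp++] = nos; nos = tos; tos = i.operand; break;
        case OP_ADD:  tos = nos + tos; nos = dstack[--dsp];            break;
        case OP_CALL: rstack[rsp++] = pc; pc = i.operand;              break;
        case OP_RET:  pc = rstack[--rsp];                              break;
        case OP_HALT: return tos;
        }
    }
}
```

For example, the program {OP_LIT 2, OP_LIT 3, OP_ADD, OP_HALT} leaves 5 in TOS, and calls and returns never touch the data stack because return addresses live on the separate return stack.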
USB should be port-mapped, especially keyboard and mouse. While flash memory should have memory-mapped read and port-mapped write, HDD storage should be channel I/O.